# My Data Science Blogs

## July 20, 2018

### Simplify Advertising Analytics Click Prediction with Databricks Unified Analytics Platform

Advertising teams want to analyze their immense stores and varieties of data, which requires a scalable, extensible, and elastic platform.  Advanced analytics, including but not limited to classification, clustering, recognition, prediction, and recommendations, allow these organizations to gain deeper insights from their data and drive business outcomes. As data of various types grows in volume, Apache Spark provides an API and distributed compute engine to process data easily and in parallel, thereby decreasing time to value.  The Databricks Unified Analytics Platform provides an optimized, managed cloud service around Spark, and allows for self-service provisioning of computing resources and a collaborative workspace.

Let’s look at a concrete example with the Click-Through Rate Prediction dataset of ad impressions and clicks from the data science website Kaggle.  The goal of this workflow is to create a machine learning model that, given a new ad impression, predicts whether or not there will be a click.

To build our advanced analytics workflow, let’s focus on the three main steps:

• ETL
• Data Exploration, for example, using SQL
• Advanced Analytics / Machine Learning

## Building the ETL process for the advertising logs

First, we download the dataset to our blob storage, either AWS S3 or Microsoft Azure Blob storage.  Once we have the data in blob storage, we can read it into Spark.

%scala
// Read the downloaded CSV files into a Spark DataFrame
// (the storage path here is illustrative)
val df = spark.read
.option("header", true)
.option("inferSchema", true)
.csv("/mnt/adtech/impression/csv/train.csv/")


This creates a Spark DataFrame – an immutable, tabular, distributed data structure on our Spark cluster. The inferred schema can be seen using .printSchema().

%scala
df.printSchema()

# Output
id: decimal(20,0)
click: integer
hour: integer
C1: integer
banner_pos: integer
site_id: string
site_domain: string
site_category: string
app_id: string
app_domain: string
app_category: string
device_id: string
device_ip: string
device_model: string
device_type: integer
device_conn_type: integer
C14: integer
C15: integer
C16: integer
C17: integer
C18: integer
C19: integer
C20: integer
C21: integer


To optimize the query performance from DBFS, we can convert the CSV files into Parquet format.  Parquet is a columnar file format that allows for efficient querying of big data with Spark SQL or most MPP query engines.  For more information on how Spark is optimized for Parquet, refer to How Apache Spark performs a fast count using the Parquet metadata.

%scala
// Create Parquet files from our Spark DataFrame
// (the storage path here is illustrative)
df.coalesce(4)
.write
.mode("overwrite")
.parquet("/mnt/adtech/impression/parquet/train.csv/")


## Explore Advertising Logs with Spark SQL

Now we can create a Spark SQL temporary view called impression on our Parquet files.  To showcase the flexibility of Databricks notebooks, we can specify to use Python (instead of Scala) in another cell within our notebook.

%python
# Create Spark DataFrame reading the recently created Parquet files
# (the storage path here is illustrative)
impression = spark.read.parquet("/mnt/adtech/impression/parquet/train.csv/")

# Create temporary view
impression.createOrReplaceTempView("impression")


We can now explore our data with the familiar and ubiquitous SQL language. Databricks and Spark support Scala, Python, R, and SQL. The following code snippets calculate the click-through rate (CTR) by banner position and hour of day.

%sql
-- Calculate CTR by Banner Position
select banner_pos,
sum(case when click = 1 then 1 else 0 end) / (count(1) * 1.0) as CTR
from impression
group by 1
order by 1


%sql
-- Calculate CTR by Hour of the day
select substr(hour, 7) as hour,
sum(case when click = 1 then 1 else 0 end) / (count(1) * 1.0) as CTR
from impression
group by 1
order by 1
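
The same grouped-aggregation logic can be sketched outside Spark; here is a minimal pure-Python equivalent of the CTR-by-banner-position query, using a handful of hypothetical impression records:

```python
from collections import defaultdict

# Hypothetical impression records: (banner_pos, click)
impressions = [
    (0, 1), (0, 0), (0, 0), (0, 0),
    (1, 1), (1, 1), (1, 0), (1, 0),
]

# Mirrors: sum(click) / count(*) grouped by banner_pos
clicks = defaultdict(int)
counts = defaultdict(int)
for banner_pos, click in impressions:
    clicks[banner_pos] += click
    counts[banner_pos] += 1

ctr = {pos: clicks[pos] / counts[pos] for pos in sorted(counts)}
print(ctr)  # {0: 0.25, 1: 0.5}
```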


## Predict the Clicks

Once we have familiarized ourselves with our data, we can proceed to the machine learning phase, where we convert our data into features for input to a machine learning algorithm and produce a trained model with which we can predict.  Because Spark MLlib algorithms take a column of feature vectors of doubles as input, a typical feature engineering workflow includes:

• Identifying numeric and categorical features
• String indexing
• Assembling them all into a sparse vector

The following code snippet is an example of a feature engineering workflow.

# Include PySpark Feature Engineering methods
from pyspark.ml.feature import StringIndexer, VectorAssembler

# All of the columns (string or integer) are categorical columns;
# strColsCount and intColsCount hold (column, distinct-count) pairs
# computed earlier in the notebook
maxBins = 70
categorical = [c[0] for c in strColsCount if c[1] <= maxBins]
categorical += [c[0] for c in intColsCount if c[1] <= maxBins]

# remove 'click' which we are trying to predict
categorical.remove('click')

# Apply string indexer to all of the categorical columns
#  and add _idx to the column name to indicate the index of the
#  categorical value
stringIndexers = [StringIndexer(inputCol=c, outputCol=c + "_idx")
                  for c in categorical]

# Use the indexed columns as the input to the VectorAssembler,
#   with the output being our features
assemblerInputs = [c + "_idx" for c in categorical]
vectorAssembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")

# The [click] column is our label
labelStringIndexer = StringIndexer(inputCol = "click", outputCol = "label")

# The stages of our ML pipeline
stages = stringIndexers + [vectorAssembler, labelStringIndexer]


In our use of the GBTClassifier, you may have noticed that while we use a string indexer, we are not applying a One-Hot Encoder (OHE).

When using StringIndexer, categorical features are kept as k-ary categorical features: a tree node will test whether feature X has a value in {subset of categories}. With StringIndexer + OHE, your categorical features are turned into many binary features, and a tree node will test whether feature X = category a vs. all the other categories (a one-vs-rest test).

When using only StringIndexer, the benefits include:

• There are fewer features to choose from
• Each node’s test is more expressive than with binary 1-vs-rest features

Therefore, for tree-based methods it is preferable not to use OHE, as it yields a less expressive test and takes up more space. But for non-tree-based algorithms such as linear regression, you must use OHE, or else the model will impose a false and misleading ordering on the categories.
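The difference between the two encodings can be seen in a small pure-Python sketch (the helper functions here are illustrative, not Spark APIs; like StringIndexer, the index is assigned by descending label frequency):

```python
from collections import Counter

def string_index(values):
    # StringIndexer-style: the most frequent category gets index 0.0
    order = [v for v, _ in Counter(values).most_common()]
    mapping = {v: float(i) for i, v in enumerate(order)}
    return [mapping[v] for v in values]

def one_hot(indexed, num_categories):
    # OHE-style: each k-ary value becomes k binary features
    return [[1.0 if int(x) == j else 0.0 for j in range(num_categories)]
            for x in indexed]

devices = ["phone", "tablet", "phone", "desktop", "phone", "tablet"]
idx = string_index(devices)
print(idx)                 # [0.0, 1.0, 0.0, 2.0, 0.0, 1.0] -- one k-ary feature
print(one_hot(idx, 3)[3])  # [0.0, 0.0, 1.0] -- k binary features per row
```

A tree can split the single indexed column on any subset of {0, 1, 2}, whereas the one-hot columns only admit one-vs-rest splits.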

Thanks to Brooke Wenig and Joseph Bradley for contributing to this post!

With our workflow created, we can create our ML pipeline.

from pyspark.ml import Pipeline

# Create our pipeline
pipeline = Pipeline(stages = stages)

# create transformer to add features
featurizer = pipeline.fit(impression)

# dataframe with feature and intermediate
#   transformation columns appended
featurizedImpressions = featurizer.transform(impression)


Using display(featurizedImpressions.select('features', 'label')), we can visualize our featurized dataset.

Next, we will split our featurized dataset into training and test datasets via .randomSplit().

train, test = featurizedImpressions \
.select(["label", "features"]) \
.randomSplit([0.7, 0.3], 42)


Next, we will train, predict, and evaluate our model using the GBTClassifier.  As a side note, a good primer on solving binary classification problems with Spark MLlib is Susan Li’s Machine Learning with PySpark and MLlib — Solving a Binary Classification Problem.

from pyspark.ml.classification import GBTClassifier

# Train our GBTClassifier model
classifier = GBTClassifier(labelCol="label", featuresCol="features", maxBins=maxBins, maxDepth=10, maxIter=10)
model = classifier.fit(train)

# Execute our predictions
predictions = model.transform(test)

# Evaluate our GBTClassifier model using
#   BinaryClassificationEvaluator()
from pyspark.ml.evaluation import BinaryClassificationEvaluator
ev = BinaryClassificationEvaluator(
rawPredictionCol="rawPrediction", metricName="areaUnderROC")
print(ev.evaluate(predictions))

# Output
0.7112027059


With our predictions, we can evaluate the model against an evaluation metric such as area under the ROC curve (AUC) and view features by importance; in this case the AUC is 0.7112027059.
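areaUnderROC has a useful interpretation: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal sketch of that computation on hypothetical scores:

```python
def auc(scores, labels):
    # Probability that a random positive outranks a random negative,
    # computed by brute-force pairing (ties count as half a win)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3]  # hypothetical model scores
labels = [1, 0, 1, 0]          # true click labels
print(auc(scores, labels))  # 0.75
```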

## Summary

We demonstrated how you can simplify your advertising analytics, including click prediction, using the Databricks Unified Analytics Platform (UAP). With Databricks UAP, we were quickly able to execute the three components for click prediction: ETL, data exploration, and machine learning.  We've illustrated how you can run an advanced analytics workflow of ETL, analysis, and machine learning pipelines all within a few Databricks notebooks.

By removing the data engineering complexities commonly associated with such data pipelines, the Databricks Unified Analytics Platform allows different sets of users, i.e. data engineers, data analysts, and data scientists, to easily work together.  Try out this notebook series in Databricks today!

--

The post Simplify Advertising Analytics Click Prediction with Databricks Unified Analytics Platform appeared first on Databricks.

### Distilled News

In this tutorial, you'll use a machine learning algorithm to implement a real-life problem in Python. You will learn how to read multiple text files in Python, extract labels, use dataframes and a lot more!
It may be the era of Deep Learning, where it really doesn't matter how big your dataset is or how many columns you've got. Still, a lot of Kaggle competition winners and data scientists emphasize that the one thing that could put you at the top of a competition leaderboard is 'Feature Engineering'. Irrespective of how sophisticated your model is, good features will always help your machine learning model building process more than anything else.
The internet is filled with tutorials to get started with Deep Learning. You can choose to get started with the superb Stanford courses CS221 or CS224, the Fast AI courses or the Deep Learning AI courses if you are an absolute beginner. All except Deep Learning AI are free and accessible from the comfort of your home. All you need is a good computer (preferably with an Nvidia GPU) and you are good to take your first steps into Deep Learning. This blog is, however, not addressing the absolute beginner. Once you have a bit of intuition about how Deep Learning algorithms work, you might want to understand how things work under the hood. While most work in Deep Learning (the 10% of work apart from data munging, which makes up 90% of the total) is adding layers like Conv2d, changing hyperparameters in different types of optimization strategies like ADAM, or using batchnorm and other techniques just by writing one-line commands in Python (thanks to the awesome frameworks available), a lot of people might feel a deep desire to know what happens behind the scenes. This is a list of resources which might help you get to know what happens under the hood when you (say) add a conv2d layer or call T.grad in Theano.
Call centre performance can be expressed by the Grade of Service, which is the percentage of calls that are answered within a specific time, for example, 90% of calls are answered within 30 seconds. This Grade of Service depends on the volume of calls made to the centre, the number of available agents and the time it takes to process a contact. Although working in a call centre can be chaotic, the Erlang C formula describes the relationship between the Grade of Service and these variables quite accurately.
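The Erlang C relationship described above fits in a few lines of Python; the following is a hedged sketch with illustrative numbers, not code from the article:

```python
from math import exp, factorial

def erlang_c(agents, load):
    # Erlang C: probability that an arriving call must wait.
    # load = arrival rate * average handling time, in Erlangs.
    numer = load ** agents / factorial(agents)
    denom = numer + (1 - load / agents) * sum(
        load ** k / factorial(k) for k in range(agents))
    return numer / denom

def grade_of_service(agents, load, target_s, aht_s):
    # Fraction of calls answered within target_s seconds
    pw = erlang_c(agents, load)
    return 1 - pw * exp(-(agents - load) * target_s / aht_s)

# Illustrative: 10 Erlangs of traffic, 12 agents, 30 s target, 180 s AHT
gos = grade_of_service(12, 10.0, 30.0, 180.0)
print(round(gos, 3))
```

Adding agents raises the Grade of Service, since both the probability of waiting and the expected wait shrink.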
To automate the process of model selection and evaluate the results with visualization, I have added some functions to my personal library, and today I'm sharing the code with you. I run them to evaluate and compare machine learning models as quickly and easily as possible. Currently, they are designed to evaluate binary classification model results.
Here’s your guide to pick the right web scraping tool for your specific data needs.
In previous posts, we covered how to run a Monte Carlo simulation and how to visualize the results. Today, we will wrap that work into a Shiny app wherein a user can build a custom portfolio, and then choose a number of simulations to run and a number of months to simulate into the future.

### The end of errors in ANOVA reporting

(This article was first published on Dominique Makowski, and kindly contributed to R-bloggers)

Psychology is still (unfortunately) massively using analysis of variance (ANOVA). Despite its relative simplicity, I am very often confronted with errors in its reporting, for instance in students' theses or manuscripts. Beyond incomplete, incomprehensible or just plain wrong reporting, one can find a tremendous amount of genuine errors (that could influence the results and their interpretation), even in published papers! (See the excellent statcheck to quickly check the stats of a paper.) This error proneness can be at least partially explained by the fact that copy/pasting the (appropriate) values from any statistical software and formatting them textually is a very annoying process.

How to end it?

We believe that this could be solved (at least, partially) by the default implementation of current best practices of statistical reporting. A tool that automatically transforms a statistical result into a copy/pastable text. Of course, this automation cannot be suitable for each and every advanced usage, but would probably be satisfying for a substantial proportion of use cases. Implementing this unified, end-user oriented pipeline is the goal of the psycho package.

# Fit an anova

Let’s start by doing a traditional ANOVA with Adjusting (the ability to flexibly regulate one’s emotions) as the dependent variable, and Sex and Salary as categorical predictors.

# devtools::install_github("neuropsychology/psycho.R")  # Install the latest psycho version
library(psycho)

df <- psycho::affective  # load a dataset available in the psycho package

aov_results <- aov(Adjusting ~ Sex * Salary, data=df)  # Fit the ANOVA
summary(aov_results)  # Inspect the results

             Df Sum Sq Mean Sq F value   Pr(>F)
Sex           1   35.9   35.94  18.162 2.25e-05 ***
Salary        2    9.4    4.70   2.376   0.0936 .
Sex:Salary    2    3.0    1.51   0.761   0.4674
Residuals   859 1699.9    1.98
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
386 observations deleted due to missingness


# APA formatted output

The psycho package includes a simple function, analyze(), that can be applied to an ANOVA object to format its content.

analyze(aov_results)

- The effect of Sex is significant (F(1, 859) = 18.16, p < .001) and can be considered as small (Partial Omega-squared = 0.019).
- The effect of Salary is not significant (F(2, 859) = 2.38, p = 0.09°) and can be considered as very small (Partial Omega-squared = 0.0032).
- The interaction between Sex and Salary is not significant (F(2, 859) = 0.76, p > .1) and can be considered as very small (Partial Omega-squared = 0).


It formats the results, computes the partial omega-squared as an index of effect size (better than eta², see Levine et al. 2002, Pierce et al. 2004) as well as its interpretation, and presents the results in an APA-compatible way.
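The mechanical part of this formatting can be sketched in a few lines; the following Python illustration (not the psycho package itself) reproduces the partial omega-squared and the APA string for the Sex effect from the table above, taking N = 865 complete observations:

```python
def partial_omega_squared(ss_effect, df_effect, ms_error, n_obs):
    # Partial omega-squared for a between-subjects ANOVA effect
    return ((ss_effect - df_effect * ms_error) /
            (ss_effect + (n_obs - df_effect) * ms_error))

def apa_f(effect, f, df1, df2, p):
    # Format an F test in APA style
    p_str = "p < .001" if p < 0.001 else "p = %.2f" % p
    signif = "significant" if p < 0.05 else "not significant"
    return "The effect of %s is %s (F(%d, %d) = %.2f, %s)." % (
        effect, signif, df1, df2, f, p_str)

omega = partial_omega_squared(35.94, 1, 1.98, 865)
print(round(omega, 3))  # 0.019
print(apa_f("Sex", 18.16, 1, 859, 2.25e-05))
```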

# Correlations, t-tests, regressions…

Note that the analyze() method also exists for other statistical procedures, such as correlations, t-tests and regressions.

# Evolution

Of course, these reporting standards may change, depending on new expert recommendations or official guidelines. The goal of this package is to adapt flexibly to such changes and to the evolution of good practices. Therefore, if you have any advice, opinions or suggestions, we encourage you to either let us know by opening an issue or, even better, try to implement them yourself by contributing to the code.

# Credits

Did this package help you? Don’t forget to cite the various packages you used.

You can cite psycho as follows:

• Makowski, (2018). The psycho Package: An Efficient and Publishing-Oriented Workflow for Psychological Science. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470

# On similar topics

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

## July 19, 2018

### Book Memo: “Tensor Numerical Methods in Scientific Computing”

The most difficult computational problems nowadays are those of higher dimensions. This research monograph offers an introduction to tensor numerical methods designed for the solution of the multidimensional problems in scientific computing. These methods are based on the rank-structured approximation of multivariate functions and operators by using the appropriate tensor formats. The old and new rank-structured tensor formats are investigated. We discuss in detail the novel quantized tensor approximation method (QTT) which provides function-operator calculus in higher dimensions in logarithmic complexity rendering super-fast convolution, FFT and wavelet transforms. This book suggests the constructive recipes and computational schemes for a number of real life problems described by the multidimensional partial differential equations. We present the theory and algorithms for the sinc-based separable approximation of the analytic radial basis functions including Green’s and Helmholtz kernels. The efficient tensor-based techniques for computational problems in electronic structure calculations and for the grid-based evaluation of long-range interaction potentials in multi-particle systems are considered. We also discuss the QTT numerical approach in many-particle dynamics, tensor techniques for stochastic/parametric PDEs as well as for the solution and homogenization of the elliptic equations with highly-oscillating coefficients.

Contents:

• Theory on separable approximation of multivariate functions
• Multilinear algebra and nonlinear tensor approximation
• Superfast computations via quantized tensor approximation
• Tensor approach to multidimensional integrodifferential equations

### Le Monde puzzle [#1061]

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A griddy Le Monde mathematical puzzle:

1. On a 4×5 regular grid, find how many nodes need to be turned on so that every one of the 3×4 unit squares has at least one active corner, even if one arbitrary node fails.
2. Repeat for a 7×9 grid.

The question is open to simulated annealing, as in the following R code:

n=3;m=4;np=n+1;mp=m+1

cvr=function(grue){
grud=grue
obj=(max(grue)==0)
for (i in (1:length(grue))[grue==1]){
grud[i]=0
obj=max(obj,max((1-grud[-1,-1])*(1-grud[-np,-mp])*
(1-grud[-np,-1])*(1-grud[-1,-mp])))
grud[i]=1}
obj=99*obj+sum(grue)
return(obj)}

dumban=function(grid,T=1e3,temp=1,beta=.99){
obj=bez=cvr(grid)
sprk=grid
for (t in 1:T){
grue=grid
if (max(grue)==1){ grue[sample(rep((1:length(grid))[grid==1],2),1)]=0
}else{ grue[sample(1:(np*mp),np+mp)]=1}
jbo=cvr(grue)
if (bez>jbo){ bez=jbo;sprk=grue}
if (log(runif(1))<(obj-jbo)/temp){
grid=grue;obj=cvr(grid)}
temp=temp*beta
}
return(list(top=bez,sol=sprk))}


>  dumban(grid,T=1e6,temp=100,beta=.9999)
$top
[1] 8

$sol
[,1] [,2] [,3] [,4] [,5]
[1,]    0    1    0    1    0
[2,]    0    1    0    1    0
[3,]    0    1    0    1    0
[4,]    0    1    0    1    0


which sounds like a potential winner.
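The solution can also be checked independently of the annealing run; here is a short Python verification (a sketch, not part of the original post) that the 8-node configuration, columns 2 and 4 switched on, leaves every unit square with an active corner even after any single node failure:

```python
from itertools import product

# Solution from the run above: columns 2 and 4 (1-indexed) fully on
active = {(r, c) for r, c in product(range(4), range(5)) if c in (1, 3)}

def covered(nodes, rows=4, cols=5):
    # Every unit square must keep at least one active corner
    return all(
        any((r + dr, c + dc) in nodes for dr in (0, 1) for dc in (0, 1))
        for r in range(rows - 1) for c in range(cols - 1))

# Robust: still covered after removing any single active node
robust = all(covered(active - {node}) for node in active)
print(len(active), robust)  # 8 True
```

Each unit square touches two active nodes in the same column, so losing any one node still leaves a live corner.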


### Simplify Streaming Stock Data Analysis Using Databricks Delta

Traditionally, real-time analysis of stock data was a complicated endeavor due to the complexities of maintaining a streaming system and ensuring transactional consistency of legacy and streaming data concurrently.  Databricks Delta helps solve many of the pain points of building a streaming system to analyze stock data in real-time.

In the following diagram, we provide a high-level architecture to simplify this problem.  We start by ingesting two different sets of data into two Databricks Delta tables. The two datasets are stock prices and fundamentals. After ingesting the data into their respective tables, we then join the data in an ETL process and write the data out into a third Databricks Delta table for downstream analysis.

In this blog post we will review:

• The current problems of running such a system
• How Databricks Delta addresses these problems
• How to implement the system in Databricks

Databricks Delta helps solve these problems by combining the scalability, streaming, and access to advanced analytics of Apache Spark with the performance and ACID compliance of a data warehouse.

## Traditional pain points prior to Databricks Delta

The pain points of a traditional streaming and data warehousing solution can be broken into two groups: data lake and data warehouse pains.

### Data Lake Pain Points

While data lakes allow you to flexibly store an immense amount of data in a file system, there are many pain points including (but not limited to):

• Consolidation of streaming data from many disparate systems is difficult.
• Updating data in a Data Lake is nearly impossible and much of the streaming data needs to be updated as changes are made. This is especially important in scenarios involving financial reconciliation and subsequent adjustments.
• Query speeds for a data lake are typically very slow.
• Optimizing storage and file sizes is very difficult and often requires complicated logic.

### Data Warehouse Pain Points

The power of a data warehouse is that you have a persistent, performant store of your data.  But the pain points for building modern continuous applications include (but are not limited to):

• Constrained to SQL queries; i.e. no machine learning or advanced analytics.
• Accessing streaming data and stored data together is very difficult if at all possible.
• Data warehouses do not scale very well.
• Tying compute and storage together makes using a warehouse very expensive.

## How Databricks Delta Solves These Issues

Databricks Delta (Databricks Delta Guide) is a unified data management system that brings data reliability and performance optimizations to cloud data lakes.  More succinctly, Databricks Delta takes the advantages of data lakes and data warehouses together with Apache Spark to allow you to do incredible things!

• Databricks Delta, along with Structured Streaming, makes it possible to analyze streaming and historical data together at data warehouse speeds.
• Using Databricks Delta tables as sources and destinations of streaming big data makes it easy to consolidate disparate data sources.
• Upserts are supported on Databricks Delta tables.
• Your streaming/data lake/warehousing solution has ACID compliance.
• Easily include machine learning scoring and advanced analytics into ETL and queries.
• Decouples compute and storage for a completely scalable solution.

## Implement your streaming stock analysis solution with Databricks Delta

Databricks Delta and Apache Spark do most of the work for our solution; you can try out the full notebook and follow along with the code samples below.   Let’s start by enabling Databricks Delta; as of this writing, Databricks Delta is in private preview so sign up at https://databricks.com/product/databricks-delta.

As noted in the preceding diagram, we have two datasets to process – one for fundamentals and one for price data.  To create our two Databricks Delta tables, we specify the .format("delta") against our DBFS locations.

# Create Fundamental Data (Databricks Delta table)
dfBaseFund = spark \\
.format('delta') \\

# Create Price Data (Databricks Delta table)
dfBasePrice = spark \\
.format('delta') \\


While we’re updating the stockFundamentals and stocksDailyPrices, we will consolidate this data through a series of ETL jobs into a consolidated view (stocksDailyPricesWFund).    With the following code snippet, we can determine the start and end date of available data and then combine the price and fundamentals data for that date range into DBFS.

# Requires: import datetime and pyspark.sql.functions as func
# (imported earlier in the notebook)

# Determine start and end date of available data
row = dfBasePrice.agg(
    func.max(dfBasePrice.price_date).alias("maxDate"),
    func.min(dfBasePrice.price_date).alias("minDate")
).collect()[0]
startDate = row["minDate"]
endDate = row["maxDate"]

# Define our date range function
def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + datetime.timedelta(n)

# Combine price and fundamentals information by date
def combinePriceAndFund(theDate):
    dfFund = dfBaseFund.where(dfBaseFund.price_date == theDate)
    dfPrice = dfBasePrice.where(
        dfBasePrice.price_date == theDate
    ).drop('price_date')
    # Drop the updated column
    dfPriceWFund = dfPrice.join(dfFund, ['ticker']).drop('updated')

    # Save data to DBFS
    dfPriceWFund \
        .write \
        .format('delta') \
        .mode('append') \
        .save('/delta/stocksDailyPricesWFund')

# Loop through dates to complete fundamentals + price ETL process
for single_date in daterange(
    startDate, (endDate + datetime.timedelta(days=1))
):
    print('Starting ' + single_date.strftime('%Y-%m-%d'))
    start = datetime.datetime.now()
    combinePriceAndFund(single_date)
    end = datetime.datetime.now()
    print(end - start)
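
The daterange helper above walks day by day from the start date up to, but not including, the end date; it can be exercised standalone:

```python
import datetime

def daterange(start_date, end_date):
    # Yield each date in [start_date, end_date)
    for n in range(int((end_date - start_date).days)):
        yield start_date + datetime.timedelta(n)

days = list(daterange(datetime.date(2018, 7, 1), datetime.date(2018, 7, 4)))
print(days[0], len(days))  # 2018-07-01 3
```

This half-open behavior is why the ETL loop adds one day to endDate: without it, the last available date would be skipped.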


Now we have a stream of consolidated fundamentals and price data that is being pushed into DBFS in the /delta/stocksDailyPricesWFund location.  We can build a Databricks Delta table by specifying .format("delta") against that DBFS location.

%scala
// Read the consolidated data as a stream
val dfPriceWithFundamentals = spark
.readStream
.format("delta")
.load("/delta/stocksDailyPricesWFund")

// Create temporary view of the data
dfPriceWithFundamentals.createOrReplaceTempView("priceWithFundamentals")


Now that we have created our initial Databricks Delta table, let’s create a view that will allow us to calculate the price/earnings ratio in real time (because of the underlying streaming data updating our Databricks Delta table).

%sql
CREATE OR REPLACE TEMPORARY VIEW viewPE AS
select ticker,
price_date,
first(close) as price,
(close/eps_basic_net) as pe
from priceWithFundamentals
where eps_basic_net > 0
group by ticker, price_date, pe


## Analyze streaming stock data in real time

With our view in place, we can quickly analyze our data using Spark SQL.

%sql
select *
from viewPE
where ticker == "AAPL"
order by price_date


As the underlying source of this consolidated dataset is a Databricks Delta table, this view isn’t just showing the batch data but also any new streams of data that are coming in as per the following streaming dashboard.

Underneath the covers, Structured Streaming isn’t just writing the data to Databricks Delta tables but also keeping the state of the distinct number of keys (in this case ticker symbols) that need to be tracked.

Because you are using Spark SQL, you can execute aggregate queries at scale and in real-time.

%sql
SELECT ticker, AVG(close) as Average_Close
FROM priceWithFundamentals
GROUP BY ticker
ORDER BY Average_Close


## Summary

In closing, we demonstrated how to simplify streaming stock data analysis using Databricks Delta.  By combining Spark Structured Streaming and Databricks Delta, we can use the Databricks integrated workspace to create a performant, scalable solution that has the advantages of both data lakes and data warehouses.  The Databricks Unified Analytics Platform removes the data engineering complexities commonly associated with streaming and transactional consistency, enabling data engineering and data science teams to focus on understanding the trends in their stock data.

--

The post Simplify Streaming Stock Data Analysis Using Databricks Delta appeared first on Databricks.

### Products for Product People: Best Practices in Analytics, July 24 Webinar

Learn product analytics best practices and the "meta" perspective from a practitioner who is building products that anybody, including product managers, can use to access, analyze, and act on data to make important decisions.

### What’s new on arXiv

Data-driven methods for modeling dynamic systems have received considerable attention as they provide a mechanism for control synthesis directly from the observed time-series data. In the absence of prior assumptions on how the time-series had been generated, regression on the system model has been particularly popular. In the linear case, the resulting least squares setup for model regression, not only provides a computationally viable method to fit a model to the data, but also provides useful insights into the modal properties of the underlying dynamics. Although probabilistic estimates for this model regression have been reported, deterministic error bounds have not been examined in the literature, particularly as they pertain to the properties of the underlying system. In this paper, we provide deterministic non-asymptotic error bounds for fitting a linear model to the observed time-series data, with a particular attention to the role of symmetry and eigenvalue multiplicity in the underlying system matrix.
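The least-squares model regression the abstract refers to can be made concrete: given snapshots of a linear system x_{t+1} = A x_t, stack the snapshots and solve for A. A minimal 2-D illustration (with hand-rolled 2×2 linear algebra to stay dependency-free; this is a sketch, not the paper's code):

```python
def mat2_inv(m):
    # Inverse of a 2x2 matrix
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat2_mul(m, n):
    # Product of two 2x2 matrices
    return [[sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def fit_dynamics(prev_states, next_states):
    # Fit A in x_{t+1} = A x_t; with two independent 2-D snapshots
    # the least-squares solution reduces to A = Y * inv(X)
    X = [list(col) for col in zip(*prev_states)]
    Y = [list(col) for col in zip(*next_states)]
    return mat2_mul(Y, mat2_inv(X))

# Generate data from a known system and recover it
A_true = [[0.9, 0.1], [0.0, 0.8]]
xs = [[1.0, 0.0], [1.0, 2.0]]
ys = [[A_true[0][0] * x[0] + A_true[0][1] * x[1],
       A_true[1][0] * x[0] + A_true[1][1] * x[1]] for x in xs]
A_hat = fit_dynamics(xs, ys)
print(A_hat)  # recovers A_true up to floating-point error
```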
Machine learning models benefit from large and diverse datasets. Using such datasets, however, often requires trusting a centralized data aggregator. For sensitive applications like healthcare and finance this is undesirable as it could compromise patient privacy or divulge trade secrets. Recent advances in secure and privacy-preserving computation, including trusted hardware enclaves and differential privacy, offer a way for mutually distrusting parties to efficiently train a machine learning model without revealing the training data. In this work, we introduce Myelin, a deep learning framework which combines these privacy-preservation primitives, and use it to establish a baseline level of performance for fully private machine learning.
Recent studies have shown that deep neural networks (DNNs) are vulnerable to adversarial attacks. To this end, many defense approaches that attempt to improve the robustness of DNNs have been proposed. In a separate and yet related area, recent works have explored quantizing neural network weights and activation functions into low bit-width to compress model size and reduce computational complexity. In this work, we find that these two different tracks, namely the pursuit of network compactness and robustness, can be merged into one and give rise to networks with both advantages. To the best of our knowledge, this is the first work that uses quantization of activation functions to defend against adversarial examples. We also propose to train robust neural networks by using adaptive quantization techniques for the activation functions. Our proposed Dynamic Quantized Activation (DQA) is verified through a wide range of experiments with the MNIST and CIFAR-10 datasets under different white-box attack methods, including FGSM, PGD, and C&W attacks. Furthermore, Zeroth Order Optimization and substitute-model-based black-box attacks are also considered in this work. The experimental results clearly show that the robustness of DNNs can be greatly improved using the proposed DQA.
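The basic building block the abstract discusses, quantizing an activation into a low bit-width, can be sketched with a simple uniform quantizer (an illustrative sketch, not the paper's adaptive DQA method):

```python
def quantize_activation(x, bits, max_val=1.0):
    # ReLU-style clip to [0, max_val], then snap to a uniform grid
    # of 2**bits levels (for bits=2: {0, 1/3, 2/3, 1} * max_val)
    levels = (1 << bits) - 1
    x = min(max(x, 0.0), max_val)
    return round(x / max_val * levels) / levels * max_val

acts = [0.03, 0.27, 0.62, 1.40]
quantized = [quantize_activation(a, 2) for a in acts]
print(quantized)
```

Shrinking the number of representable activation values both compresses the network and, per the abstract's claim, removes the fine-grained perturbation directions that adversarial attacks exploit.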
The term ‘interpretability’ is often used by machine learning researchers, each with their own intuitive understanding of it. There is no universal, well-agreed-upon definition of interpretability in machine learning. As any scientific discipline is mainly driven by its set of formulated questions rather than by its tools (e.g. astrophysics is the discipline that studies the composition of stars, not the discipline that uses spectroscopes), we propose that machine learning interpretability should be a discipline that answers specific questions related to interpretability. These questions can be of statistical, causal and counterfactual nature. Therefore, there is a need to look into the interpretability problem of machine learning in the context of the questions that need to be addressed rather than the available tools. We discuss a hypothetical interpretability framework driven by a question-based scientific approach rather than some specific machine learning model. Using a question-based notion of interpretability, we can step towards understanding the science of machine learning rather than its engineering. This notion will also help us understand any specific problem in more depth rather than relying solely on machine learning methods.
An important task for a recommender system is to provide interpretable explanations for the user; this is important for the credibility of the system. Current interpretable recommender systems tend to focus on certain features known to be important to the user and offer their explanations in a structured form. It is well known that user-generated reviews and feedback from reviewers have strong leverage over users’ decisions. On the other hand, recent text generation works have been shown to generate text of similar quality to human-written text, and we aim to show that generated text can be successfully used to explain recommendations. In this paper, we propose a framework consisting of popular review-oriented generation models aiming to create personalised explanations for recommendations. The interpretations are generated at both the character and word levels. We build a dataset containing reviewers’ feedback from the Amazon books review dataset. Our cross-domain experiments are designed to bridge from natural language processing to the recommender system domain. Besides language model evaluation methods, we employ DeepCoNN, a review-oriented recommender system using a deep neural network, to evaluate the recommendation performance of generated reviews by root mean square error (RMSE). We demonstrate that the synthetic personalised reviews have better recommendation performance than human-written reviews. To our knowledge, this presents the first machine-generated natural language explanations for rating prediction.
Recently, deep reinforcement learning (RL) methods have been applied successfully to multi-agent scenarios. Typically, these methods rely on a concatenation of agent states to represent the information content required for decentralized decision making. However, concatenation scales poorly to swarm systems with a large number of homogeneous agents as it does not exploit the fundamental properties inherent to these systems: (i) the agents in the swarm are interchangeable and (ii) the exact number of agents in the swarm is irrelevant. Therefore, we propose a new state representation for deep multi-agent RL based on mean embeddings of distributions. We treat the agents as samples of a distribution and use the empirical mean embedding as input for a decentralized policy. We define different feature spaces of the mean embedding using histograms, radial basis functions and a neural network learned end-to-end. We evaluate the representation on two well known problems from the swarm literature (rendezvous and pursuit evasion), in a globally and locally observable setup. For the local setup we furthermore introduce simple communication protocols. Of all approaches, the mean embedding representation using neural network features enables the richest information exchange between neighboring agents facilitating the development of more complex collective strategies.
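The appeal of mean embeddings is that they are invariant to both agent ordering and swarm size. A minimal sketch (illustrative only, not the paper's implementation) uses the histogram feature space mentioned above: each neighbor's state is a sample, and the normalized histogram is the empirical mean of one-hot bin features.

```python
def histogram_mean_embedding(neighbor_xs, bins, lo, hi):
    """Permutation- and count-invariant summary of neighbor states:
    a normalized histogram, i.e. the empirical mean of one-hot bin
    features over the sampled agents (1-D states for brevity)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in neighbor_xs:
        idx = min(int((x - lo) / width), bins - 1)  # clamp x == hi
        counts[idx] += 1
    n = max(len(neighbor_xs), 1)
    return [c / n for c in counts]
```

Because the embedding is a mean over samples, doubling the swarm (with the same distribution of states) leaves the policy input unchanged, which is exactly properties (i) and (ii) above.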
Previous studies have shown that linguistic features of a word such as possession, genitive or other grammatical cases can be employed in word representations of a named entity recognition (NER) tagger to improve the performance for morphologically rich languages. However, these taggers require external morphological disambiguation (MD) tools to function which are hard to obtain or non-existent for many languages. In this work, we propose a model which alleviates the need for such disambiguators by jointly learning NER and MD taggers in languages for which one can provide a list of candidate morphological analyses. We show that this can be done independent of the morphological annotation schemes, which differ among languages. Our experiments employing three different model architectures that join these two tasks show that joint learning improves NER performance. Furthermore, the morphological disambiguator’s performance is shown to be competitive.
Deep neural networks and decision trees operate on largely separate paradigms; typically, the former performs representation learning with pre-specified architectures, while the latter is characterised by learning hierarchies over pre-specified features with data-driven architectures. We unite the two via adaptive neural trees (ANTs), a model that incorporates representation learning into edges, routing functions and leaf nodes of a decision tree, along with a backpropagation-based training algorithm that adaptively grows the architecture from primitive modules (e.g., convolutional layers). We demonstrate that, whilst achieving over 99% and 90% accuracy on MNIST and CIFAR-10 datasets, ANTs benefit from (i) faster inference via conditional computation, (ii) increased interpretability via hierarchical clustering e.g. learning meaningful class associations, such as separating natural vs. man-made objects, and (iii) a mechanism to adapt the architecture to the size and complexity of the training dataset.
Many problems that appear in biomedical decision making, such as diagnosing disease and predicting response to treatment, can be expressed as binary classification problems. The costs of false positives and false negatives vary across application domains and receiver operating characteristic (ROC) curves provide a visual representation of this trade-off. Nonparametric estimators for the ROC curve, such as a weighted support vector machine (SVM), are desirable because they are robust to model misspecification. While weighted SVMs have great potential for estimating ROC curves, their theoretical properties were heretofore underdeveloped. We propose a method for constructing confidence bands for the SVM ROC curve and provide the theoretical justification for the SVM ROC curve by showing that the risk function of the estimated decision rule is uniformly consistent across the weight parameter. We demonstrate the proposed confidence band method and the superior sensitivity and specificity of the weighted SVM compared to commonly used methods in diagnostic medicine using simulation studies. We present two illustrative examples: diagnosis of hepatitis C and a predictive model for treatment response in breast cancer.
In this paper, we prove the first theoretical results on dependency leakage — a phenomenon in which learning on noisy clusters biases cross-validation and model selection results. This is a major concern for domains involving human record databases (e.g. medical, census, advertising), which are almost always noisy due to the effects of record linkage and which require special attention to machine learning bias. The proposed theoretical properties justify regularization choices in several existing statistical estimators and allow us to construct the first hypothesis test for cross-validation bias due to dependency leakage. Furthermore, we propose a novel matrix sketching technique which, along with standard function approximation techniques, enables dramatically improving the sample and computational scalability of existing estimators. Empirical results on several benchmark datasets validate our theoretical results and proposed methods.
The detection of software vulnerabilities (or vulnerabilities for short) is an important problem that has yet to be tackled, as manifested by many vulnerabilities reported on a daily basis. This calls for machine learning methods to automate vulnerability detection. Deep learning is attractive for this purpose because it does not require human experts to manually define features. Despite the tremendous success of deep learning in other domains, its applicability to vulnerability detection is not systematically understood. In order to fill this void, we propose the first systematic framework for using deep learning to detect vulnerabilities. The framework, dubbed Syntax-based, Semantics-based, and Vector Representations (SySeVR), focuses on obtaining program representations that can accommodate syntax and semantic information pertinent to vulnerabilities. Our experiments with 4 software products demonstrate the usefulness of the framework: we detect 15 vulnerabilities that are not reported in the National Vulnerability Database. Among these 15 vulnerabilities, 7 are unknown and have been reported to the vendors, and the other 8 have been ‘silently’ patched by the vendors when releasing newer versions of the products.
In this paper we show that restricting the representation layer of a Recurrent Neural Network (RNN) improves accuracy and reduces the depth of recursive training procedures in partially observable domains. Artificial neural networks have been shown to learn useful state representations for high-dimensional visual and continuous control domains. If the task at hand exhibits long dependencies back in time, these instantaneous feed-forward approaches are augmented with recurrent connections and trained with Backpropagation Through Time (BPTT). This unrolled training can become computationally prohibitive if the dependency structure is long, and while recent work on LSTMs and GRUs has improved upon naive training strategies, there is still room for improvement in computational efficiency and parameter sensitivity. In this paper we explore a simple modification to the classic RNN structure: restricting the state to be comprised of multi-step General Value Function predictions. We formulate an architecture called General Value Function Networks (GVFNs) and a corresponding objective that generalizes beyond previous approaches. We show that our GVFNs are significantly more robust to train, and facilitate accurate prediction with no gradients needed back in time in domains with substantial long-term dependencies.
To address deep neural networks’ (DNNs) need for huge training datasets and high computational cost, the so-called teacher-student (T-S) DNN, which transfers the knowledge of a T-DNN to an S-DNN, has been proposed. However, the existing T-S DNN has a limited range of use, and the knowledge of the T-DNN is insufficiently transferred to the S-DNN. To improve the quality of the transferred knowledge from the T-DNN, we propose a new knowledge distillation using singular value decomposition (SVD). In addition, we define knowledge transfer as a self-supervised task and suggest a way to continuously receive information from the T-DNN. Simulation results show that an S-DNN with a computational cost of 1/5 of the T-DNN can be up to 1.1% better than the T-DNN in terms of classification accuracy. Also, assuming the same computational cost, our S-DNN outperforms the S-DNN driven by the state-of-the-art distillation with a performance advantage of 1.79%. Code is available at https://…/SSKD_SVD.
User-based Collaborative Filtering (CF) is one of the most popular approaches to create recommender systems. This approach is based on finding the most relevant k users from whose rating history we can extract items to recommend. CF, however, suffers from data sparsity and the cold-start problem since users often rate only a small fraction of available items. One solution is to incorporate additional information into the recommendation process such as explicit trust scores that are assigned by users to others or implicit trust relationships that result from social connections between users. Such relationships typically form a very sparse trust network, which can be utilized to generate recommendations for users based on people they trust. In our work, we explore the use of a measure from network science, i.e. regular equivalence, applied to a trust network to generate a similarity matrix that is used to select the k-nearest neighbors for recommending items. We evaluate our approach on Epinions and we find that we can outperform related methods for tackling cold-start users in terms of recommendation accuracy.
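One common way to turn a sparse trust network into a dense similarity matrix is a Katz-style power series over the adjacency matrix, which is a standard approximation of regular equivalence (a hedged sketch; not necessarily the exact computation used in the paper):

```python
def katz_similarity(adj, alpha=0.1, iters=50):
    """Similarity from a trust network via the Katz-style series
    S = I + alpha*A + alpha^2*A^2 + ...  computed by fixed-point
    iteration S <- I + alpha * A * S (converges for small alpha)."""
    n = len(adj)
    S = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):
        S = [[(1.0 if i == j else 0.0) +
              alpha * sum(adj[i][k] * S[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    return S
```

On a 3-user chain (0 trusts 1, 1 trusts 2), user 0 ends up with a nonzero similarity to user 2 despite having no direct trust edge, which is what lets the k-nearest-neighbor step reach beyond direct connections for cold-start users.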
While existing work on neural architecture search (NAS) tunes hyperparameters in a separate post-processing step, we demonstrate that architectural choices and other hyperparameter settings interact in a way that can render this separation suboptimal. Likewise, we demonstrate that the common practice of using very few epochs during the main NAS and much larger numbers of epochs during a post-processing step is inefficient due to little correlation in the relative rankings for these two training regimes. To combat both of these problems, we propose to use a recent combination of Bayesian optimization and Hyperband for efficient joint neural architecture and hyperparameter search.
A long-standing problem in model free reinforcement learning (RL) is that it requires a large number of trials to learn a good policy, especially in environments with sparse rewards. We explore a method to increase the sample efficiency of RL when we have access to demonstrations. Our approach, which we call Backplay, uses a single demonstration to construct a curriculum for a given task. Rather than starting each training episode in the environment’s fixed initial state, we start the agent near the end of the demonstration and move the starting point backwards during the course of training until we reach the initial state. We perform experiments in a competitive four player game (Pommerman) and a path-finding maze game. We find that this weak form of guidance provides significant gains in sample complexity with a stark advantage in sparse reward environments. In some cases, standard RL did not yield any improvement while Backplay reached success rates greater than 50% and generalized to unseen initial conditions in the same amount of training time. Additionally, we see that agents trained via Backplay can learn policies superior to those of the original demonstration.
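The curriculum can be sketched as a schedule over demonstration start states (a simplified linear schedule for illustration; the paper uses staged windows that advance on success):

```python
def backplay_start(demo_len, step, total_steps, window=5):
    """Backplay-style curriculum (simplified): early in training the
    episode starts near the end of the demonstration; as training
    progresses the start point slides back toward the true initial
    state (index 0). Returns a (low, high) index range from which to
    sample the episode's start state."""
    frac = min(step / total_steps, 1.0)          # training progress in [0, 1]
    high = round((demo_len - 1) * (1.0 - frac))  # anchor start index
    return max(0, high - window), high
```

Starting near the goal means even a random policy stumbles into reward quickly, and the sliding window gradually hands the full task back to the agent.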
The performance of many machine learning techniques depends on the choice of an appropriate similarity or distance measure on the input space. Similarity learning (or metric learning) aims at building such a measure from training data so that observations with the same (resp. different) label are as close (resp. far) as possible. In this paper, similarity learning is investigated from the perspective of pairwise bipartite ranking, where the goal is to rank the elements of a database by decreasing order of the probability that they share the same label with some query data point, based on the similarity scores. A natural performance criterion in this setting is pointwise ROC optimization: maximize the true positive rate under a fixed false positive rate. We study this novel perspective on similarity learning through a rigorous probabilistic framework. The empirical version of the problem gives rise to a constrained optimization formulation involving U-statistics, for which we derive universal learning rates as well as faster rates under a noise assumption on the data distribution. We also address the large-scale setting by analyzing the effect of sampling-based approximations. Our theoretical results are supported by illustrative numerical experiments.
Structural estimation is an important methodology in empirical economics, and a large class of structural models are estimated through the generalized method of moments (GMM). Traditionally, selection of structural models has been performed based on model fit upon estimation, which uses the entire observed sample. In this paper, we propose a model selection procedure based on cross-validation (CV), which utilizes a sample-splitting technique to avoid issues such as over-fitting. While CV is widely used in machine learning communities, we are the first to prove its consistency for model selection in the GMM framework. Its empirical properties are compared to existing methods by simulations of IV regressions and an oligopoly market model. In addition, we propose a way to apply our method to the Mathematical Programming with Equilibrium Constraints (MPEC) approach. Finally, we apply our method to online-retail sales data to compare a dynamic market model to a static model.
This is the write-up of the talk I gave at the 23rd International Symposium on Mathematical Programming (ISMP) in Bordeaux, France, July 6th, 2018. The talk was a general overview of the state of the art of time-varying, mainly convex, optimization, with special emphasis on discrete-time algorithms and applications in energy and transportation. This write-up is mathematically correct, while its style is somewhat less formal than a standard paper.

### Make Your Oil and Gas Assets Smarter by Implementing Predictive Maintenance with Databricks

How to build an end-to-end predictive data pipeline with Databricks Delta and Spark Streaming
Try this notebook in Databricks

Maintaining assets such as compressors is an extremely complex endeavor: they are used in everything from small drilling rigs to deep-water platforms, the assets are located across the globe, and they generate terabytes of data daily.  The failure of just one of these compressors can cost millions of dollars per day in lost production. An important way to save time and money is to use machine learning to predict outages and issue maintenance work orders before the failure occurs.

Ultimately, you need to build an end-to-end predictive data pipeline that can provide a real-time database to maintain asset parts and sensor mappings, support a continuous application that processes a massive amount of telemetry, and allows you to predict compressor failures against these datasets.

Our approach to addressing these issues is to select a unified platform that offers these capabilities. Databricks provides a Unified Analytics Platform that brings together big data and AI, and allows the different personas of your organization to come together and collaborate in a single workspace.  Other important advantages of the Databricks Unified Analytics Platform include the ability to:

• Spin up the necessary resources so your data scientists, data engineers, and data analysts can make sense of their data quickly.
• Have a multi-cloud strategy allowing everyone to use the same collaborative workspace in Azure or AWS.
• Stand up a diverse set of instance type combinations to optimally run your workloads
• Schedule commands (including REST API commands) that allow you to auto-create and auto-terminate your clusters.
• Quickly and easily enable access control to assign permissions as well as enable access tokens for secure REST API calls when productionizing your solution.

In this blog post, we will show how you can make your oil and gas assets smarter by:

• Using Spark Streaming in Databricks to process the immense amount of sensor telemetry.
• Building and deploying your machine learning models to predict asset failures before they happen.
• Creating a real-time database using Databricks Delta to store and stream sensor parts and assets.

To predict catastrophic failures, we need to combine the asset sensors’ continuous stream of data from Kinesis with Spark Streaming and our Streaming K-Means model.  Let’s start by configuring our Kinesis stream using the code snippet below. To dive deeper, refer to Databricks – Amazon Kinesis Integration.

// === Configurations for Kinesis streams ===
val awsAccessKeyId = "YOUR ACCESS KEY ID"
val awsSecretKey = "YOUR SECRET KEY"
val kinesisStreamName = "YOUR STREAM NAME"
val kinesisRegion = "YOUR REGION" // e.g., "us-west-2"

import com.amazonaws.services.kinesis.model.PutRecordRequest
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials, DefaultAWSCredentialsProviderChain}
import java.nio.ByteBuffer
import scala.util.Random


With your credentials established, you can run a Spark Streaming query that reads words from Kinesis and counts them up with the following code snippet.

// Establish Kinesis Stream
val kinesis = spark.readStream
.format("kinesis")
.option("streamName", kinesisStreamName)
.option("region", kinesisRegion)
.option("initialPosition", "TRIM_HORIZON")
.option("awsAccessKey", awsAccessKeyId)
.option("awsSecretKey", awsSecretKey)
.load()

// Execute DataFrame query against the Kinesis stream
val result = kinesis.selectExpr("lcase(CAST(data as STRING)) as word")
.groupBy($"word")
.count()

// Display the output as a bar chart
display(result)

To populate your own Kinesis stream, write those words to it by creating a low-level Kinesis client such as in the following code snippet, which loops every ~5 seconds.

// Create the low-level Kinesis Client from the AWS Java SDK.
val kinesisClient = AmazonKinesisClientBuilder.standard()
.withRegion(kinesisRegion)
.withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(awsAccessKeyId, awsSecretKey)))
.build()
println(s"Putting words onto stream $kinesisStreamName")
var lastSequenceNumber: String = null

for (i <- 0 to 10) {
val time = System.currentTimeMillis
// Generate words: fox in sox
for (word <- Seq("Sensor1", "Sensor2", "Sensor3", "Sensor4", "Sensor1", "Sensor3", "Sensor4", "Sensor5", "Sensor2", "Sensor3","Sensor1", "Sensor2","Sensor1", "Sensor2")) {
val data = s"$word"
val partitionKey = s"$word"
val request = new PutRecordRequest()
.withStreamName(kinesisStreamName)
.withPartitionKey(partitionKey)
.withData(ByteBuffer.wrap(data.getBytes()))
if (lastSequenceNumber != null) {
request.setSequenceNumberForOrdering(lastSequenceNumber)
}
val result = kinesisClient.putRecord(request)
lastSequenceNumber = result.getSequenceNumber()
}
Thread.sleep(math.max(5000 - (System.currentTimeMillis - time), 0)) // loop around every ~5 seconds
}


Before we can build our model to predict healthy vs. damaged compressors, let’s start by doing a little data exploration.  First, we need to import our healthy and damaged compressor data; the following code snippet imports the healthy compressor data that is in CSV format into a Spark SQL DataFrame.

// Read healthy compressor readings (represented by H1 prefix)
val df = spark.read
.schema(StructType(
StructField("AN10", DoubleType, false) ::
StructField("AN3", DoubleType, false) ::
StructField("AN4", DoubleType, false) ::
StructField("AN5", DoubleType, false) ::
StructField("AN6", DoubleType, false) ::
StructField("AN7", DoubleType, false) ::
StructField("AN8", DoubleType, false) ::
StructField("AN9", DoubleType, false) ::
StructField("SPEED", DoubleType, false) :: Nil))
.csv(healthyDataPath) // healthyDataPath: placeholder for the location of the healthy compressor CSV files

// Create Healthy Compressor Spark SQL Table
df.write.saveAsTable("compressor_healthy")

val compressor_healthy = table("compressor_healthy")


We also save the data as a Spark SQL table so we can query it using Spark SQL.  For example, we can use the Databricks display command to view the table statistics of our damaged compressor table.

display(compressor_damaged.describe())

After taking a random sample of healthy and damaged data using the following code snippet:

// Obtain a random sample of healthy and damaged compressors
val randomSample = compressor_healthy.withColumn("ReadingType", lit("HEALTHY")).sample(false, 500/4800000.0)
.union(compressor_damaged.withColumn("ReadingType", lit("DAMAGED")).sample(false, 500/4800000.0))


we can use the Databricks display command to visualize our random sample of data using a scatter plot.

// View scatter plot of healthy vs. damaged compressor readings
display(randomSample)


## Building our Model

The next step in implementing our predictive maintenance model is to create a K-Means model to cluster our datasets and predict damaged vs. healthy compressors. In addition to K-Means being a popular and well-understood clustering algorithm, there is also the benefit of a streaming k-means variant, which allows us to easily run the same model in both batch and streaming scenarios.
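The update behind streaming k-means is simple; here is a minimal one-dimensional Python sketch (illustrative only — Spark's implementation additionally supports decay factors for forgetting old data):

```python
def streaming_kmeans_update(centers, counts, batch):
    """One mini-batch of streaming k-means: each point pulls its
    nearest center toward it with a step size of 1/count, so each
    center tracks the running mean of the points assigned to it."""
    for x in batch:
        j = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]
    return centers, counts
```

Because each batch only adjusts the running means, the same model object can score historical data in batch and fresh telemetry as it streams in.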

The first thing we want to do is determine the optimal k value (i.e. the optimal number of clusters). As we are currently distinguishing between healthy and damaged, intuitively the value of k is 2, but let’s validate that. As noted in the following code snippet, we will build an ML pipeline so we can easily re-use the model on a new dataset (i.e. the streaming dataset upstream). Our ML pipeline is relatively straightforward, using VectorAssembler to define our features from the Air and Noise columns (i.e. the columns prefixed with AN) and scaling them using MinMaxScaler.

import org.apache.spark.ml._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.clustering._
import org.apache.spark.mllib.linalg.Vectors

// Using KMeansModel
val models : Array[org.apache.spark.mllib.clustering.KMeansModel]  = new Array[org.apache.spark.mllib.clustering.KMeansModel](10)

// Use VectorAssembler to define our features based on the Air + Noise columns (and scale it)
val vectorAssembler = new VectorAssembler().setInputCols(compressor_healthy.columns.filter(_.startsWith("AN"))).setOutputCol("features")
val mmScaler = new MinMaxScaler().setInputCol("features").setOutputCol("scaled")

// Build our ML Pipeline
val pipeline = new Pipeline()
.setStages(Array(vectorAssembler, mmScaler))

// Build our model based on healthy compressor data
val prepModel = pipeline.fit(compressor_healthy)
val prepData = prepModel.transform(compressor_healthy).cache()

// Iterate to find the best K values
val maxIter = 20
val maxK = 5
val findBestK = for (k <- 2 to maxK) yield {
val kmeans = new KMeans().setK(k).setSeed(1L).setMaxIter(maxIter).setFeaturesCol("scaled")
val model = kmeans.fit(prepData)
val wssse = model.computeCost(prepData)
(k, wssse)
}


We run a number of iterations to determine the best k value; for the purpose of this demo, we limit ourselves to k values [2…5] and set the maximum iterations to 20. The goal is to iterate through the various k and WSSSE (Within Set Sum of Squared Errors) values; the optimal k value (the ideal number of clusters) is the one where there is an “elbow” in the WSSSE graph. We can also calculate the point with the highest derivative of the graph with the following code snippet.

// Convert the (k, wssse) results to a DataFrame
val kWssseDf = findBestK.toDF("k", "wssse")

// Calculate Derivative of WSSSE
val previousDf = kWssseDf.withColumn("k", $"k"-1).withColumnRenamed("wssse", "previousWssse")
val derivativeOfWssse = previousDf.join(kWssseDf, "k").selectExpr("k", "previousWssse - wssse derivative").orderBy($"k")

// find the point with the "highest" derivative
// i.e. optimal number of clusters is bestK = 2
val bestK = derivativeOfWssse
.select($"k", abs($"derivative").as("derivative")) // abs from org.apache.spark.sql.functions
.orderBy($"derivative".desc)
.first().getInt(0)
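The same elbow heuristic can be sketched in plain Python (a simplified stand-in for the DataFrame logic above, assuming consecutive integer k values and hypothetical WSSSE numbers):

```python
def elbow_k(wssse_by_k):
    """Pick the 'elbow': the k at which the marginal WSSSE improvement
    from adding one more cluster falls off most sharply.
    wssse_by_k: dict mapping k -> WSSSE for consecutive integer k."""
    ks = sorted(wssse_by_k)
    # gain[k] = improvement obtained by moving from k-1 to k clusters
    gain = {k: wssse_by_k[kp] - wssse_by_k[k] for kp, k in zip(ks, ks[1:])}
    gks = sorted(gain)
    # elbow: the largest fall-off between consecutive gains
    return max(gks[:-1], key=lambda k: gain[k] - gain[k + 1])
```

With WSSSE values that plummet once and then flatten out, the function returns the k just before the flattening, which is the visual elbow on the graph.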

### If you did not already know

Adversary Model
In computer science, an online algorithm measures its competitiveness against different adversary models. For deterministic algorithms, the adversary is the same as the adaptive offline adversary. For randomized online algorithms, competitiveness can depend upon the adversary model used. …

N-Gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on. …
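For instance, word-level n-grams can be extracted in a couple of lines:

```python
def ngrams(tokens, n):
    """All contiguous sequences of n items from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Calling `ngrams("to be or not to be".split(), 2)` yields the five bigrams of the phrase, including the repeated `("to", "be")`.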

Similarity-Based Imbalanced Classification (SBIC)
When the training data in a two-class classification problem is overwhelmed by one class, most classification techniques fail to correctly identify the data points belonging to the underrepresented class. We propose Similarity-based Imbalanced Classification (SBIC) that learns patterns in the training data based on an empirical similarity function. To take the imbalanced structure of the training data into account, SBIC utilizes the concept of absent data, i.e. data from the minority class which can help better find the boundary between the two classes. SBIC simultaneously optimizes the weights of the empirical similarity function and finds the locations of absent data points. As such, SBIC uses an embedded mechanism for synthetic data generation which does not modify the training dataset, but alters the algorithm to suit imbalanced datasets. Therefore, SBIC uses the ideas of both major schools of thoughts in imbalanced classification: Like cost-sensitive approaches SBIC operates on an algorithm level to handle imbalanced structures; and similar to synthetic data generation approaches, it utilizes the properties of unobserved data points from the minority class. The application of SBIC to imbalanced datasets suggests it is comparable to, and in some cases outperforms, other commonly used classification techniques for imbalanced datasets. …

### Supply chains based on modern slavery may reach into the West

In theory, slavery was completely abolished in 1981, when Mauritania became the last country to outlaw forced labour. In practice, however, it persists in many forms, some of them surprisingly blatant. In November CNN broadcast a grainy video depicting the auction of 12 migrant Nigerian men for farm work.

### compstak: Jr Data Analyst

Seeking talented full-time data analysts to help us improve our data pipeline in order to ensure high quality and better scalability. This includes data extraction, preprocessing, augmentation, and verification projects.

### Highlights from the useR! 2018 conference in Brisbane

The fourteenth annual worldwide R user conference, useR!2018, was held last week in Brisbane, Australia and it was an outstanding success. The conference attracted around 600 users from around the world and — as the first held in the Southern hemisphere — brought many first-time conference-goers to useR!. (There were also a number of beginning R users as well, judging from the attendance at the beginner's tutorial hosted by R-Ladies.) The program included 19 3-hour workshops, 6 keynote presentations, and more than 200 contributed talks, lightning talks, and posters on using, extending, and deploying R.

If you weren't able to make it to Brisbane, you can nonetheless relive the experience thanks to the recorded videos. Almost all of the tutorials, keynotes and talks are available to view for free, courtesy of the R Consortium. (A few remain to be posted, so keep an eye on the channel.) Here are a few of my personal highlights, based on talks I saw in Brisbane or have managed to catch online since then.

### Keynote talks

Steph de Silva, Beyond Syntax: on the power and potentiality of deep open source communities. A moving look at how open source communities, and especially R, grow and evolve.

Bill Venables, Adventures with R. It was wonderful to see the story and details behind an elegantly designed experiment investigating spoken language, and this example was used to great effect to contrast the definitions of "Statistics" and "Data Science". Bill also included the best piece of advice to give anyone joining a specialized group: "Everyone here is smart; distinguish yourself by being kind".

Kelly O'Brian's short history of RStudio was an interesting look at the impact of RStudio (the IDE and the company) on the R ecosystem.

Thomas Lin Pedersen, The Grammar of Graphics. A really thought-provoking talk about the place of animations in the sphere of data visualization, and an introduction to the gganimate package which extends ggplot2 in a really elegant and powerful way.

Danielle Navarro, R for Psychological Science. A great case study in introducing statistical programming to social scientists.

Roger Peng, Teaching R to New Users. A fascinating history of the R project, and how changes in the user community have been reflected in changes in programming frameworks. The companion essay summarizes the talk clearly and concisely.

Jenny Bryan, Code Smells. This was an amazing talk with practical recommendations for better R coding practices. The video isn't online yet, but the slides are available to view online.

### Contributed talks

Bryan Galvin, Moving from Prototype to Production in R, a look inside the machine learning infrastructure at Netflix. Who says R doesn't scale?

Peter Dalgaard, What's in a Name? The secrets of the R build and release process, and the story behind their codenames.

Martin Maechler, Helping R to be (even more) Accurate. On R's near-obsessive attention to the details of computational accuracy.

Rob Hyndman, Tidy Forecasting in R. The next generation of time series forecasting methods in R.

Nicholas Tierney, Maxcovr: Find the best locations for facilities using the maximal covering location problem. Giftastic!

David Smith, Speeding up computations in R with parallel programming in the cloud. My talk on the doAzureParallel package.

David Smith, The Voice of the R Community. My talk for the R Consortium with the results of their community survey.

In addition, several of my colleagues from Microsoft were in attendance (Microsoft was a proud Platinum sponsor of useR!2018) and delivered talks of their own:

Angus Taylor, Deep Learning at Scale with Azure Batch AI

Miguel Fierro, Spark on Demand with AZTK

Overall, I thought useR!2018 was a wonderful conference. Great talks, friendly people, and impeccably organized. Kudos to all of the organizing committee, and particularly Di Cook, for putting together such a fantastic event. Next year's conference will be held in Toulouse, France and already has a great set of keynote speakers announced. But in the meantime, you can catch up on the talks from useR!2018 at the R Consortium YouTube channel linked below.

### The ultimate list of Web Scraping tools and software

Here's your guide to pick the right web scraping tool for your specific data needs.

### How to Read an Excel file into R

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

## Installing the R package

Because flipAPI does not require external libraries that use Java or Perl, installation is very straightforward. Simply open R and type the following into the console:

install.packages("devtools")
devtools::install_github("Displayr/flipAPI")


Once the flipAPI package is installed, load it so that its import functions are available:

library(flipAPI)


## Data output format

In many cases, an Excel file contains multiple tables along with comments and other text. It is not necessary to reformat the file before importing: we can specify particular sheets or ranges to import.

cola1 = DownloadXLSX("https://wiki.q-researchsoftware.com/images/b/b9/Cola_Discriminant_Functions.xlsx", want.col.names = TRUE, range = "A2:G9")
cola2 = DownloadXLSX("https://wiki.q-researchsoftware.com/images/b/b9/Cola_Discriminant_Functions.xlsx", want.col.names = TRUE, want.row.names = FALSE, sheet = 2, range = "AB2:AC330")


To check the result of these commands, we type

str(cola1)


and see the output:

 num [1:7, 1:6] -3.4 2.653 -0.566 -0.458 -0.428 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:7] "Intercept" "Coca-Cola" "Diet Coke" "Coke Zero" ...
  ..$ : chr [1:6] "Coca-Cola" "Diet Coke" "Coke Zero" "Pepsi" ...


Similarly, we type str(cola2) and get output:

'data.frame':	328 obs. of  2 variables:
 $ Highest Score           : num  1.1202 1.8786 1.8311 3.6638 0.0754 ...
 $ Predicted Preferred Cola: Factor w/ 6 levels "Coca-Cola","Coke Zero",..: 6 6 3 1 3 6 2 1 3 2 ...


We can see that DownloadXLSX automatically parses and converts the data into the correct format. cola1, which contains only numeric data, is converted into a matrix, whereas cola2, which has both numeric and categorical data, is converted into a data frame.

## Importing Excel files from cloud storage


### Best (and Free!!) Resources to Understand Nuts and Bolts of Deep Learning

This blog post, however, is not aimed at the absolute beginner. Once you have a bit of intuition about how Deep Learning algorithms work, you might want to understand how things work under the hood.

### How to use goals to improve a product’s UX

Instrumenting each stage of the user journey allows you to track conversions and improve the user experience.


### Charles River Analytics: Software Engineer II – React.js Developer for Streaming Data Visualization

Seeking a React.js developer to join a top-notch team of software engineers and data scientists developing web-based solutions for both DoD and commercial customers.

### Charles River Analytics: Sr Software Engineer – Intelligent Systems

Seeking an experienced and enthusiastic Senior Software Engineer to design and develop cutting-edge intelligent systems applied to areas such as intelligent tutoring, serious games, crowdsourcing, skill modeling and assessment, and advanced visualization.

### OpenCV Tutorial: A Guide to Learn OpenCV

Whether you’re interested in learning how to apply facial recognition to video streams, building a complete deep learning pipeline for image classification, or simply want to tinker with your Raspberry Pi and add image recognition to a hobby project, you’ll need to learn OpenCV somewhere along the way.

The truth is that learning OpenCV used to be quite challenging. The documentation was hard to navigate. The tutorials were hard to follow and incomplete. And even some of the books were a bit tedious to work through.

The good news is learning OpenCV isn’t as hard as it used to be. And in fact, I’ll go as far as to say studying OpenCV has become significantly easier.

And to prove it to you (and help you learn OpenCV), I’ve put together this complete guide to learning the fundamentals of the OpenCV library using the Python programming language.

Let’s go ahead and get started learning the basics of OpenCV and image processing. By the end of today’s blog post, you’ll understand the fundamentals of OpenCV.


## OpenCV Tutorial: A Guide to Learn OpenCV

This OpenCV tutorial is for beginners just getting started learning the basics. Inside this guide, you’ll learn basic image processing operations using the OpenCV library using Python.

And by the end of the tutorial you’ll be putting together a complete project to count basic objects in images using contours.

While this tutorial is aimed at beginners just getting started with image processing and the OpenCV library, I encourage you to give it a read even if you have a bit of experience.

### Installing OpenCV and imutils on your system

The first step today is to install OpenCV on your system (if you haven’t already).

I maintain an OpenCV Install Tutorials page which contains links to previous OpenCV installation guides for Ubuntu, macOS, and Raspberry Pi.

You should visit that page and find + follow the appropriate guide for your system.

Once your fresh OpenCV development environment is set up, install the imutils package via pip. I have created and maintain imutils (source on GitHub) for the image processing community, and it is used heavily on my blog. You should install imutils in the same environment you installed OpenCV into — you’ll need it to work through this blog post as it will facilitate basic image processing operations:
$ pip install imutils

Note: If you are using Python virtual environments, don’t forget to use the workon command to enter your environment before installing imutils!

### OpenCV Project Structure

Before going too far down the rabbit hole, be sure to grab the code + images from the “Downloads” section of today’s blog post. From there, navigate to where you downloaded the .zip in your terminal (cd). We can then unzip the archive, change working directories (cd) into the project folder, and analyze the project structure via tree:

$ cd ~/Downloads
$ unzip opencv-tutorial.zip
$ cd opencv-tutorial
$ tree
.
├── jp.png
├── opencv_tutorial_01.py
├── opencv_tutorial_02.py
└── tetris_blocks.png

0 directories, 4 files

In this tutorial we’ll be creating two Python scripts to help you learn OpenCV basics:

1. Our first script, opencv_tutorial_01.py, will cover basic image processing operations using an image from the movie Jurassic Park (jp.png).

2. From there, opencv_tutorial_02.py will show you how to use these image processing building blocks to create an OpenCV application to count the number of objects in a Tetris image (tetris_blocks.png).

### Loading and displaying an image

Figure 1: Learning OpenCV basics with Python begins with loading and displaying an image — a simple process that requires only a few lines of code.

Let’s begin by opening up opencv_tutorial_01.py in your favorite text editor or IDE:

# import the necessary packages
import imutils
import cv2

# load the input image and show its dimensions, keeping in mind that
# images are represented as a multi-dimensional NumPy array with
# shape no. rows (height) x no. columns (width) x no. channels (depth)
image = cv2.imread("jp.png")
(h, w, d) = image.shape
print("width={}, height={}, depth={}".format(w, h, d))

# display the image to our screen -- we will need to click the window
# opened by OpenCV and press a key on our keyboard to continue execution
cv2.imshow("Image", image)
cv2.waitKey(0)

On Lines 2 and 3 we import both imutils and cv2. The cv2 package is OpenCV and despite the 2 embedded, it can actually be OpenCV 3 (or possibly OpenCV 4 which may be released later in 2018). The imutils package is my series of convenience functions.

Now that we have the required software at our fingertips via imports, let’s load an image from disk into memory. To load our Jurassic Park image (from one of my favorite movies), we call cv2.imread("jp.png"). As you can see on Line 8, we assign the result to image. Our image is actually just a NumPy array. Later in this script, we’ll need the height and width.
So on Line 9, I call image.shape to extract the height, width, and depth. It may seem confusing that the height comes before the width, but think of it this way:

• We describe matrices by # of rows x # of columns
• The number of rows is our height
• And the number of columns is our width

Therefore, the dimensions of an image represented as a NumPy array are actually represented as (height, width, depth). Depth is the number of channels — in our case this is three since we’re working with 3 color channels: Blue, Green, and Red.

The print command shown on Line 10 will output the values to the terminal:

width=600, height=322, depth=3

To display the image on the screen using OpenCV we employ cv2.imshow("Image", image) on Line 14. The subsequent line waits for a keypress (Line 15). This is important; otherwise our image would display and disappear faster than we’d even see it.

Note: You need to actually click the active window opened by OpenCV and press a key on your keyboard to advance the script. OpenCV cannot monitor your terminal for input, so if you press a key in the terminal OpenCV will not notice. Again, you will need to click the active OpenCV window on your screen and press a key on your keyboard.

### Accessing individual pixels

Figure 2: Top: grayscale gradient where brighter pixels are closer to 255 and darker pixels are closer to 0. Bottom: RGB venn diagram where brighter pixels are closer to the center.

First, you may ask: What is a pixel? All images consist of pixels, which are the raw building blocks of images. Images are made of pixels in a grid. A 640 x 480 image has 640 columns (the width) and 480 rows (the height). There are 640 * 480 = 307200 pixels in an image with those dimensions.

Each pixel in a grayscale image has a value representing the shade of gray. In OpenCV, there are 256 shades of gray — from 0 to 255. So a grayscale image would have a grayscale value associated with each pixel. Pixels in a color image have additional information.
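The grid arithmetic above can be checked with a tiny NumPy sketch (the array here is synthetic, standing in for a real grayscale image, so no image file is needed):

```python
import numpy as np

# a hypothetical 640 x 480 (width x height) grayscale image:
# NumPy stores it as (rows, cols) = (height, width)
gray = np.zeros((480, 640), dtype=np.uint8)
(h, w) = gray.shape
print("width={}, height={}".format(w, h))  # width=640, height=480
print(gray.size)                           # 307200 pixels
```

Note how the width comes out of the second axis and the height out of the first, exactly as described above.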
There are several color spaces that you’ll soon become familiar with as you learn about image processing. For simplicity, let’s only consider the RGB color space.

In OpenCV, color images in the RGB (Red, Green, Blue) color space have a 3-tuple associated with each pixel: (B, G, R).

Notice the ordering is BGR rather than RGB. This is because when OpenCV was first being developed many years ago, the standard was BGR ordering. Over the years, the standard has now become RGB, but OpenCV still maintains this “legacy” BGR ordering to ensure no existing code breaks.

Each value in the BGR 3-tuple has a range of [0, 255]. How many color possibilities are there for each pixel in an RGB image in OpenCV? That’s easy: 256 * 256 * 256 = 16777216.

Now that we know exactly what a pixel is, let’s see how to retrieve the value of an individual pixel in the image:

# access the RGB pixel located at x=50, y=100, keeping in mind that
# OpenCV stores images in BGR order rather than RGB
(B, G, R) = image[100, 50]
print("R={}, G={}, B={}".format(R, G, B))

As shown previously, our image dimensions are width=600, height=322, depth=3. We can access individual pixel values in the array by specifying the coordinates, so long as they are within the max width and height. The code, image[100, 50], yields a 3-tuple of BGR values from the pixel located at x=50 and y=100 (again, keep in mind that the height is the number of rows and the width is the number of columns — take a second now to convince yourself this is true). As stated above, OpenCV stores images in BGR ordering (unlike Matplotlib, for example). Check out how simple it is to extract the color channel values for the pixel on Line 19.

The resulting pixel value is shown on the terminal here:

R=41, G=49, B=37

### Array slicing and cropping

Extracting “regions of interest” (ROIs) is an important skill for image processing.

Say, for example, you’re working on recognizing faces in a movie.
First, you’d run a face detection algorithm to find the coordinates of faces in all the frames you’re working with. Then you’d want to extract the face ROIs and either save them or process them. Locating all frames containing Dr. Ian Malcolm in Jurassic Park would be a great face recognition mini-project to work on.

For now, let’s just manually extract an ROI. This can be accomplished with array slicing.

Figure 3: Array slicing with OpenCV allows us to extract a region of interest (ROI) easily.

# extract a 100x100 pixel square ROI (Region of Interest) from the
# input image starting at x=320,y=60 and ending at x=420,y=160
roi = image[60:160, 320:420]
cv2.imshow("ROI", roi)
cv2.waitKey(0)

Array slicing is shown on Line 24 with the format: image[startY:endY, startX:endX]. This code grabs an roi which we then display on Line 25. Just like last time, we display until a key is pressed (Line 26).

As you can see in Figure 3, we’ve extracted the face of Dr. Ian Malcolm. I actually predetermined the (x, y)-coordinates using Photoshop for this example, but if you stick with me on the blog you could detect and extract face ROIs automatically.

### Resizing images

Resizing images is important for a number of reasons. First, you might want to resize a large image to fit on your screen. Image processing is also faster on smaller images because there are fewer pixels to process. In the case of deep learning, we often resize images, ignoring aspect ratio, so that the volume fits into a network which requires that an image be square and of a certain dimension.

Let’s resize our original image to 200 x 200 pixels:

# resize the image to 200x200px, ignoring aspect ratio
resized = cv2.resize(image, (200, 200))
cv2.imshow("Fixed Resizing", resized)
cv2.waitKey(0)

On Line 29, we have resized an image ignoring aspect ratio. Figure 4 (right) shows that the image is resized but is now distorted because we didn’t take into account the aspect ratio.
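The fix is simple arithmetic: scale the height by the same ratio as the width. As a standalone sketch (resize_dims is a hypothetical helper written for illustration; it is not part of OpenCV or imutils):

```python
# compute distortion-free target dimensions for a given new width;
# resize_dims is a hypothetical helper for illustration only
def resize_dims(w, h, new_width):
    r = new_width / float(w)        # ratio of new width to old width
    return (new_width, int(h * r))  # scale the height by the same ratio

# our 600x322 Jurassic Park frame resized to 300px wide
print(resize_dims(600, 322, 300))  # (300, 161)
```

This is exactly the bookkeeping the next code block performs before calling cv2.resize.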
Figure 4: Resizing an image with OpenCV and Python can be conducted with cv2.resize, however aspect ratio is not preserved automatically.

Let’s calculate the aspect ratio of the original image and use it to resize an image so that it doesn’t appear squished and distorted:

# fixed resizing distorts the aspect ratio, so let's resize the width
# to be 300px but compute the new height based on the aspect ratio
r = 300.0 / w
dim = (300, int(h * r))
resized = cv2.resize(image, dim)
cv2.imshow("Aspect Ratio Resize", resized)
cv2.waitKey(0)

Recall back to Line 9 of this script where we extracted the width and height of the image. Let’s say that we want to take our 600-pixel wide image and resize it to 300 pixels wide while maintaining aspect ratio.

On Line 35 we calculate the ratio of the new width to the old width (which happens to be 0.5). From there, we specify our dimensions of the new image, dim. We know that we want a 300-pixel wide image, but we must calculate the height using the ratio by multiplying h by r (the original height and our ratio respectively).

Feeding dim (our dimensions) into the cv2.resize function, we’ve now obtained a new image named resized which is not distorted (Line 37). To check our work, we display the image using the code on Line 38:

Figure 5: Resizing images while maintaining aspect ratio with OpenCV is a three-step process: (1) extract the image dimensions, (2) compute the aspect ratio, and (3) resize the image (cv2.resize) along one dimension and multiply the other dimension by the aspect ratio. See Figure 6 for an even easier method.

But can we make this process of preserving aspect ratio during resizing even easier? Yes! Computing the aspect ratio each time we want to resize an image is a bit tedious, so I wrapped the code in a function within imutils.
Here is how you may use imutils.resize:

# manually computing the aspect ratio can be a pain, so let's use the
# imutils library instead
resized = imutils.resize(image, width=300)
cv2.imshow("Imutils Resize", resized)
cv2.waitKey(0)

In a single line of code, we’ve preserved aspect ratio and resized the image. Simple, right? All you need to provide is your target width or target height as a keyword argument (Line 43). Here’s the result:

Figure 6: If you’d like to maintain aspect ratio while resizing images with OpenCV and Python, simply use imutils.resize. Now your image won’t risk being “squished” as in Figure 4.

### Rotating an image

Let’s rotate our Jurassic Park image for our next example:

# let's rotate an image 45 degrees clockwise using OpenCV by first
# computing the image center, then constructing the rotation matrix,
# and then finally applying the affine warp
center = (w // 2, h // 2)
M = cv2.getRotationMatrix2D(center, -45, 1.0)
rotated = cv2.warpAffine(image, M, (w, h))
cv2.imshow("OpenCV Rotation", rotated)
cv2.waitKey(0)

Rotating an image about the center point requires that we first calculate the center (x, y)-coordinates of the image (Line 50).

Note: We use // to perform integer math (i.e., no floating point values).

From there we calculate a rotation matrix, M (Line 51). The -45 means that we’ll rotate the image 45 degrees clockwise. Recall from your middle/high school geometry class about the unit circle and you’ll be able to remind yourself that positive angles are counterclockwise and negative angles are clockwise.

From there we warp the image using the matrix (effectively rotating it) on Line 52.
The rotated image is displayed to the screen on Line 53 and is shown in Figure 7:

Figure 7: Rotating an image with OpenCV about the center point requires three steps: (1) compute the center point using the image width and height, (2) compute a rotation matrix with cv2.getRotationMatrix2D, and (3) use the rotation matrix to warp the image with cv2.warpAffine.

Now let’s perform the same operation in just a single line of code using imutils:

# rotation can also be easily accomplished via imutils with less code
rotated = imutils.rotate(image, -45)
cv2.imshow("Imutils Rotation", rotated)
cv2.waitKey(0)

Since I don’t have to rotate images as often as I resize them (comparatively), I find the rotation process harder to remember. Therefore, I created a function in imutils to handle it for us. In a single line of code, I can accomplish rotating the image 45 degrees clockwise (Line 57) as in Figure 8:

Figure 8: With imutils.rotate, we can rotate an image with OpenCV and Python conveniently with a single line of code.

At this point you have to be thinking: Why in the world is the image clipped? The thing is, OpenCV doesn’t care if our image is clipped and out of view after the rotation. I find this to be quite bothersome, so here’s my imutils version which will keep the entire image in view. I call it rotate_bound:

# OpenCV doesn't "care" if our rotated image is clipped after rotation
# so we can instead use another imutils convenience function to help
# us out
rotated = imutils.rotate_bound(image, 45)
cv2.imshow("Imutils Bound Rotation", rotated)
cv2.waitKey(0)

There’s a lot going on behind the scenes of rotate_bound. If you’re interested in how the method on Line 64 works, be sure to check out this blog post.

The result is shown in Figure 9:

Figure 9: The rotate_bound function of imutils will prevent OpenCV from clipping the image during a rotation. See this blog post to learn how it works!

Perfect!
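For a taste of what a rotate_bound-style function must compute, here is a sketch of the bounding-box math only (this is not the actual imutils implementation, just the core geometric idea):

```python
import numpy as np

# size of the axis-aligned bounding box that holds an entire w x h
# image after rotating it by angle_deg about its center
def rotated_bounds(w, h, angle_deg):
    theta = np.deg2rad(angle_deg)
    cos, sin = abs(np.cos(theta)), abs(np.sin(theta))
    new_w = int(w * cos + h * sin)
    new_h = int(h * cos + w * sin)
    return (new_w, new_h)

# our 600x322 frame rotated 45 degrees needs a roughly square canvas
print(rotated_bounds(600, 322, 45))
```

A full implementation would also shift the rotation matrix’s translation component so the image lands centered in this larger canvas.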
The entire image is in the frame and it is correctly rotated 45 degrees clockwise.

### Smoothing an image

In many image processing pipelines, we must blur an image to reduce high-frequency noise, making it easier for our algorithms to detect and understand the actual contents of the image rather than just noise that will “confuse” our algorithms. Blurring an image is very easy in OpenCV and there are a number of ways to accomplish it.

Figure 10: This image has undergone a Gaussian blur with an 11 x 11 kernel using OpenCV. Blurring is an important step of many image processing pipelines to reduce high-frequency noise.

I often use the GaussianBlur function:

# apply a Gaussian blur with an 11x11 kernel to the image to smooth it,
# useful when reducing high frequency noise
blurred = cv2.GaussianBlur(image, (11, 11), 0)
cv2.imshow("Blurred", blurred)
cv2.waitKey(0)

On Line 70 we perform a Gaussian blur with an 11 x 11 kernel, the result of which is shown in Figure 10. Larger kernels yield a more blurry image; smaller kernels create less blurry images. To read more about kernels, refer to this blog post or the PyImageSearch Gurus course.

### Drawing on an image

In this section, we’re going to draw a rectangle, circle, and line on an input image. We’ll also overlay text on an image as well.

Before we move on with drawing on an image with OpenCV, take note that drawing operations on images are performed in-place. Therefore, at the beginning of each code block, we make a copy of the original image, storing the copy as output. We then proceed to draw on the image called output in-place so we do not destroy our original image.

Let’s draw a rectangle around Ian Malcolm’s face:

# draw a 2px thick red rectangle surrounding the face
output = image.copy()
cv2.rectangle(output, (320, 60), (420, 160), (0, 0, 255), 2)
cv2.imshow("Rectangle", output)
cv2.waitKey(0)

First, we make a copy of the image on Line 75 for reasons just explained. Then we proceed to draw the rectangle.
Drawing rectangles in OpenCV couldn’t be any easier. Using pre-calculated coordinates, I’ve supplied the following parameters to the cv2.rectangle function on Line 76:

• img : The destination image to draw upon. We’re drawing on output.
• pt1 : Our starting pixel coordinate, which is the top-left. In our case, the top-left is (320, 60).
• pt2 : The ending pixel — bottom-right. The bottom-right pixel is located at (420, 160).
• color : BGR tuple. To represent red, I’ve supplied (0, 0, 255).
• thickness : Line thickness (a negative value will make a solid rectangle). I’ve supplied a thickness of 2.

Since we are using OpenCV’s functions rather than NumPy operations, we can supply our coordinates in (x, y) order rather than (y, x), since we are not manipulating or accessing the NumPy array directly — OpenCV is taking care of that for us.

Here’s our result in Figure 11:

Figure 11: Drawing shapes with OpenCV and Python is an easy skill to pick up. In this image, I’ve drawn a red box using cv2.rectangle. I pre-determined the coordinates around the face for this example, but you could use a face detection method to automatically find the face coordinates.

And now let’s place a solid blue circle in front of Dr. Ellie Sattler’s face:

# draw a blue 20px (filled in) circle on the image centered at
# x=300,y=150
output = image.copy()
cv2.circle(output, (300, 150), 20, (255, 0, 0), -1)
cv2.imshow("Circle", output)
cv2.waitKey(0)

To draw a circle, you need to supply the following parameters to cv2.circle:

• img : The output image.
• center : Our circle’s center coordinate. I supplied (300, 150), which is right in front of Ellie’s eyes.
• radius : The circle radius in pixels. I provided a value of 20 pixels.
• color : Circle color. This time I went with blue, as is denoted by 255 in the B and 0s in the G + R components of the BGR tuple, (255, 0, 0).
• thickness : The line thickness. Since I supplied a negative value (-1), the circle is solid/filled in.
Here’s the result in Figure 12:

Figure 12: OpenCV’s cv2.circle method allows you to draw circles anywhere on an image. I’ve drawn a solid circle for this example as is denoted by the -1 line thickness parameter (positive values will make a circular outline with variable line thickness).

It looks like Ellie is more interested in the dinosaurs than my big blue dot, so let’s move on!

Next, we’ll draw a red line. This line goes through Ellie’s head, past her eye, and to Ian’s hand. If you look carefully at the method parameters and compare them to those of the rectangle, you’ll notice that they are identical:

# draw a 5px thick red line from x=60,y=20 to x=400,y=200
output = image.copy()
cv2.line(output, (60, 20), (400, 200), (0, 0, 255), 5)
cv2.imshow("Line", output)
cv2.waitKey(0)

Just as in a rectangle, we supply two points, a color, and a line thickness. OpenCV’s backend does the rest. Figure 13 shows the result of Line 89 from the code block:

Figure 13: Similar to drawing rectangles and circles, drawing a line in OpenCV using cv2.line only requires a starting point, ending point, color, and thickness.

Oftentimes you’ll find that you want to overlay text on an image for display purposes. If you’re working on face recognition you’ll likely want to draw the person’s name above their face. Or if you advance in your computer vision career you may build an image classifier or object detector. In these cases, you’ll find that you want to draw text containing the class name and probability.

Let’s see how OpenCV’s putText function works:

# draw green text on the image
output = image.copy()
cv2.putText(output, "OpenCV + Jurassic Park!!!", (10, 25),
	cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)
cv2.imshow("Text", output)
cv2.waitKey(0)

The putText function of OpenCV is responsible for drawing text on an image. Let’s take a look at the required parameters:

• img : The output image.
• text : The string of text we’d like to write/draw on the image.
• pt : The starting point for the text.
• font : I often use the cv2.FONT_HERSHEY_SIMPLEX. The available fonts are listed here.
• scale : Font size multiplier.
• color : Text color.
• thickness : The thickness of the stroke in pixels.

The code on Lines 95 and 96 will draw the text, “OpenCV + Jurassic Park!!!”, in green on our output image in Figure 14:

Figure 14: Oftentimes, you’ll find that you want to display text on an image for visualization purposes. Using the cv2.putText code shown above you can practice overlaying text on an image with different colors, fonts, sizes, and/or locations.

### Running the first OpenCV tutorial Python script

In my blog posts, I generally provide a section detailing how you can run the code on your computer. At this point in the blog post, I make the following assumptions:

1. You have downloaded the code from the “Downloads” section of this blog post.

2. You have unzipped the files.

3. You have installed OpenCV and the imutils library on your system.

To execute our first script, open a terminal or command window and navigate to the files, or extract them if necessary. From there, enter the following command:

$ python opencv_tutorial_01.py
width=600, height=322, depth=3
R=41, G=49, B=37

The command is everything after the bash prompt $ character. Just type python opencv_tutorial_01.py in your terminal and then the first image will appear.

To cycle through each step that we just learned, make sure an image window is active, and press any key.

Our first couple of code blocks above told Python to print information in the terminal. If your terminal is visible, you’ll see the terminal output (Lines 2 and 3) shown.

I’ve also included a GIF animation demonstrating all the image processing steps we took sequentially, one right after the other:

Figure 15: Output animation displaying the OpenCV fundamentals we learned from this first example Python script.

### Counting objects

Now we’re going to shift gears and work on the second script included in the “Downloads” associated with this blog post.

In the next few sections we’ll learn how to create a simple Python + OpenCV script to count the number of Tetris blocks in the following image:

Figure 16: If you’ve ever played Tetris (who hasn’t?), you’ll recognize these familiar shapes. In the 2nd half of this OpenCV fundamentals tutorial, we’re going to find and count the shape contours.

Along the way we’ll be:

• Learning how to convert images to grayscale with OpenCV
• Performing edge detection
• Thresholding a grayscale image
• Finding, counting, and drawing contours
• Conducting erosion and dilation
• Masking an image

Go ahead and close the first script you downloaded and open up opencv_tutorial_02.py to get started with the second example:

# import the necessary packages
import argparse
import imutils
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image")
args = vars(ap.parse_args())

On Lines 2-4 we import our packages. This is necessary at the start of each Python script. For this second script, I’ve imported argparse — a command line arguments parsing package which comes with all installations of Python.
Take a quick glance at Lines 7-10. These lines allow us to provide additional information to our program at runtime from within the terminal. Command line arguments are used heavily on the PyImageSearch blog and in all other computer science fields as well. I encourage you to read about them in this post: Python, argparse, and command line arguments.

We have one required command line argument, --image, as is defined on Lines 8 and 9. We’ll learn how to run the script with the required command line argument down below. For now, just know that wherever you encounter args["image"] in the script, we’re referring to the path to the input image.

### Converting an image to grayscale

# load the input image (whose path was supplied via command line
# argument) and display the image to our screen
image = cv2.imread(args["image"])
cv2.imshow("Image", image)
cv2.waitKey(0)

# convert the image to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
cv2.imshow("Gray", gray)
cv2.waitKey(0)

We load the image into memory on Line 14. The parameter to the cv2.imread function is our path contained in the args dictionary referenced with the "image" key, args["image"]. From there, we display the image until we encounter our first keypress (Lines 15 and 16).

We’re going to be thresholding and detecting edges in the image shortly. Therefore, we convert the image to grayscale on Line 19 by calling cv2.cvtColor and providing the image and the cv2.COLOR_BGR2GRAY flag. Again we display the image and wait for a keypress (Lines 20 and 21).

The result of our conversion to grayscale is shown in Figure 17 (bottom).

Figure 17: (top) Our Tetris image. (bottom) We’ve converted the image to grayscale — a step that comes before thresholding.

### Edge detection

Edge detection is useful for finding boundaries of objects in an image — it is effective for segmentation purposes.
Let’s perform edge detection to see how the process works:

# applying edge detection we can find the outlines of objects in
# images
edged = cv2.Canny(gray, 30, 150)
cv2.imshow("Edged", edged)
cv2.waitKey(0)

Using the popular Canny algorithm (developed by John F. Canny in 1986), we can find the edges in the image. We provide three parameters to the cv2.Canny function:

• img : The gray image.
• minVal : A minimum threshold, in our case 30.
• maxVal : The maximum threshold, which is 150 in our example.
• aperture_size : The Sobel kernel size. By default this value is 3 and hence is not shown on Line 25.

Different values for the minimum and maximum thresholds will return different edge maps. In Figure 18 below, notice how the edges of the Tetris blocks themselves are revealed along with the sub-blocks that make up the Tetris blocks:

Figure 18: To conduct edge detection with OpenCV, we make use of the Canny algorithm.

### Thresholding

Image thresholding is an important intermediary step for image processing pipelines. Thresholding can help us to remove lighter or darker regions and contours of images.

I highly encourage you to experiment with thresholding. I tuned the following code to work for our example by trial and error (as well as experience):

# threshold the image by setting all pixel values less than 225
# to 255 (white; foreground) and all pixel values >= 225 to 0
# (black; background), thereby segmenting the image
thresh = cv2.threshold(gray, 225, 255, cv2.THRESH_BINARY_INV)[1]
cv2.imshow("Thresh", thresh)
cv2.waitKey(0)

In a single line (Line 32) we are:

• Grabbing all pixels in the gray image greater than 225 and setting them to 0 (black), which corresponds to the background of the image
• Setting pixel values less than 225 to 255 (white), which corresponds to the foreground of the image (i.e., the Tetris blocks themselves)

For more information on the cv2.threshold function, including how the thresholding flags work, be sure to refer to the official OpenCV documentation.
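To see exactly what that single threshold call is doing, here is a pure-NumPy sketch of the same binary inverse threshold on a toy array (for intuition only; a real pipeline would keep the cv2.threshold call):

```python
import numpy as np

# binary inverse threshold at 225: pixels above the threshold become
# 0 (background), everything else becomes 255 (foreground)
gray = np.array([[10, 240],
                 [225, 100]], dtype=np.uint8)
thresh = np.where(gray > 225, 0, 255).astype(np.uint8)
print(thresh)  # the bright 240 pixel becomes background (0)
```

The dark pixels (the Tetris blocks, in our case) come out white, which is exactly what the contour-finding step requires.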
Segmenting foreground from background with a binary image is critical to finding contours (our next step).

Figure 19: Prior to finding contours, we threshold the grayscale image. We performed a binary inverse threshold so that the foreground shapes become white while the background becomes black.

Notice in Figure 19 that the foreground objects are white and the background is black.

### Detecting and drawing contours

Figure 20: We’re working towards finding contour shapes with OpenCV and Python in this OpenCV Basics tutorial.

Pictured in the Figure 20 animation, we have 6 shape contours. Let’s find and draw their outlines via code:

# find contours (i.e., outlines) of the foreground objects in the
# thresholded image
cnts = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL,
    cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if imutils.is_cv2() else cnts[1]
output = image.copy()

# loop over the contours
for c in cnts:
    # draw each contour on the output image with a 3px thick purple
    # outline, then display the output contours one at a time
    cv2.drawContours(output, [c], -1, (240, 0, 159), 3)
    cv2.imshow("Contours", output)
    cv2.waitKey(0)

On Lines 38 and 39, we use cv2.findContours to detect the contours in the image. Take note of the parameter flags, but for now let’s keep things simple — our algorithm is finding all foreground (white) pixels in the thresh.copy() image.

Line 40 is very important, accounting for the fact that the cv2.findContours implementation changed between OpenCV 2.4 and OpenCV 3. This compatibility line is present on the blog wherever contours are involved.

We make a copy of the original image on Line 41 so that we can draw contours on subsequent Lines 44-49. On Line 47 we draw each contour c from the cnts list on the image using the appropriately named cv2.drawContours . I chose purple, which is represented by the tuple (240, 0, 159) .
Using what we learned earlier in this blog post, let’s overlay some text on the image:

# draw the total number of contours found in purple
text = "I found {} objects!".format(len(cnts))
cv2.putText(output, text, (10, 25), cv2.FONT_HERSHEY_SIMPLEX, 0.7,
    (240, 0, 159), 2)
cv2.imshow("Contours", output)
cv2.waitKey(0)

Line 52 builds a text string containing the number of shape contours. Counting the total number of objects in this image is as simple as checking the length of the contours list — len(cnts) . The result is shown in Figure 21:

Figure 21: Counting contours with OpenCV is as easy as finding them and then calling len(cnts).

### Erosions and dilations

Erosions and dilations are typically used to reduce noise in binary images (a side effect of thresholding). To reduce the size of foreground objects we can erode away pixels given a number of iterations:

# we apply erosions to reduce the size of foreground objects
mask = thresh.copy()
mask = cv2.erode(mask, None, iterations=5)
cv2.imshow("Eroded", mask)
cv2.waitKey(0)

On Line 59 we copy the thresh image while naming it mask . Then, utilizing cv2.erode , we proceed to reduce the contour sizes with 5 iterations (Line 60). Demonstrated in Figure 22, the masks generated from the Tetris contours are slightly smaller:

Figure 22: Using OpenCV we can erode contours, effectively making them smaller or causing them to disappear completely with sufficient iterations. This is typically useful for removing small blobs in a mask image.

Similarly, we can enlarge foreground regions in the mask. To enlarge the regions, simply use cv2.dilate :

# similarly, dilations can increase the size of the foreground objects
mask = thresh.copy()
mask = cv2.dilate(mask, None, iterations=5)
cv2.imshow("Dilated", mask)
cv2.waitKey(0)

Figure 23: In an image processing pipeline, if you ever have the need to connect nearby contours, you can apply dilation to the image.
Shown in the figure is the result of dilating the contours with five iterations, but not to the point of two contours becoming one.

### Masking and bitwise operations

Masks allow us to “mask out” regions of an image we are uninterested in. We call them “masks” because they will hide regions of images we do not care about.

If we use the thresh image from Figure 19 and mask it with the original image, we’re presented with Figure 24:

Figure 24: When using the thresholded image as the mask in comparison to our original image, the colored regions reappear as the rest of the image is “masked out”.

This is, of course, a simple example, but as you can imagine, masks are very powerful. In Figure 24, the background is now black and our foreground consists of colored pixels — any pixels allowed through by our mask image.

Let’s learn how to accomplish this:

# a typical operation we may want to apply is to take our mask and
# apply a bitwise AND to our input image, keeping only the masked
# regions
mask = thresh.copy()
output = cv2.bitwise_and(image, image, mask=mask)
cv2.imshow("Output", output)
cv2.waitKey(0)

The mask is generated by copying the binary thresh image (Line 73). From there we bitwise AND the pixels from both images together using cv2.bitwise_and . The result is Figure 24 above, where now we’re only showing/highlighting the Tetris blocks.

### Running the second OpenCV tutorial Python script

To run the second script, be sure you’re in the folder containing your downloaded source code and Python scripts. From there, we’ll open up a terminal and provide the script name + command line argument:

$ python opencv_tutorial_02.py --image tetris_blocks.png

The argument flag is --image and the image argument itself is tetris_blocks.png — a path to the relevant file in the directory.

There is no terminal output for this script. Again, to cycle through the images, be sure you click on an image window to make it active; from there you can press a key to advance to the next waitKey(0) call in the script. When the program is finished running, your script will exit gracefully and you’ll be presented with a new bash prompt line in your terminal.

Below I have included a GIF animation of the basic OpenCV image processing steps in our example script:

Figure 25: Learning OpenCV and the basics of computer vision by counting objects via contours.

If you’re looking to continue learning OpenCV and computer vision, be sure to take a look at my book, Practical Python and OpenCV.

Inside the book we’ll explore the OpenCV fundamentals we discussed here today in more detail.

You’ll also learn how to use these fundamentals to build actual computer vision + OpenCV applications, including:

• Face detection in images and video
• Handwriting recognition
• Feature extraction and machine learning
• Basic object tracking
• …and more!

## Summary

In today’s blog post you learned the fundamentals of image processing and OpenCV using the Python programming language.

You are now prepared to start using these image processing operations as “building blocks” you can chain together to build an actual computer vision application — a great example of such a project is the basic object counter we created by counting contours.

I hope this tutorial helped you learn OpenCV!

The post OpenCV Tutorial: A Guide to Learn OpenCV appeared first on PyImageSearch.

### Explaining the 68-95-99.7 rule for a Normal Distribution

This post explains how those numbers were derived in the hope that they can be more interpretable for your future endeavors.
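The derivation can also be sanity-checked numerically. Below is a small sketch (my own illustration, not code from the linked post): for a normal distribution, the probability of landing within k standard deviations of the mean is erf(k/√2), which reproduces the 68-95-99.7 figures.

```python
import math

def within_k_sd(k):
    """P(|X - mu| <= k*sigma) for X ~ Normal(mu, sigma^2).

    The standardized variable Z = (X - mu)/sigma is standard normal,
    and P(|Z| <= k) = erf(k / sqrt(2)).
    """
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print("within {} sd: {:.4f}".format(k, within_k_sd(k)))
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```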

### Charles River Analytics: Software Engineer II – Intelligent Systems

Seeking an experienced and enthusiastic Software Engineer to design and develop cutting-edge intelligent systems applied to areas such as intelligent tutoring, serious games, crowdsourcing, skill modeling and assessment, and advanced visualization.

### “The idea of replication is central not just to scientific practice but also to formal statistics . . . Frequentist statistics relies on the reference set of repeated experiments, and Bayesian statistics relies on the prior distribution which represents the population of effects.”

Rolf Zwaan (who we last encountered here in “From zero to Ted talk in 18 simple steps”), Alexander Etz, Richard Lucas, and M. Brent Donnellan wrote an article, “Making replication mainstream,” which begins:

Many philosophers of science and methodologists have argued that the ability to repeat studies and obtain similar results is an essential component of science. . . . To address the need for an integrative summary, we review various types of replication studies and then discuss the most commonly voiced concerns about direct replication. We provide detailed responses to these concerns and consider different statistical ways to evaluate replications. We conclude there are no theoretical or statistical obstacles to making direct replication a routine aspect of psychological science.

The article was published in Behavioral and Brain Sciences, a journal that runs articles with many discussants (see here for an example from a few years back).

I wrote a discussion, “Don’t characterize replications as successes or failures”:

No replication is truly direct, and I recommend moving away from the classification of replications as “direct” or “conceptual” to a framework in which we accept that treatment effects vary across conditions. Relatedly, we should stop labeling replications as successes or failures and instead use continuous measures to compare different studies, again using meta-analysis of raw data where possible. . . .

I also agree that various concerns about the difficulty of replication should, in fact, be interpreted as arguments in favor of replication. For example, if effects can vary by context, this provides more reason why replication is necessary for scientific progress. . . .

It may well make sense to assign lower value to replications than to original studies, when considered as intellectual products, as we can assume the replication requires less creative effort. When considered as scientific evidence, however, the results from a replication can well be better than those of the original study, in that the replication can have more control in its design, measurement, and analysis. . . .

Beyond this, I would like to add two points from a statistician’s perspective.

First, the idea of replication is central not just to scientific practice but also to formal statistics, even though this has not always been recognized. Frequentist statistics relies on the reference set of repeated experiments, and Bayesian statistics relies on the prior distribution which represents the population of effects—and in the analysis of replication studies it is important for the model to allow effects to vary across scenarios.

My second point is that in the analysis of replication studies I recommend continuous analysis and multilevel modeling (meta-analysis), in contrast to the target article, which recommends binary decision rules that I think are contrary to the spirit of inquiry that motivates replication in the first place.
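To make the meta-analytic recommendation concrete, here is a minimal sketch of random-effects pooling of several replication estimates. The DerSimonian-Laird estimator used below is my choice for illustration; the discussion itself does not prescribe a particular estimator.

```python
import math

def random_effects_meta(estimates, variances):
    """DerSimonian-Laird random-effects pooling.

    estimates: per-study effect estimates y_i
    variances: per-study sampling variances v_i
    Returns (pooled_effect, pooled_se, tau2), where tau2 is the
    estimated between-study variance.
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]                  # fixed-effect weights
    y_fe = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - y_fe) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                # between-study variance
    w_re = [1.0 / (v + tau2) for v in variances]      # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, se, tau2

# three hypothetical replication studies: effect estimates and sampling variances
pooled, se, tau2 = random_effects_meta([0.5, 0.1, 0.3], [0.01, 0.01, 0.01])
```

A positive tau2 quantifies how much the effects vary across studies, which is exactly the continuous, varying-effects view advocated above, as opposed to a binary replicated/failed verdict.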

Jennifer Tackett and Blake McShane wrote a discussion, “Conceptualizing and evaluating replication across domains of behavioral research,” which begins:

We discuss the authors’ conceptualization of replication, in particular the false dichotomy of direct versus conceptual replication intrinsic to it, and suggest a broader one that better generalizes to other domains of psychological research. We also discuss their approach to the evaluation of replication results and suggest moving beyond their dichotomous statistical paradigms and employing hierarchical / meta-analytic statistical models.

Also relevant is this talk on Bayes, statistics, and reproducibility from earlier this year.

### Five Ways Data Is Assisting the Construction & Civil Engineering Industry

If you think Big Data only relates to business intelligence or online analytics, you really need to read this article. Data is absolutely taking over almost every sector, even the ones you may not immediately associate with data — for instance, construction management and civil engineering. The construction management…

The post Five Ways Data Is Assisting the Construction & Civil Engineering Industry appeared first on Dataconomy.

### Analysis: Do the shoes matter in marathon running?

Kevin Quealy and Josh Katz for The Upshot analyzed shoe and running data to see if Nike’s Vaporfly running shoes really helped marathoners achieve faster times. Accounting for a number of confounding factors, the results appear to point to yes.

We found that the difference was not explained by faster runners choosing to wear the shoes, by runners choosing to wear them in easier races or by runners switching to Vaporflys after running more training miles. Instead, the analysis suggests that, in a race between two marathoners of the same ability, a runner wearing Vaporflys would have a real advantage over a competitor not wearing them.

Very statistics-y, even for The Upshot. I like it.

It takes me back to my fourth grade science fair project where I asked: Do Nikes really make you jump higher? Our results pointed to yes, too, although our sample size of five with no control or statistical rigor might not stand up to more technical standards. My Excel charts were dope though.


### Automated Text Feature Engineering using textfeatures in R

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

It could be the era of Deep Learning, where it really doesn’t matter how big your dataset is or how many columns you’ve got. Still, many Kaggle competition winners and data scientists emphasize that one thing can put you at the top of a competition leaderboard: “Feature Engineering”. Irrespective of how sophisticated your model is, good features will always help your machine learning model building process more than anything else.

### What is Feature engineering?

Features are nothing but columns/dimensions, and feature engineering is the process of creating new features or predictors based on domain knowledge or statistical principles. Feature engineering has always been part of machine learning, but the latest development is automated feature engineering, which has recently gained traction as researchers have started using machine learning itself to create new features that improve model accuracy. While most automated feature engineering addresses numeric data, text data has always been left out of this race because of its inherently unstructured nature. No more, I could say.

### textfeatures – R package

Michael Kearney, Assistant Professor at the University of Missouri, well known in the R community for the modern twitter package rtweet, has come up with a new R package called textfeatures that generates a bunch of features for any text data you supply. Before you dream of a deep-learning-based package for automated text feature engineering, this isn’t that. It uses very simple text analysis principles and generates features like the number of upper case letters and the number of punctuation marks: plain simple stuff, nothing fancy, but pretty useful.

### Installation

textfeatures can be installed directly from CRAN and the development version is available on github.

install.packages("textfeatures")


## Use Case

In this post, we will use the textfeatures package to generate features for the FIFA official World Cup iOS app reviews from the UK. We will use the R package itunesr to extract the reviews and tidyverse for data manipulation and plotting.

Let us load all the required packages.

#install.packages("itunesr")
#install.packages("textfeatures")
#install.packages("tidyverse")

library(itunesr)
library(textfeatures)
library(tidyverse)



### Extracting recent Reviews:

#Get UK Reviews of Fifa official world cup ios app
#https://itunes.apple.com/us/app/2018-fifa-world-cup-russia/id756904853?mt=8

reviews1 <- getReviews(756904853,"GB",1)
reviews2 <- getReviews(756904853,"GB",2)
reviews3 <- getReviews(756904853,"GB",3)
reviews4 <- getReviews(756904853,"GB",4)

#Combining all the reviews into one dataframe
reviews <- rbind(reviews1,
reviews2,
reviews3,
reviews4)


### textfeatures Magic Begins:

Now that we have the reviews, let us allow textfeatures to do its magic. We will use the function textfeatures() to do that.


# generate text features
feat <- textfeatures(reviews$Review)

# check what all features generated
glimpse(feat)

Observations: 200
Variables: 17
$ n_chars         149, 13, 263, 189, 49, 338, 210, 186, 76, 14, 142, 114, 242, ...
$ n_commas        1, 0, 0, 0, 0, 1, 2, 1, 1, 0, 1, 1, 0, 3, 0, 0, 1, 0, 3, 1, 0...
$ n_digits        0, 0, 6, 3, 0, 4, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0...
$ n_exclaims      0, 0, 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0...
$ n_extraspaces   1, 0, 0, 0, 0, 3, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 4, 0...
$ n_hashtags      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ n_lowers        140, 11, 225, 170, 46, 323, 195, 178, 70, 12, 129, 106, 233, ...
$ n_lowersp       0.9400000, 0.8571429, 0.8560606, 0.9000000, 0.9400000, 0.9557...
$ n_mentions      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ n_periods       2, 0, 5, 1, 0, 0, 3, 1, 2, 0, 2, 1, 1, 2, 0, 0, 4, 2, 0, 4, 0...
$ n_urls          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ n_words         37, 2, 55, 45, 12, 80, 42, 41, 14, 3, 37, 28, 50, 16, 15, 8, ...
$ n_caps          4, 1, 12, 8, 2, 7, 3, 4, 2, 2, 6, 4, 6, 2, 3, 1, 6, 4, 29, 9,...
$ n_nonasciis     0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ n_puncts        2, 1, 7, 0, 1, 3, 4, 1, 1, 0, 4, 2, 2, 0, 1, 0, 1, 1, 0, 7, 2...
$ n_capsp         0.03333333, 0.14285714, 0.04924242, 0.04736842, 0.06000000, 0...
$ n_charsperword  3.947368, 4.666667, 4.714286, 4.130435, 3.846154, 4.185185, 4...



As you can see above, textfeatures has created 17 new features. Please note that these same features are generated for any text data.

### Visualizing the outcome:

For this post, we won’t build a machine learning model, but these features could very well be used to build a classification model, such as sentiment classification or category classification.

But right now, we will just visualize the outcome with some features.

We can see if there is any relationship between the number of characters and the number of characters per word, with respect to the review rating. A hypothesis could be that people who give a good rating don’t write long reviews (or the reverse). We are not going to validate it here, just visualize it using a scatter plot.

# merging features with original reviews
reviews_all <- bind_cols(reviews, feat)

reviews_all %>%
ggplot(aes(n_charsperword, n_chars, colour = Rating)) + geom_point()


Gives this plot:

Let’s bring a different perspective to the same hypothesis with a different plot, this time comparing against the number of words instead of the number of characters.

reviews_all %>%
ggplot(aes(n_charsperword, n_words)) + geom_point() +
facet_wrap(~Rating) +
stat_smooth()


Gives this plot:

Thus, you can use textfeatures to automatically generate new features and gain a better understanding of your text data. I hope this post helps you get started with this beautiful package; if you’d like to learn more about text analysis, check out this tutorial by Julia Silge. The complete code used here is available on my GitHub.

Related Post

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### EARL US Roadshow 2018 – agenda announced

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

We are delighted to announce that the full EARL US Roadshow agendas are now available! Each city has a different line-up of world-class speakers and R experts for you to explore. EARL focuses on the business application of R and will showcase many fascinating and inspiring use cases from a range of industry sectors.

We wanted to share with you just some of the talks we are looking forward to in each city:

### Seattle – 7 November

See the full agenda.

R use cases: Examining influence of roadway design features on drivers’ speed choice in Washington – Joonbum Lee, Battelle Memorial Institute
Joonbum will present on R use cases for transportation safety and how he uses R to identify roadway characteristics associated with speeding – he is an industrial engineering Ph.D. with a background in Human Factors and driving safety.

Building and deploying an image recognition model in R using Google Cloud and RStudio – Christopher Crosbie, Google
For R users looking to leverage cloud computing, Christopher’s presentation will cover building an image recognition model using Google Cloud ML Engine and R Studio and then deploying it to the web via an R-Shiny dashboard.

### Houston – 9 November

See the full agenda.

Hadley Wickham, RStudio and Robert Gentleman, 23andMe
We are beyond thrilled to have not only one of the originators of the R language but also R’s most famous package developer joining us in Houston. We can’t wait to hear from both of our incredible keynote speakers.

R-bots for Data Science workflow automation – Sydeaka Watson, AT&T Chief Data Office
Sydeaka’s talk will cover the subject of R-bots and how to produce a truly end-to-end solution for data science work flow automation.

### Boston – 13 November

See the full agenda.

Predicting likelihood of engine failure in light duty trucks – Omari Faakye, Holman Strategic Ventures
Omari will be talking about the importance of predicting possible component failure to help avoid or reduce the increasing cost of breakdowns. His talk will walk through the challenging process of feature selection and engineering, training the model and final deployment all from a single laptop.

Not Hotdog: Image recognition with R and the Custom Vision API – David Smith, Microsoft
In David’s talk, he will use R in conjunction with the Microsoft Custom Vision API to train and use a custom vision recognizer. He will be using an example motivated by the TV series ‘Silicon Valley’ and with just a couple of hundred images of food, create a function in R that can detect whether or not a given image contains a hot dog.

We hope you’re as excited as we are about the EARL US Roadshow, we can’t wait to meet R-enthusiasts from all over the US.

Make sure you don’t miss out on our early bird tickets – which are on sale until 31 August. Get yours now!


### Specialized hardware for deep learning will unleash innovation

The O’Reilly Data Show Podcast: Andrew Feldman on why deep learning is ushering a golden age for compute architecture.

In this episode of the Data Show, I spoke with Andrew Feldman, founder and CEO of Cerebras Systems, a startup in the blossoming area of specialized hardware for machine learning. Since the release of AlexNet in 2012, we have seen an explosion in activity in machine learning, particularly in deep learning. A lot of the work to date happened primarily on general purpose hardware (CPU, GPU). But now that we’re six years into the resurgence in interest in machine learning and AI, these new workloads have attracted technologists and entrepreneurs who are building specialized hardware for both model training and inference, in the data center or on edge devices.

Continue reading Specialized hardware for deep learning will unleash innovation.

### Four Short Links: 19 July 2018

Microrobotics, Adaptive Chips, ACM Ethics, and Data Journalism

1. DARPA's Insect-Scale Robot Olympics (IEEE) -- Yesterday, DARPA announced a new program called SHRIMP: SHort-Range Independent Microrobotic Platforms. The goal is “to develop and demonstrate multi-functional micro-to-milli robotic platforms for use in natural and critical disaster scenarios.”
2. DARPA Changing How Electronics Are Made (IEEE) -- Step two, to be kicked off at the summit, is something we call “software-defined hardware.” That’s where the hardware is smart enough to reconfigure itself to be the type of hardware you want, based on an analysis of the data type that you’re working on. In that case, the very hard thing is to figure out how to do that data introspection, how to reconfigure the chip on a microsecond or millisecond timescale to be what you need it to be. And more importantly, it has to monitor whether you’re right or not, so that you can iterate and be constantly evolving toward the ideal solution.
3. ACM Updates Ethics Code -- ACM revised their code of ethics to include references to emerging technology, discrimination, and data policy. They're also releasing case studies and an Ask An Ethicist advice column to help people understand how to apply the principles.
4. Data Journalism Workshop Notes -- Harkanwal Singh gave a workshop on data journalism, which yielded these excellent notes via Liza Bolton.

### CSJob: PhD and Postdoc positions KU Leuven: Optimization frameworks for deep kernel machines

Johan let me know of the following positions in his group:

Dear Igor,
could you please announce this on nuit blanche.
many thanks,
Johan

Sure thing Johan !

PhD and Postdoc positions KU Leuven: Optimization frameworks for deep kernel machines
The research group KU Leuven ESAT-STADIUS is currently offering 2 PhD and 1 Postdoc (1 year, extendable) positions within the framework of the KU Leuven C1 project Optimization frameworks for deep kernel machines (promotors: Prof. Johan Suykens and Prof. Panos Patrinos).
Deep learning and kernel-based learning are among the very powerful methods in machine learning and data-driven modelling. From an optimization and model representation point of view, training of deep feedforward neural networks occurs in a primal form, while kernel-based learning is often characterized by dual representations, in connection to possibly infinite dimensional problems in the primal. In this project we aim at investigating new optimization frameworks for deep kernel machines, with feature maps and kernels taken at multiple levels, and with possibly different objectives for the levels. The research hypothesis is that such an extended framework, including both deep feedforward networks and deep kernel machines, can lead to new important insights and improved results. In order to achieve this, we will study optimization modelling aspects (e.g. variational principles, distributed learning formulations, consensus algorithms), accelerated learning
The PhD and Postdoc positions in this KU Leuven C1 project (promotors: Prof. Johan Suykens and Prof. Panos Patrinos) relate to the following  possible topics:
-1- Optimization modelling for deep kernel machines
-2- Efficient learning schemes for deep kernel machines
-3- Adversarial learning for deep kernel machines
For further information and to apply online, see
https://www.kuleuven.be/personeel/jobsite/jobs/54740654 (PhD positions) and
https://www.kuleuven.be/personeel/jobsite/jobs/54740649 (Postdoc position)
(click EN for the English version).
The research group ESAT-STADIUS http://www.esat.kuleuven.be/stadius at the university KU Leuven Belgium provides an excellent research environment being active in the broad area of mathematical engineering, including data-driven modelling, neural networks and machine learning, nonlinear systems and complex networks, optimization, systems and control, signal processing, bioinformatics and biomedicine.


In the last post, I linked to the Philadelphia Inquirer article disclosing how admissions staff at Temple University gamed rankings. They mentioned specific techniques. Here is an annotated version, with my comments:

Inflating GMAT Scores

The most interesting technique is converting GRE scores to GMAT scores before computing and reporting the average GMAT score. Presumably, those taking GREs are better students because they are applying to MS and PhD programs which require GRE and do not allow GMAT as a substitute. So, by such conversion, the average GMAT score is inflated, according to the rules which require including only those applicants with GMAT scores.

Is converting GRE to GMAT a bad thing? Not necessarily! The average GMAT score only represents the part of the applicant pool that submitted GMAT scores. As discussed above, this group is likely to be less academically brilliant than the other group who submitted GRE scores. (Chances are the average GRE score is separately reported.)

What causes trouble is the proportion of applicants submitting GRE versus GMAT scores. If a school has lots of applicants who are applying to MS/PhD programs and submit GRE scores instead, then the average GMAT score will likely be dragged down.

I actually think having a standard conversion formula between the two tests is great. All schools can then be evaluated on one average test score that takes into account the mixture of GREs and GMATs. The question is: is there a standard conversion formula?

Yes, the ETS provides one. This formula is based on the subset of people who have taken both tests. You can then use one score to predict the other score. Here is a PDF that explains the methodology. (By using this formula, we make the assumption that this subset of test-takers can be generalized to the subset of test-takers who took the GRE, did not take the GMAT, and are applying to business schools.)
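The methodology amounts to fitting a regression of GMAT on GRE among people who took both tests, then applying that fit to GRE-only applicants. Here is a minimal sketch with made-up scores (the real ETS formula is fit on their own sample and is not reproduced here):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# hypothetical applicants who took BOTH tests
gre = [310, 320, 330, 340]
gmat = [590, 640, 690, 740]
slope, intercept = fit_line(gre, gmat)

# predicted GMAT for a GRE-only applicant scoring 325
predicted = slope * 325 + intercept
```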

Under-reporting student debt

The reported average student debt was diluted by including students without debt. The rating body has requested that the metric be computed only for students with debt. Mixing in zeroes brings down the average.

There are two different averages to be considered: the average debt held by students who have debt; and the average debt for all students. If we want to evaluate the school's financial aid policies, the average debt for all students gives a better answer. If you are a prospective student who will be taking out a student loan, the average debt held by students with debt is a better measure of the amount of loans required.

The link between the two metrics is the proportion of students who have debt. If the school's policy is to "spread the wealth", then the average debt load will be lower but a higher proportion of students will have debt.
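The link between the two averages is just multiplication by the proportion of students with debt. A quick sketch with made-up numbers:

```python
def avg_debt_all_students(avg_debt_of_debtors, prop_with_debt):
    """Average debt over ALL students, including the debt-free."""
    return avg_debt_of_debtors * prop_with_debt

# hypothetical school: debtors owe $40,000 on average and 60% hold debt
a = avg_debt_all_students(40_000, 0.60)

# "spread the wealth": more students hold smaller debts,
# yet the school-wide average is unchanged
b = avg_debt_all_students(30_000, 0.80)

assert abs(a - 24_000) < 1e-6 and abs(b - 24_000) < 1e-6
```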

Rounding up GPAs

It appeared that they crudely rounded up the average GPA. The example was rounding 3.22 to 3.30. The 0.08 increase looks innocent but applied to the average, this means adding 0.08 to everyone's GPA (to be more accurate, for each student whose GPA is 3.92 or higher, someone else's GPA got inflated by more than 0.08).
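The claim that inflating the reported average is equivalent to inflating every student's GPA follows from the linearity of the average. A small sketch with made-up GPAs:

```python
gpas = [3.0, 3.2, 3.4, 3.3]
avg = sum(gpas) / len(gpas)

# reporting avg + 0.08 is, in aggregate, the same as handing every
# student an extra 0.08 of GPA (no GPA here exceeds 3.92, so the
# 4.0 cap never binds)
inflated = [min(g + 0.08, 4.0) for g in gpas]
assert abs(sum(inflated) / len(inflated) - (avg + 0.08)) < 1e-9
```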

Under-reporting the number of admission offers

Selectivity is the number of offers divided by the number of applicants. Inflating the number of applicants or deflating the number of offers lowers this rate, making the school appear more selective. According to the investigation, they blatantly lied by under-counting the number of offers. In the previous post, I described a number of techniques that are more subtle: you generate more applicants, but of the kind you are unlikely to make offers to.
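A quick sketch of the arithmetic with hypothetical numbers shows why both levers work:

```python
def selectivity(offers, applicants):
    """Acceptance rate: lower looks more selective in the rankings."""
    return offers / applicants

honest = selectivity(3_000, 10_000)           # 30% acceptance rate
under_counted = selectivity(2_000, 10_000)    # under-report offers
padded_pool = selectivity(3_000, 15_000)      # recruit unlikely applicants

assert under_counted < honest and padded_pool < honest
```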

For even more tricks, read Chapter 1 of my book Numbersense.

### Distilled News

Visualizations should make the most important features of your data stand out, but too often what's important gets lost in the minefield of data. Now you can highlight systematic changes from random noise by adding trend lines to your chart! In Displayr, Visualizations of chart type Column, Bar, Area, Line and Scatter all support trend lines. Trend lines can be linear or non-parametric (cubic spline, Friedman's super-smoother or LOESS).
Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). It is a fault-tolerant collection of elements that supports parallel operations. RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
Spark SQL is the part of the Apache Spark big data framework designed for processing structured and semi-structured data. It provides a DataFrame API that simplifies and accelerates data manipulations. A DataFrame is a special type of object, conceptually similar to a table in a relational database. It represents a distributed collection of data organized into named columns. DataFrames can be created from external sources, retrieved with a query from a database, or converted from an RDD; the inverse transform is also possible. This abstraction is designed for sampling, filtering, aggregating, and visualizing data. In this blog post, we're going to show you how to load a DataFrame and perform basic operations on DataFrames with both the API and SQL. We'll also go through DataFrame-to-RDD and vice-versa conversions.
The vast possibilities of artificial intelligence are of increasing interest in the field of modern information technologies. One of its most promising and evolving directions is machine learning (ML), which is becoming an essential part of various aspects of our lives. ML has found successful applications in Natural Language Processing, Face Recognition, Autonomous Vehicles, Fraud Detection, Machine Vision and many other fields. Machine learning uses mathematical algorithms that can solve specific tasks in a way analogous to the human brain. Depending on the training method, ML algorithms can be divided into supervised (with labeled data), unsupervised (with unlabeled data), semi-supervised (both labeled and unlabeled data in the dataset) and reinforcement (based on receiving a reward) learning. Solving the most basic and popular ML tasks, such as classification and regression, is mainly based on supervised learning algorithms. Among the variety of existing ML tools, Spark MLlib is a popular and easy-to-start library which enables training neural networks to solve the problems mentioned above. In this post, we consider a classification task: we will classify Iris plants into 3 categories according to the size of their sepals and petals. The public dataset with the Iris classification is available here. To move forward, download the file bezdekIris.data to the working folder.
Spark is a powerful tool which can be applied to solve many interesting problems, some of which have been discussed in our previous posts. Today we will consider another important application, namely streaming. Streaming data is data which continuously arrives as small records from different sources. There are many use cases for streaming technology, such as sensor monitoring in industrial or scientific devices, server log checking, financial market monitoring, etc. In this post, we will examine the case of sensor temperature monitoring. For example, we have several sensors (1, 2, 3, 4, …) in our device. Their state is defined by the following parameters: date (dd/mm/year), sensor number, state (1 – stable, 0 – critical), and temperature (degrees Celsius). The sensor-state data arrives as a stream, and we want to analyze it. Streaming data can be loaded from different sources. As we don't have a real streaming data source, we have to simulate it. For this purpose, we can use Kafka, Flume, or Kinesis, but the simplest streaming data simulator is Netcat.
While models and algorithms garner most of the media coverage, this is a great time to be thinking about building tools in data.
For some advanced use cases, users might need to mount more than one data and/or output volume. Polyaxon provides a way to mount multiple volumes so that users can choose which volume(s) to mount for a specific job or experiment.
Shogun is an open-source machine learning library that offers a wide range of machine learning algorithms. From my point of view, it is not very popular among professionals, but it has a lot of fans among enthusiasts and students. The library offers a unified API for its algorithms, so they can be easily managed; the approach is somewhat similar to scikit-learn's. There is a set of examples that can help you learn the library, but comprehensive documentation is missing.
Data Visualization is a big part of a data scientist's job. In the early stages of a project, you'll often be doing an Exploratory Data Analysis (EDA) to gain some insights into your data. Creating visualizations really helps make things clearer and easier to understand, especially with larger, high-dimensional datasets. Towards the end of your project, it's important to be able to present your final results in a clear, concise, and compelling manner that your audience, who are often non-technical clients, can understand. Matplotlib is a popular Python library that can be used to create your Data Visualizations quite easily. However, setting up the data, parameters, figures, and plotting can get quite messy and tedious to do every time you start a new project. In this blog post, we're going to look at 6 data visualizations and write some quick and easy functions for them with Python's Matplotlib. In the meantime, here's a great chart for selecting the right visualization for the job!
Getting an AI startup to scale for an IPO is currently elusive. Several different strategies are being discussed around the industry and here we talk about the horizontal strategy and the increasingly favored vertical strategy.
Natural language processing (NLP) is getting very popular today, which became especially noticeable against the background of deep learning's development. NLP is a field of artificial intelligence aimed at understanding and extracting important information from text, and further training based on text data. The main tasks include speech recognition and generation, text analysis, sentiment analysis, machine translation, etc. In past decades, only experts with an appropriate philological education could engage in natural language processing. Besides mathematics and machine learning, they had to be familiar with key linguistic concepts. Now, we can just use already-written NLP libraries. Their main purpose is to simplify text preprocessing, so we can focus on building machine learning models and fine-tuning hyperparameters. There are many tools and libraries designed to solve NLP problems. Today, we want to outline and compare the most popular and helpful natural language processing libraries, based on our experience. You should understand that all the libraries we look at have only partially overlapping tasks, so sometimes it is hard to compare them directly. We will walk through some features and compare only those libraries for which this is possible.
Autonomous cars are racing down the highway at speeds exceeding 100 MPH when suddenly a car a half-mile ahead blows out a tire sending dangerous debris across 3 lanes of traffic. Instead of relying upon sending this urgent, time-critical distress information to the world via the cloud, the cars on that particular section of the highway use peer-to-peer, immutable communications to inform all vehicles in the area of the danger so that they can slow down and move to unobstructed lanes (while also sending a message to the nearest highway maintenance robots to remove the debris).

### Book Memo: “R Markdown”

R Markdown: The Definitive Guide is the first official book authored by the core R Markdown developers that provides a comprehensive and accurate reference to the R Markdown ecosystem. With R Markdown, you can easily create reproducible data analysis reports, presentations, dashboards, interactive applications, books, dissertations, websites, and journal articles, while enjoying the simplicity of Markdown and the great power of R and other languages. In this book, you will learn: • Basics: syntax of Markdown and R code chunks, how to generate figures and tables, and how to use other computing languages • Built-in output formats of R Markdown: PDF/HTML/Word/RTF/Markdown documents and ioslides/Slidy/Beamer/PowerPoint presentations • Extensions and applications: dashboards, Tufte handouts, xaringan/reveal.js presentations, websites, books, journal articles, and interactive tutorials • Advanced topics: parameterized reports, HTML widgets, document templates, custom output formats, and Shiny documents.

### Basic Generalised Linear Modelling – Part 1: Exercises

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

A generalized linear model (GLM) is a flexible generalization of an ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.
The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

GLMs can be split into three groups:
• Poisson regression for count data with no over- or under-dispersion issues
• Quasi-Poisson or negative binomial models where the data are overdispersed
• Logistic regression models where the response data are binary (e.g. present or absent; male or female) or proportional (e.g. percentages)

In this exercise, we will focus on GLMs that use Poisson regression. Please download the dataset for this exercise here. The dataset investigates the biogeographical determinants of ant species richness at a regional scale (Gotelli and Everson, 2002). The main purpose of this exercise is to replicate the Poisson regression of ant species richness against latitude, elevation, and habitat type from their paper.

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set and required package before running the exercise.

Exercise 1
Load the data and check the data structure using the scatterplotMatrix function. Assess its covariation and data patterning.

Exercise 2
Run the GLM model and run a VIF analysis to check for inflation. Pay attention to collinearity.

Exercise 3
If there are any issues with the covariation, try to center the predictor variables.

Exercise 4
Re-run the VIF analysis with the new variables.

Exercise 5
Check for influential data points and outliers using influence measures (Cook's distance) and create the plot. If the value is less than 1, it is OK to proceed.

Exercise 6
Check for overdispersion. It needs to be around 1 before moving to the next step.

Exercise 7
Check the model summary. What can we infer?

Exercise 8
Since we have many variables, we perform model averaging. The first step is to set the options in base R regarding missing values. Then try to assess which variables have a significant influence on the response variable. Here we include the latitude, elevation, and habitat variables to produce the best model.

Exercise 9
Check the validation plots.

Exercise 10
Produce the base plot and add the points of predicted values.
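Since the exercise dataset is not included here, the core modelling steps can be sketched against simulated data. All variable names below are illustrative assumptions, not the actual column names of the exercise dataset:

```r
# Simulated stand-in for the ant richness data; column names are
# illustrative only, not the actual names in the exercise dataset.
set.seed(42)
n <- 44
dat <- data.frame(
  latitude  = runif(n, 41, 45),
  elevation = runif(n, 0, 500),
  habitat   = factor(sample(c("bog", "forest"), n, replace = TRUE))
)
# Generate Poisson counts from a known linear predictor
eta <- 5 - 0.1 * dat$latitude - 0.001 * dat$elevation +
  0.3 * (dat$habitat == "forest")
dat$richness <- rpois(n, exp(eta))

# Poisson GLM of species richness against the predictors
fit <- glm(richness ~ latitude + elevation + habitat,
           data = dat, family = poisson)
summary(fit)

# Quick overdispersion check: residual deviance / residual df
# should be roughly 1 for a well-specified Poisson model
deviance(fit) / df.residual(fit)
```

The same `glm()` call, dispersion ratio, and diagnostic plots apply once the real dataset is loaded.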

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Magister Dixit

“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.” Steve Lohr ( Aug. 17, 2014 )

### nanotime 0.2.2

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new maintenance release of the nanotime package for working with nanosecond timestamps just arrived on CRAN.

nanotime uses the RcppCCTZ package for (efficient) high(er) resolution time parsing and formatting up to nanosecond resolution, and the bit64 package for the actual integer64 arithmetic. Initially implemented using the S3 system, it now uses a more rigorous S4-based approach thanks to a rewrite by Leonardo Silvestri.

This release re-disables tests for xts use. At some point we had hoped a new xts version would know what nanotime is. That xts version is out now, and it doesn’t. Our bad for making that assumption.

#### Changes in version 0.2.2 (2018-07-18)

• Unit tests depending on future xts behaviour remain disabled (Dirk in #41).

We also have a diff to the previous version thanks to CRANberries. More details and examples are at the nanotime page; code, issue tickets etc at the GitHub repository.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


### Document worth reading: “On the Spectral Bias of Deep Neural Networks”

It is well known that over-parametrized deep neural networks (DNNs) are an overly expressive class of functions that can memorize even random data with $100\%$ training accuracy. This raises the question why they do not easily overfit real data. To answer this question, we study deep networks using Fourier analysis. We show that deep networks with finite weights (or trained for a finite number of steps) are inherently biased towards representing smooth functions over the input space. Specifically, the magnitude of a particular frequency component ($k$) of a deep ReLU network function decays at least as fast as $\mathcal{O}(k^{-2})$, with width and depth helping polynomially and exponentially (respectively) in modeling higher frequencies. This shows for instance why DNNs cannot perfectly *memorize* peaky delta-like functions. We also show that DNNs can exploit the geometry of low-dimensional data manifolds to approximate complex functions that exist along the manifold with simple functions when seen with respect to the input space. As a consequence, we find that all samples (including adversarial samples) classified by a network to belong to a certain class are connected by a path such that the prediction of the network along that path does not change. Finally, we find that DNN parameters corresponding to functions with higher frequency components occupy a smaller volume in the parameter space. On the Spectral Bias of Deep Neural Networks

### Call Centre Workforce Planning Using Erlang C in R language

(This article was first published on The Devil is in the Data – The Lucid Manager, and kindly contributed to R-bloggers)

We all hate the experience of calling a service provider and being placed on hold for a very long time. Organisations that take their level of service seriously plan their call centres so that waiting times for customers are within acceptable limits. Having said this, making people wait for something can in some instances increase the level of perceived value.

Call centre performance can be expressed by the Grade of Service, which is the percentage of calls that are answered within a specific time, for example, 90% of calls are answered within 30 seconds. This Grade of Service depends on the volume of calls made to the centre, the number of available agents and the time it takes to process a contact. Although working in a call centre can be chaotic, the Erlang C formula describes the relationship between the Grade of Service and these variables quite accurately.

Call centre workforce planning is a complex activity that is a perfect problem to solve in R code. This article explains how to use the Erlang C formula in the R language to manage a contact centre by calculating the number of agents needed to meet a required Grade of Service. This approach is extended with a Monte Carlo simulation to better understand the stochastic nature of the real world.

## The Erlang C Formula

The Erlang C formula describes the probability that a customer needs to queue instead of being immediately serviced $(P_w)$. This formula is closely related to the Poisson distribution which describes queues such as traffic lights.

$P_w = \frac{\frac{A^N}{N!}\frac{N}{N-A}}{\Big( \sum_{i=0}^{N-1} \frac{A^i}{i!} \Big)+\frac{A^N}{N!}\frac{N}{N-A}}$

The intensity of traffic $A$ is the number of calls per hour multiplied by the average duration of a call. Traffic intensity is measured in dimensionless Erlang units which expresses the time it would take to manage all calls if they arrived sequentially. The intensity is a measure of the amount of effort that needs to be undertaken in an hour. In reality, calls arrive at random times during the hour, which is where the Poisson distribution comes in. The waiting time is also influenced by the number of available operators $N$. The intensity defines the minimum number of agents needed to manage the workload.
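As a sanity check, the Erlang C formula above can be transcribed directly into R using factorials. This is a sketch for intuition only: the factorials overflow for large $N$, which is why the implementation later in the post uses a loop instead.

```r
# Direct transcription of the Erlang C formula: the probability P_w that
# a caller has to wait, given traffic intensity A (in Erlangs) and N agents.
erlang_c_direct <- function(A, N) {
  top <- (A^N / factorial(N)) * (N / (N - A))
  top / (sum(A^(0:(N - 1)) / factorial(0:(N - 1))) + top)
}

# 10 Erlangs of traffic offered to 14 agents
erlang_c_direct(10, 14)
```

Adding agents should always reduce the waiting probability, which gives a quick way to test any implementation against this one.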

We can now deconstruct this formula in a common-sense way by saying that the level of service increases as the intensity (the combination of call volume and average duration) reduces and the number of operators increases. The more staff, the higher the level of service, but precisely how many people do you need to achieve your desired Grade of Service efficiently?

The Grade of Service $S$ is a function of the outcome of the Erlang C formula ($P_w$), the number of agents ($N$), the call intensity ($A$), the call duration ($\lambda$) and, lastly, the target answering time ($t$).

$S = 1 - P_w e^{-(N-A)(t/\lambda)}$

The Erlang C formula can be reworked to provide that answer. I sourced this formula from callcenterhelper.com but must admit that I don't fully understand it and will take it at face value.

We now have a toolset for call centre planning which we can implement in the R language.

## Erlang C in R

The Erlang C formula contains some factorials and powers, which become problematic when dealing with large call volumes or a large number of agents. The Multiple Precision Arithmetic package enables working with large integer factorials, but there is no need to wield such strong computing powers. To make life easier, the Erlang C formula includes the Erlang B formula, the inverse of which can be calculated using a small loop.

This implementation is very similar to an unpublished R package by Patrick Hubers, enhanced with work from callcenterhelper.com. This code contains four functions:

1. intensity: Determines intensity in Erlangs based on the rate of calls per interval, the total call handling time and the interval time in minutes. All functions default to an interval time of sixty minutes.
2. erlang_c: Calculates the Erlang C formula using the number of agents and the variables that determine intensity.
3. service_level: Calculates the service level. The inputs are the same as above plus the period for the Grade of Service in seconds.
4. resource: Seeks the number of agents needed to meet a Grade of Service. This function starts with the minimum number of agents (the intensity plus one agent) and keeps searching until it finds the number of agents that achieve the desired Grade of Service.

intensity <- function(rate, duration, interval = 60) {
  (rate / (60 * interval)) * duration
}

erlang_c <- function(agents, rate, duration, interval = 60) {
  int <- intensity(rate, duration, interval)
  erlang_b_inv <- 1
  for (i in 1:agents) {
    erlang_b_inv <- 1 + erlang_b_inv * i / int
  }
  erlang_b <- 1 / erlang_b_inv
  agents * erlang_b / (agents - int * (1 - erlang_b))
}

service_level <- function(agents, rate, duration, target, interval = 60) {
  pw <- erlang_c(agents, rate, duration, interval)
  int <- intensity(rate, duration, interval)
  1 - (pw * exp(-(agents - int) * (target / duration)))
}

resource <- function(rate, duration, target, gos_target, interval = 60) {
  # Accept the Grade of Service target as a percentage (e.g. 80) or a fraction
  if (gos_target > 1) gos_target <- gos_target / 100
  agents <- round(intensity(rate, duration, interval) + 1)
  gos <- service_level(agents, rate, duration, target, interval)
  while (gos < gos_target) {
    agents <- agents + 1
    gos <- service_level(agents, rate, duration, target, interval)
  }
  c(agents, gos)
}


## Call Centre Workforce Planning Using an Erlang C Monte Carlo Simulation

Some years ago, I used the Erlang C model to recommend staffing levels in a contact centre. What this taught me is that the mathematical model is only the first step towards call centre workforce planning. There are several other metrics that can be built on the Erlang C model, such as average occupancy of agents and average handling time.

The Erlang C formula is, like all mathematical models, an idealised version of reality. Agents are not always available; they need breaks, toilet stops and might even go on leave. Employers call this loss of labour shrinkage, which is a somewhat negative term to describe something positive for the employee. The Erlang C model provides you with the number of ‘bums on seats’.

The Erlang C formula is, like every model, not a perfect representation of reality. The formula tends to overestimate the required resources because it assumes that people will stay on hold indefinitely, while in reality the queue automatically shortens as people lose patience.

The number of employees needed to provide this capacity depends on the working conditions at the call centre. For example, if employees are only available to take calls 70% of their contracted time, you will need $1/0.7=1.4$ staff members for each live agent to meet the Grade of Service.
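That shrinkage adjustment is a one-line calculation. A quick sketch with assumed numbers (both figures are illustrative):

```r
live_agents  <- 14    # concurrent agents required on the phones (illustrative)
availability <- 0.7   # proportion of contracted time an agent can take calls

# Staff members needed on the roster to keep 14 live agents available
headcount <- live_agents / availability
headcount
# [1] 20
```

In practice you would round this up to whole people and build in leave and attrition on top.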

Another problem is the stochastic nature of call volumes and handling times. The Erlang C model requires a manager to estimate call volume and handling time (intensity) as a static variable, while in reality, it is stochastic and subject to variation. Time series analysis can help to predict call volumes, but every prediction has a degree of uncertainty. We can manage this uncertainty by using a Monte Carlo simulation.

All the functions listed above are rewritten so that they provide a vector of possible answers based on the average call volume and duration and their standard deviation. This simulation assumes a normal distribution for both call volume and the length of each call. The outcome of this simulation is a distribution of service levels.

### Monte Carlo Simulation

For example, a call centre receives on average 100 calls per half hour with a standard deviation of 10 calls. The average time to manage a call, including wrap-up time after the call, is 180 seconds with a standard deviation of 20 seconds. The centre needs to answer 80% of calls within 20 seconds. What is the likelihood of achieving this level of service?

The average intensity of this scenario is 10 Erlangs. Using the resource formula suggests that we need 14 agents to meet the Grade of Service. Simulating the intensity of the scenario 1000 times suggests we need between 6 and 16 agents to manage this workload.

> resource(100, 180, 20, 80, 30)
[1] 14.0000000  0.88835
> intensity_mc(100, 10, 180, 20) %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.480 8.975 9.939 10.025 10.993 15.932


The next step is to simulate the expected service level for this scenario. The plot visualises the outcome of the Monte Carlo simulation and shows that in 95% of situations the Grade of Service is more than 77%, and half the time it is more than 94%.


> service_level_mc(15, 100, 10, 180, 20, 20, 30, sims = 1000) %>%
+ quantile(c(.05, .5, .95))
5%        50%       95%
0.7261052 0.9427592 0.9914338


This article shows that using Erlang C in R helps managers with call centre workforce planning. Perhaps we need a Shiny application to develop a tool that manages the complexity of these functions. I would love to hear from people with practical experience in managing call centres about how they analyse their data.

library(tidyverse)

intensity_mc <- function(rate_m, rate_sd, duration_m, duration_sd,
                         interval = 60, sims = 1000) {
  (rnorm(sims, rate_m, rate_sd) / (60 * interval)) *
    rnorm(sims, duration_m, duration_sd)
}

intensity_mc(100, 10, 180, 20, interval = 30) %>% summary()

erlang_c_mc <- function(agents, rate_m, rate_sd, duration_m, duration_sd,
                        interval = 60, sims = 1000) {
  int <- intensity_mc(rate_m, rate_sd, duration_m, duration_sd, interval, sims)
  erlang_b_inv <- 1
  for (i in 1:agents) {
    erlang_b_inv <- 1 + erlang_b_inv * i / int
  }
  erlang_b <- 1 / erlang_b_inv
  agents * erlang_b / (agents - int * (1 - erlang_b))
}

service_level_mc <- function(agents, rate_m, rate_sd, duration_m, duration_sd,
                             target, interval = 60, sims = 1000) {
  pw <- erlang_c_mc(agents, rate_m, rate_sd, duration_m, duration_sd,
                    interval, sims)
  int <- intensity_mc(rate_m, rate_sd, duration_m, duration_sd, interval, sims)
  1 - (pw * exp(-(agents - int) * (target / rnorm(sims, duration_m, duration_sd))))
}

data_frame(ServiceLevel = service_level_mc(agents = 12,
                                           rate_m = 100,
                                           rate_sd = 10,
                                           duration_m = 180,
                                           duration_sd = 20,
                                           target = 20,
                                           interval = 30,
                                           sims = 1000)) %>%
  ggplot(aes(ServiceLevel)) +
  geom_histogram(binwidth = 0.1, fill = "#008da1")


The post Call Centre Workforce Planning Using Erlang C in R language appeared first on The Lucid Manager.


### Randomize by, or within, cluster?

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

I am involved with a stepped-wedge designed study that is exploring whether we can improve care for patients with end-stage disease who show up in the emergency room. The plan is to train nurses and physicians in palliative care. (A while ago, I described what the stepped wedge design is.)

Under this design, 33 sites around the country will receive the training at some point, which is no small task (and fortunately, as the statistician, this is a part of the study in which I have little involvement). After hearing about this ambitious plan, a colleague asked why we didn't just randomize half the sites to the intervention and conduct a more standard cluster randomized trial, where a site would either get the training or not. I quickly simulated some data to see what we would give up (or gain) if we had decided to go that route. (It is actually a moot point, since there would be no way to simultaneously train 16 or so sites, which is why we opted for the stepped-wedge design in the first place.)

I simplified things a bit by comparing randomization within site with randomization by site. The stepped wedge design is essentially a within-site randomization, except that the two treatment arms are defined at different time points, and things are complicated a bit because there might be time by intervention confounding. But, I won’t deal with that here.

### Simulate data

library(simstudy)

# define data

cvar <- iccRE(0.20, dist = "binary")

d <- defData(varname = "a", formula = 0, variance = cvar,
             dist = "normal", id = "cid")
d <- defData(d, varname = "nper", formula = 100, dist = "nonrandom")

da <- defDataAdd(varname = "y", formula = "-1 + .4*rx + a",
                 dist = "binary", link = "logit")

### Randomize within cluster

set.seed(11265)

dc <- genData(100, d)

di <- genCluster(dc, "cid", "nper", "id")
di <- trtAssign(di, strata = "cid", grpName = "rx")
di <- addColumns(da, di)  # generate the outcome y from the definition above

di
##           id rx cid          a nper y
##     1:     1  1   1 -0.4389391  100 1
##     2:     2  0   1 -0.4389391  100 0
##     3:     3  1   1 -0.4389391  100 0
##     4:     4  0   1 -0.4389391  100 0
##     5:     5  0   1 -0.4389391  100 1
##    ---
##  9996:  9996  0 100 -1.5749783  100 0
##  9997:  9997  1 100 -1.5749783  100 0
##  9998:  9998  0 100 -1.5749783  100 0
##  9999:  9999  1 100 -1.5749783  100 0
## 10000: 10000  1 100 -1.5749783  100 0

I fit a conditional mixed effects model, and then manually calculate the conditional log odds from the data just to give a better sense of what the conditional effect is (see earlier post for more on conditional vs. marginal effects).

library(lme4)
rndTidy(glmer(y ~ rx + (1 | cid), data = di, family = binomial))
##                 term estimate std.error statistic p.value group
## 1        (Intercept)    -0.86      0.10     -8.51       0 fixed
## 2                 rx     0.39      0.05      8.45       0 fixed
## 3 sd_(Intercept).cid     0.95        NA        NA      NA   cid
calc <- di[, .(estp = mean(y)), keyby = .(cid, rx)]
calc[, lo := log(odds(estp))]
calc[rx == 1, mean(lo)] - calc[rx == 0, mean(lo)] 
## [1] 0.3985482

Next, I fit a marginal model and calculate the effect manually as well.

library(geepack)
rndTidy(geeglm(y ~ rx, data = di, id = cid, corstr = "exchangeable",
family = binomial))
##          term estimate std.error statistic p.value
## 1 (Intercept)    -0.74      0.09     67.09       0
## 2          rx     0.32      0.04     74.80       0
log(odds(di[rx==1, mean(y)])/odds(di[rx==0, mean(y)]))
## [1] 0.323471

As expected, the marginal estimate of the effect is less than the conditional effect.

### Randomize by cluster

Next we repeat all of this, though randomization is at the cluster level.

dc <- genData(100, d)
dc <- trtAssign(dc, grpName = "rx")

di <- genCluster(dc, "cid", "nper", "id")
di <- addColumns(da, di)  # generate the outcome y from the definition above

di
##        cid rx          a nper    id y
##     1:   1  0  0.8196365  100     1 0
##     2:   1  0  0.8196365  100     2 1
##     3:   1  0  0.8196365  100     3 0
##     4:   1  0  0.8196365  100     4 0
##     5:   1  0  0.8196365  100     5 0
##    ---
##  9996: 100  1 -0.1812079  100  9996 1
##  9997: 100  1 -0.1812079  100  9997 0
##  9998: 100  1 -0.1812079  100  9998 0
##  9999: 100  1 -0.1812079  100  9999 1
## 10000: 100  1 -0.1812079  100 10000 0

Here is the conditional estimate of the effect:

rndTidy(glmer(y~rx + (1|cid), data = di, family = binomial))
##                 term estimate std.error statistic p.value group
## 1        (Intercept)    -0.71      0.15     -4.69    0.00 fixed
## 2                 rx     0.27      0.21      1.26    0.21 fixed
## 3 sd_(Intercept).cid     1.04        NA        NA      NA   cid

And here is the marginal estimate

rndTidy(geeglm(y ~ rx, data = di, id = cid, corstr = "exchangeable",
family = binomial))
##          term estimate std.error statistic p.value
## 1 (Intercept)    -0.56      0.13     18.99    0.00
## 2          rx     0.21      0.17      1.46    0.23

While the within- and by-site randomization estimates are quite different, we haven’t really learned anything, since those differences could have been due to chance. So, I created 500 data sets under different assumptions to see what the expected estimate would be as well as the variability of the estimate.

### Fixed ICC, varied randomization

From this first set of simulations, the big take away is that randomizing within clusters provides an unbiased estimate of the conditional effect, but so does randomizing by site. The big disadvantage of randomizing by site is the added variability of the conditional estimate. The attenuation of the marginal effect estimates under both scenarios has nothing to do with randomization, but results from intrinsic variability across sites.

### Fixed randomization, varied ICC

This next figure isolates the effect of across-site variability on the estimates. In this case, randomization is only by site (i.e., no within-site randomization), but the ICC is set at 0.05 and 0.20. For the conditional model, the ICC has no impact on the expected value of the log-odds ratio, but when variability is higher (ICC = 0.20), the standard error of the estimate increases. For the marginal model, the ICC has an impact on both the expected value and the standard error of the estimate. In the case with a low ICC (top row in plot), the marginal and conditional estimates are quite similar.
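For reference, in a random-intercept logistic model the ICC is usually defined on the latent (logistic) scale, where the residual variance is π²/3. A small sketch (my own illustration, not code from the post) converting between the ICC and the random-intercept variance used in such simulations:

```python
import math

LOGISTIC_RESID_VAR = math.pi ** 2 / 3  # latent-scale residual variance, ~3.29

def icc_from_variance(sigma2_b):
    # ICC = between-cluster variance / total latent variance
    return sigma2_b / (sigma2_b + LOGISTIC_RESID_VAR)

def variance_from_icc(icc):
    # Invert the ICC formula to recover the random-intercept variance
    return icc / (1 - icc) * LOGISTIC_RESID_VAR

# The two scenarios in the figure:
print(round(variance_from_icc(0.05), 3))  # ~0.173
print(round(variance_from_icc(0.20), 3))  # ~0.822
```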

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Amazon Alexa and Accented English

(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

Earlier this spring, one of my data science friends here in SLC got in contact with me about some fun analysis. My friend Dylan Zwick is a founder at Pulse Labs, a voice-testing startup, and they were chatting with the Washington Post about a piece on how devices like Amazon Alexa deal with accented English. The piece is published today in the Washington Post and turned out really interesting! Let’s walk through the analysis I did for Dylan and Pulse Labs.

## Understanding the data

Dylan shared voice testing results data with me via Google Sheets. The dataset included the phrase that each speaker spoke aloud, the transcription of the phrase that the Alexa device understood, and a categorization for each speaker’s accent.

library(tidyverse)
library(googlesheets)  # gs_title() / gs_read()
library(stringdist)
library(knitr)         # kable()

alexa_raw <- gs_title("Alexa Speech to Text by Accent Data") %>%
    gs_read(verbose = FALSE) %>%
    set_names("truth", "measured", "accent", "example")

What do a few examples look like?

alexa_raw %>%
    sample_n(3) %>%
    select(truth, measured, accent) %>%
    kable()

| truth | measured | accent |
|-------|----------|--------|
| China proposes removal of two term limit potentially paving way for Xi to remain President | china proposes removal of to term limit potentially paving way for gh remain president | 1 |
| As winter games close team USA falls short of expectations | ask winter games close team usa fall short of expectations | 1 |
| China proposes removal of two term limit potentially paving way for Xi to remain President | china china proposes removal of to time limit potentially paving way to remain president | 1 |

The truth column here contains the phrase that the speaker was instructed to read (there are three separate test phrases), while the measured column contains the text as it was transcribed by Alexa. The accent column is a numeric coding (1, 2, or 3) for the three categories of accented English in this dataset. The three categories are US flat (which would be typical broadcast English in the US, often encountered in the West and Midwest), a native speaker accent (these folks included Southern US accents and accents from Britain and Australia), and a non-native speaker accent (individuals for whom English is not their first language).

alexa <- alexa_raw %>%
    mutate(accent = case_when(accent == 1 ~ "US flat",
                              accent == 2 ~ "Native speaker accent",
                              accent == 3 ~ "Non-native speaker accent"),
           accent = factor(accent, levels = c("US flat",
                                              "Native speaker accent",
                                              "Non-native speaker accent")),
           example = case_when(example == "X" ~ TRUE,
                               TRUE ~ FALSE),
           truth = str_to_lower(truth),
           measured = str_to_lower(measured)) %>%
    filter(truth != "phrase",
           truth != "") %>%
    mutate(distance = stringdist(truth, measured, "lv"))

How many recordings from an Alexa device do we have data for, for each accent?

alexa %>%
    count(accent) %>%
    kable()

| accent | n |
|--------|---|
| US flat | 46 |
| Native speaker accent | 33 |
| Non-native speaker accent | 20 |

This is a pretty small sample; we would be able to make stronger conclusions with more recordings.

## Visualizations

Let’s look at the string distance between each benchmark phrase (the phrase that the speaker intended to speak) and the speech-to-text output from Alexa. We can think about this metric as the difference between what the speaker said and what Alexa heard.
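The "lv" method passed to stringdist() is the classic Levenshtein edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A minimal Python sketch of the same metric (for illustration only; the post itself uses the stringdist R package):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]

# e.g. "two term limit" vs. the misheard "to term limit" differ by one edit
print(levenshtein("two term limit", "to term limit"))  # 1
```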

alexa %>%
    ggplot(aes(accent, distance, fill = accent, color = accent)) +
    geom_boxplot(alpha = 0.2, size = 1.5) +
    labs(x = NULL, y = "String distance (Levenshtein distance)",
         title = "How well does Alexa understand different accents?",
         subtitle = "Speech with non-native accents is converted to text with the lowest accuracy") +
    theme(legend.position = "none")

I used the Levenshtein distance, but the results are robust to other string distance measures.

alexa %>%
    group_by(accent) %>%
    summarise(distance = median(distance)) %>%
    ggplot(aes(accent, distance, fill = accent)) +
    geom_col(alpha = 0.8) +
    geom_text(aes(x = accent, y = 0.5, label = accent), color = "white",
              family = "IBMPlexSans-Medium", size = 7, hjust = 0) +
    labs(x = NULL, y = "String distance between phrase and speech-to-text output (median Levenshtein distance)",
         title = "How well does Alexa understand English speakers with different accents?",
         subtitle = "Speech with non-native accents is converted to text with the lowest accuracy") +
    scale_y_continuous(expand = c(0, 0)) +
    theme(axis.text.y = element_blank(),
          legend.position = "none") +
    coord_flip()

We can see here that the median difference is higher, by over 30%, for speakers with non-native-speaking accents. There is no difference for speakers with accents like British or Southern accents. That result looks pretty convincing, and certainly lines up with what other groups in the WashPo piece found, but it’s based on quite a small sample. Let’s try a statistical test.

## Statistical tests

Let’s first compare the native speaker accent group to the US flat group, and then the non-native speakers to the US flat group.

t.test(distance ~ accent, data = alexa %>% filter(accent != "Non-native speaker accent"))
##
##  Welch Two Sample t-test
##
## data:  distance by accent
## t = -0.55056, df = 60.786, p-value = 0.584
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.875468  2.202214
## sample estimates:
##               mean in group US flat mean in group Native speaker accent
##                            8.739130                            9.575758
t.test(distance ~ accent, data = alexa %>% filter(accent != "Native speaker accent"))
##
##  Welch Two Sample t-test
##
## data:  distance by accent
## t = -1.3801, df = 25.213, p-value = 0.1797
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.125065  1.603326
## sample estimates:
##                   mean in group US flat
##                                 8.73913
## mean in group Non-native speaker accent
##                                12.00000

Performing some t-tests indicates that the group of speakers with flat accents and those with native speaker accents (Southern, British, etc.) are not different from each other; notice how big the p-value is (almost 0.6).

The situation is not clear for the comparison of the speakers with flat accents and those with non-native speaker accents, either. The p-value is about 0.18, higher than normal statistical cutoffs. It would be better to have more data to draw clear conclusions. Let’s do a simple power calculation to estimate how many measurements we would need to measure a difference this big (~30%, or ~3 on the string distance scale).
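As a rough cross-check of the power.t.test() result below, the same sample size can be derived from the normal-approximation formula n = (z_{1-α/2} + z_{power})² · 2σ² / δ². This Python sketch is my own illustration, not part of the original post; the t-based calculation in R lands slightly higher:

```python
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sample
    comparison of means: n = (z_{1-a/2} + z_{power})^2 * 2*sd^2 / delta^2."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2 * 2 * sd ** 2 / delta ** 2

# Difference of ~3 on the string-distance scale, sd ~7.28:
print(round(n_per_group(3, 7.28)))  # ~92 per group; R's t-based answer is ~93
```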

power.t.test(delta = 3, sd = sd(alexa$distance), sig.level = 0.05, power = 0.8)
##
##      Two-sample t test power calculation
##
##              n = 93.37079
##          delta = 3
##             sd = 7.278467
##      sig.level = 0.05
##          power = 0.8
##    alternative = two.sided
##
## NOTE: n is number in *each* group

This indicates we would need on the order of 90 examples per group (instead of the 20 to 40 that we have) to measure the ~30% difference we see with statistical significance. That may be a lot of voice testing to do for a single newspaper article but would be necessary to make strong statements. This dataset shows how complicated the landscape for these devices is. Check out the piece online (which includes quotes from Kaggle’s Rachael Tatman) and let me know if you have any feedback or questions!

To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.

Continue Reading…

## July 18, 2018

### If you have a measure, it will be gamed (politics edition).

They sometimes call it Campbell’s Law:

New York Governor Andrew Cuomo is not exactly known for drumming up grassroots enthusiasm and small donor contributions, so it was quite a surprise on Monday when his reelection campaign reported that more than half of his campaign contributors this year gave $250 or less.

But wait—a closer examination of those donations reveals a very odd fact: 69 of them came from just one person, Christopher Kim.

Even odder, it appears Kim lives at the same address as one of Cuomo’s aides! . . .

1) Cuomo has testily fielded questions from reporters about his donor base and that of his primary opponent, Cynthia Nixon, who loves to needle him over his cozy relationships with rich donors, and who also, in March, told the Buffalo News, “In one day of fundraising I received more small donor [contributions] than Andrew Cuomo received in seven years.”

2) All at once, Cuomo’s campaign got an influx of small donations from someone who appears to share an address with a Cuomo aide. . . .

$1 donations, huh? What the campaign should really do is set up a set of booths where you can just drop a quarter in a slot to make your campaign donation. They could put them in laundromats . . . Hey—do laundromats still take quarters? It’s been a long time since I’ve been in one! Maybe, ummm, I dunno, an arcade?

### If you did not already know

TensorFlow Hub
TensorFlow Hub is a library to foster the publication, discovery, and consumption of reusable parts of machine learning models. A module is a self-contained piece of a TensorFlow graph, along with its weights and assets, that can be reused across different tasks in a process known as transfer learning. Modules contain variables that have been pre-trained for a task using a large dataset. By reusing a module on a related task, you can:
• train a model with a smaller dataset,
• improve generalization, or
• significantly speed up training. …

AEQUITAS
Fairness is a critical trait in decision making. As machine-learning models are increasingly being used in sensitive application domains (e.g. education and employment) for decision making, it is crucial that the decisions computed by such models are free of unintended bias. But how can we automatically validate the fairness of arbitrary machine-learning models? For a given machine-learning model and a set of sensitive input parameters, our AEQUITAS approach automatically discovers discriminatory inputs that highlight fairness violations. At the core of AEQUITAS are three novel strategies to employ probabilistic search over the input space with the objective of uncovering fairness violations. Our AEQUITAS approach leverages the inherent robustness property in common machine-learning models to design and implement scalable test generation methodologies. An appealing feature of our generated test inputs is that they can be systematically added to the training set of the underlying model and improve its fairness. To this end, we design a fully automated module that guarantees to improve the fairness of the underlying model. We implemented AEQUITAS and we have evaluated it on six state-of-the-art classifiers, including a classifier that was designed with fairness constraints. We show that AEQUITAS effectively generates inputs to uncover fairness violations in all the subject classifiers and systematically improves the fairness of the respective models using the generated test inputs. In our evaluation, AEQUITAS generates up to 70% discriminatory inputs (w.r.t. the total number of inputs generated) and leverages these inputs to improve the fairness up to 94%. …

QA4IE
Information Extraction (IE) refers to automatically extracting structured relation tuples from unstructured texts. Common IE solutions, including Relation Extraction (RE) and open IE systems, can hardly handle cross-sentence tuples, and are severely restricted by limited relation types as well as informal relation specifications (e.g., free-text based relation tuples). In order to overcome these weaknesses, we propose a novel IE framework named QA4IE, which leverages the flexible question answering (QA) approaches to produce high quality relation triples across sentences. Based on the framework, we develop a large IE benchmark with high quality human evaluation. This benchmark contains 293K documents, 2M golden relation triples, and 636 relation types. We compare our system with some IE baselines on our benchmark and the results show that our system achieves great improvements. …

### Math for Machine Learning: Open Doors to Data Science and Artificial Intelligence

This ebook explains the math involved and introduces you directly to the foundational topics in machine learning.

### “For professional baseball players, faster hand-eye coordination linked to batting performance”

Kevin Lewis sends along this press release reporting what may be the least surprising laboratory finding since the classic “Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001]."

### R Packages worth a look

RF Variable Importance for Arbitrary Measures (varImp)
Computes the random forest variable importance (VIMP) for the conditional inference random forest (cforest) of the ‘party’ package. Includes a function …

Efficient Sampling Truncated Scale of Normals with Constraints (tmvmixnorm)
Efficient sampling of truncated multivariate (scale) mixtures of normals under linear inequality constraints is nontrivial due to the analytically intr …

Simulation with Kernel Density Estimation (simukde)
Generates random values from a univariate and multivariate continuous distribution by using kernel density estimation based on a sample. Duong (2017) & …

Spatially Varying Change Points (spCP)
Implements a spatially varying change point model with unique intercepts, slopes, variance intercepts and slopes, and change points at each location. I …

Resistant Procrustes Superimposition (RPS)
Based on RPS tools, a rather complete resistant shape analysis of 2D and 3D datasets based on landmarks can be performed. In addition, landmark-based r …

### Scaling Our Private Portals with Open edX and Docker

Ever since we launched, Cognitive Class has hit many milestones. From name changes (raise your hand if you remember DB2 University) to our 1,000,000th learner, we’ve been through a lot.

But in this post, I will focus on the milestones and evolution of the technical side of things, specifically how we went from a static infrastructure to a dynamic and scalable deployment of dozens of Open edX instances using Docker.

## Open edX 101

Open edX is the open source code behind edx.org. It is composed of several repositories, edx-platform being the main one. The official method of deploying an Open edX instance is by using the configuration repo which uses Ansible playbooks to automate the installation. This method requires access to a server where you run the Ansible playbook. Once everything is done you will have a brand new Open edX deployment at your disposal.

This is how we have run cognitiveclass.ai, our public website, since we migrated from a Moodle deployment to Open edX in 2015. It has served us well, and we are able to serve hundreds of concurrent learners across more than 70 courses every day.

But this strategy didn’t come without its challenges:

• Open edX mainly targets Amazon’s AWS services and we run our infrastructure on IBM Cloud.
• Deploying a new instance requires creating a new virtual machine.
• Open edX reads configurations from JSON files stored in the server, and each instance must keep these files synchronized.

While we were able to overcome these in a large single deployment, they would be much harder to manage for our new offering, the Cognitive Class Private Portals.

When presenting to other companies, we often hear the same question: “How can I make this content available to my employees?” That was the main motivation behind our Private Portals offer.

A Private Portal represents a dedicated deployment created specifically for a client. From a technical perspective, this new offering would require us to spin up new deployments quickly and on-demand. Going back to the points highlighted earlier, numbers two and three are especially challenging as the number of deployments grows.

Creating and configuring a new VM for each deployment is a slow and costly process. And if a particular Portal outgrows its resources, we would have to find a way to scale it and manage its configuration across multiple VMs.

## Enter Docker

At the same time, we were experiencing a similar demand in our Virtual Labs infrastructure, where the use of hundreds of VMs was becoming unbearable. The team started to investigate and implement a solution based on Docker.

The main benefits of Docker for us were twofold:

• Increase server usage density;
• Isolate services processes and files from each other.

These benefits are deeply related: since each container manages its own runtime and files we are able to easily run different pieces of software on the same server without them interfering with each other. We do so with a much lower overhead compared to VMs since Docker provides a lightweight isolation between them.

By increasing usage density, we are able to run thousands of containers on a handful of larger servers that can be pre-provisioned ahead of time, instead of having to manage thousands of smaller instances.

For our Private Portals offering this means that a new deployment can be ready to be used in minutes. The underlying infrastructure is already in place so we just need to start some containers, which is a much faster process.

## Herding containers with Rancher

Docker in and of itself is a fantastic technology but for a highly scalable distributed production environment, you need something on top of it to manage your containers’ lifecycle. Here at Cognitive Class, we decided to use Rancher for this, since it allows us to abstract our infrastructure and focus on the application itself.

In a nutshell, Rancher organizes containers into services and services are grouped into stacks. Stacks are deployed to environments, and environments have hosts, which are the underlying servers where containers are eventually started. Rancher takes care of creating a private network across all the hosts so they can communicate securely with each other.

## Getting everything together

Our Portals are organized in a micro-services architecture and grouped together in Rancher as a stack. Open edX is the main component and is itself broken into smaller services. On top of Open edX we have several other components that provide additional functionality to our offering. Overall, this is how things look in Rancher:

There is a lot going on here, so let’s break it down and quickly explain each piece:

• Open edX
• lms: this is where students access courses content
• cms: used for authoring courses
• forum: handles course discussions
• nginx: serves static assets
• rabbitmq: message queue system
• glados: admin users interface to control and customize the Portal
• companion-cube: API to expose extra functionalities of Open edX
• compete: service to run data hackathons
• learner-support: built-in learner ticket support system
• lp-certs: issue certificates for students that complete multiple courses
• Support services
• cms-workers and lms-workers: execute background tasks for lms and cms
• glados-worker: execute background tasks for glados
• letsencrypt: automatically manages SSL certificates using Let’s Encrypt
• load-balancer: routes traffic to services based on request hostname
• mailer: proxy SMTP requests to an external server or sends emails itself otherwise
• ops: group of containers used to run specific tasks
• rancher-cron: starts containers following a cron-like schedule
• Data storage
• elasticsearch
• memcached
• mongo
• mysql
• redis

The ops service behaves differently from the other ones, so let’s dig a bit deeper into it:

Here we can see that there are several containers inside ops and that they are usually not running. Some containers, like edxapp-migrations, run when the Portal is deployed but are not expected to be started again unless in special circumstances (such as if the database schema changes). Other containers, like backup, are started by rancher-cron periodically and stop once they are done.

In both cases, we can trigger a manual start by clicking the play button. This gives us the ability to easily run important operational tasks on demand, without having to worry about SSHing into specific servers and figuring out which script to run.

## Handling files

One key aspect of Docker is that the file system is isolated per container. This means that, without proper care, you might lose important files if a container dies. The way to handle this situation is to use Docker volumes to mount local file system paths into the containers.

Moreover, when you have multiple hosts, it is best to have a shared data layer to avoid creating implicit scheduling dependencies between containers and servers. In other words, you want your containers to have access to the same files no matter which host they are running on.

In our infrastructure we use an IBM Cloud NFS drive that is mounted in the same path in all hosts. The NFS is responsible for storing any persistent data generated by the Portal, from database files to compiled static assets, such as images, CSS and JavaScript files.

Each Portal has its own directory in the NFS drive and the containers mount the directory of that specific Portal. So it’s impossible for one Portal to access the files of another one.

One of the most important files is ansible_overrides.yml. As we mentioned at the beginning of this post, Open edX is configured using JSON files that are read when the process starts. The Ansible playbook generates these JSON files when executed.

To propagate changes made by Portal admins on glados to the lms and cms of Open edX we mount ansible_overrides.yml into the containers. When something changes, glados can write the new values into this file and lms and cms can read them.

We then restart the lms and cms containers which are set to run the Ansible playbook and re-generate the JSON files on start up. ansible_overrides.yml is passed as a variables file to Ansible so that any values declared in there will override the Open edX defaults.
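To make this concrete, here is a hypothetical ansible_overrides.yml; the keys below are illustrative stand-ins, not necessarily the exact variables a real deployment would override:

```yaml
# Hypothetical overrides written by glados and picked up by the Ansible
# playbook when the lms/cms containers restart (illustrative keys only):
EDXAPP_PLATFORM_NAME: "Acme Corp Private Portal"
EDXAPP_TIME_ZONE: "America/Toronto"
EDXAPP_FEATURES:
  ALLOW_PUBLIC_ACCOUNT_CREATION: false
```

Because Ansible merges this variables file last, anything declared here wins over the playbook defaults.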

By having this shared data layer, we don’t have to worry about containers being rescheduled to another host since we are sure Docker will be able to find the proper path and mount the required volumes into the containers.

## Conclusion

By building on top of the lessons we learned as our platform evolved and by using the latest technologies available, we were able to build a fast, reliable and scalable solution to provide our students and clients a great learning experience.

We covered a lot in this post and I hope you were able to learn something new today. If you are interested in learning more about our Private Portals offering, fill out our application form and we will contact you.

Happy learning.

The post Scaling Our Private Portals with Open edX and Docker appeared first on Cognitive Class.

### Top KDnuggets tweets, Jul 11-17: Foundations of Machine Learning – A Bloomberg course; The 5 Clustering Algorithms Data Scientists Need to Know

Also: Bayesian Machine Learning, Explained; Is Google Tensorflow Object Detection API the Easiest Way to Implement Image Recognition?; Data Science of Variable Selection: A Review; 7 Steps to Understanding Deep Learning

### The whole is greater than the sum of its parts

Christopher Ferris says Hyperledger was formed to help deliver blockchain technology for the enterprise. Two and a half years later, that goal is being realized.

Continue reading The whole is greater than the sum of its parts .

### Recognizing cultural bias in AI

Camille Eddy explains what we can do to create culturally sensitive computer intelligence and why that's important for the future of AI.

Continue reading Recognizing cultural bias in AI.

### Open source and open standards in the age of cloud AI

Tim O'Reilly looks at how we can extend the values and practices of open source in the age of AI, big data, and cloud computing.

Continue reading Open source and open standards in the age of cloud AI.

### Live coding: OSCON edition

Suz Hinton live codes an entertaining hardware solution in front of your eyes.

Continue reading Live coding: OSCON edition.

### Highlights from the O'Reilly OSCON Conference in Portland 2018

Watch highlights covering open source, AI, cloud, and more. From the O'Reilly OSCON Conference in Portland 2018.

People from across the open source world are coming together in Portland, Oregon for the O'Reilly OSCON Conference. Below you'll find links to highlights from the event.

## Open source and open standards in the age of cloud AI

Tim O'Reilly looks at how we can extend the values and practices of open source in the age of AI, big data, and cloud computing.

## Live coding: OSCON edition

Suz Hinton live codes an entertaining hardware solution in front of your eyes.

## Drive innovation and collaboration through open source projects

Ying Xiong explains how Huawei collaborates with industry leaders and innovates through open source projects.

## Recognizing cultural bias in AI

Camille Eddy explains what we can do to create culturally sensitive computer intelligence and why that's important for the future of AI.

## The whole is greater than the sum of its parts

Christopher Ferris says Hyperledger was formed to help deliver blockchain technology for the enterprise. Two and a half years later, that goal is being realized.

Continue reading Highlights from the O'Reilly OSCON Conference in Portland 2018.

### Drive innovation and collaboration through open source projects

Ying Xiong explains how Huawei collaborates with industry leaders and innovates through open source projects.

Continue reading Drive innovation and collaboration through open source projects.

### What’s new on arXiv

Recently, recurrent neural networks have become state-of-the-art in acoustic modeling for automatic speech recognition. Long short-term memory (LSTM) units are the most popular ones. However, alternative units like the gated recurrent unit (GRU) and its modifications have outperformed LSTM in some publications. In this paper, we compared five neural network (NN) architectures with various adaptation and feature normalization techniques. We have evaluated feature-space maximum likelihood linear regression, five variants of i-vector adaptation and two variants of cepstral mean normalization. Most adaptation and normalization techniques were developed for feed-forward NNs and, according to the results in this paper, not all of them also work with RNNs. For the experiments, we chose the well-known and available TIMIT phone recognition task. Phone recognition is much more sensitive to the quality of the acoustic model than a large-vocabulary task with a complex language model. We also published open-source scripts to easily replicate the results and to help continue the development.
Overlays have shown significant promise for field-programmable gate-arrays (FPGAs) as they allow for fast development cycles and remove many of the challenges of the traditional FPGA hardware design flow. However, this often comes with a significant performance burden resulting in very little adoption of overlays for practical applications. In this paper, we tailor an overlay to a specific application domain, and we show how we maintain its full programmability without paying for the performance overhead traditionally associated with overlays. Specifically, we introduce an overlay targeted for deep neural network inference with only ~1% overhead to support the control and reprogramming logic using a lightweight very-long instruction word (VLIW) network. Additionally, we implement a sophisticated domain specific graph compiler that compiles deep learning languages such as Caffe or Tensorflow to easily target our overlay. We show how our graph compiler performs architecture-driven software optimizations to significantly boost performance of both convolutional and recurrent neural networks (CNNs/RNNs) – we demonstrate a 3x improvement on ResNet-101 and a 12x improvement for long short-term memory (LSTM) cells, compared to naive implementations. Finally, we describe how we can tailor our hardware overlay, and use our graph compiler to achieve ~900 fps on GoogLeNet on an Intel Arria 10 1150 – the fastest ever reported on comparable FPGAs.
Measuring similarities between strings is central for many established and fast-growing research areas including information retrieval, biology, and natural language processing. The traditional approach for string similarity measurements is to define a metric over a word space that quantifies and sums up the differences between characters in two strings. The state-of-the-art in the area has, surprisingly, not evolved much during the last few decades. The majority of the metrics are based on a simple comparison between character and character distributions without consideration for the context of the words. This paper proposes a string metric that encompasses similarities between strings based on (1) the character similarities between the words, including non-standard and standard spellings of the same words, and (2) the context of the words. Our proposal is a neural network composed of a denoising autoencoder and what we call a context encoder specifically designed to find similarities between the words based on their context. The experimental results show that the resulting metric succeeds in 85.4% of the cases in finding the correct version of a non-standard spelling among the closest words, compared to 63.2% with the established Normalised-Levenshtein distance. Besides, we show that words used in similar contexts are, with our approach, calculated to be more similar than words with different contexts, which is a desirable property missing in established string metrics.
Entity linking is the task of mapping potentially ambiguous terms in text to their constituent entities in a knowledge base like Wikipedia. This is useful for organizing content, extracting structured data from textual documents, and in machine learning relevance applications like semantic search, knowledge graph construction, and question answering. Traditionally, this work has focused on text that has been well-formed, like news articles, but in common real world datasets such as messaging, resumes, or short-form social media, non-grammatical, loosely-structured text adds a new dimension to this problem. This paper presents Pangloss, a production system for entity disambiguation on noisy text. Pangloss combines a probabilistic linear-time key phrase identification algorithm with a semantic similarity engine based on context-dependent document embeddings to achieve better than state-of-the-art results (>5% in F1) compared to other research or commercially available systems. In addition, Pangloss leverages a local embedded database with a tiered architecture to house its statistics and metadata, which allows rapid disambiguation in streaming contexts and on-device disambiguation in low-memory environments such as mobile phones.
As machine learning (ML) systems become democratized, it becomes increasingly important to help users easily debug their models. However, current data tools are still primitive when it comes to helping users trace model performance problems all the way to the data. We focus on the particular problem of slicing data to identify subsets of the validation data where the model performs poorly. This is an important problem in model validation because the overall model performance can fail to reflect that of the smaller subsets, and slicing allows users to analyze the model performance at a more granular level. Unlike general techniques (e.g., clustering) that can find arbitrary slices, our goal is to find interpretable slices (which are easier to act on than arbitrary subsets) that are problematic and large. We propose Slice Finder, which is an interactive framework for identifying such slices using statistical techniques. Applications include diagnosing model fairness and fraud detection, where identifying slices that are interpretable to humans is crucial.
When customers are faced with the task of making a purchase in an unfamiliar product domain, it might be useful to provide them with an overview of the product set to help them understand what they can expect. In this paper we present and evaluate a method to summarise sets of products in natural language, focusing on the price range, common product features across the set, and product features that impact on price. In our study, participants reported that they found our summaries useful, but we found no evidence that the summaries influenced the selections made by participants.
In the last several years, Twitter has been adopted by companies as an alternative platform for interacting with customers to address their concerns. With the abundance of such unconventional conversation resources, the push for developing effective virtual agents is stronger than ever. To address this challenge, a better understanding of such customer service conversations is required. Lately, there have been several works proposing novel taxonomies for fine-grained dialogue acts as well as developing algorithms for automatic detection of these acts. The outcomes of these works provide stepping stones toward the ultimate goal of building efficient and effective virtual agents. But none of these works consider handling the notion of negation in the proposed algorithms. In this work, we developed an SVM-based dialogue act prediction algorithm for Twitter customer service conversations where negation handling is an integral part of the end-to-end solution. For negation handling, we propose several efficient heuristics as well as adopt recent state-of-the-art third-party machine learning based solutions. Empirically, we show the model's performance gain when handling negation compared to when we do not. Our experiments show that for informal text such as tweets, the heuristic-based approach is more effective.
The problem of quickest detection of dynamic events in networks is studied. At some unknown time, an event occurs, and a number of nodes in the network are affected by the event, in that they undergo a change in the statistics of their observations. It is assumed that the event is dynamic, in that it can propagate along the edges in the network, and affect more and more nodes with time. The event propagation dynamics is assumed to be unknown. The goal is to design a sequential algorithm that can detect a ‘significant’ event, i.e., when the event has affected no fewer than $\eta$ nodes, as quickly as possible, while controlling the false alarm rate. Fully connected networks are studied first, and the results are then extended to arbitrarily connected networks. The designed algorithms are shown to be adaptive to the unknown propagation dynamics, and their first-order asymptotic optimality is demonstrated as the false alarm rate goes to zero. The algorithms can be implemented with linear computational complexity in the network size at each time step, which is critical for online implementation. Numerical simulations are provided to validate the theoretical results.
Recommendation systems are an integral part of Artificial Intelligence (AI) and have become increasingly important in the growing age of commercialization in AI. Deep learning (DL) techniques for recommendation systems (RS) provide powerful latent-feature models for effective recommendation but suffer from the major drawback of being non-interpretable. In this paper we describe a framework for explainable temporal recommendations in a DL model. We consider an LSTM based Recurrent Neural Network (RNN) architecture for recommendation and a neighbourhood-based scheme for generating explanations in the model. We demonstrate the effectiveness of our approach through experiments on the Netflix dataset by jointly optimizing for both prediction accuracy and explainability.
This paper presents a novel data-driven approach for predicting the number of vegetation-related outages that occur in power distribution systems on a monthly basis. In order to develop an approach that is able to successfully fulfill this objective, there are two main challenges that ought to be addressed. The first challenge is to define the extent of the target area. An unsupervised machine learning approach is proposed to overcome this difficulty. The second challenge is to correctly identify the main causes of vegetation-related outages and to thoroughly investigate their nature. In this paper, these outages are categorized into two main groups: growth-related and weather-related outages, and two types of models, namely time series and non-linear machine learning regression models are proposed to conduct the prediction tasks, respectively. Moreover, various features that can explain the variability in vegetation-related outages are engineered and employed. Actual outage data, obtained from a major utility in the U.S., in addition to different types of weather and geographical data are utilized to build the proposed approach. Finally, a comprehensive case study is carried out to demonstrate how the proposed approach can be used to successfully predict the number of vegetation-related outages and to help decision-makers to detect vulnerable zones in their systems.
Containers, enabling lightweight environment and performance isolation, fast and flexible deployment, and fine-grained resource sharing, have gained popularity in better application management and deployment in addition to hardware virtualization. They are being widely used by organizations to deploy their increasingly diverse workloads derived from modern-day applications such as web services, big data, and IoT in either proprietary clusters or private and public cloud data centers. This has led to the emergence of container orchestration platforms, which are designed to manage the deployment of containerized applications in large-scale clusters. These systems are capable of running hundreds of thousands of jobs across thousands of machines. To do so efficiently, they must address several important challenges including scalability, fault-tolerance and availability, efficient resource utilization, and request throughput maximization among others. This paper studies these management systems and proposes a taxonomy that identifies different mechanisms that can be used to meet the aforementioned challenges. The proposed classification is then applied to various state-of-the-art systems leading to the identification of open research challenges and gaps in the literature intended as future directions for researchers working in this topic.
An important problem in machine learning and statistics is to identify features that causally affect the outcome. This is often impossible to do from purely observational data, and a natural relaxation is to identify features that are correlated with the outcome even conditioned on all other observed features. For example, we want to identify that smoking really is correlated with cancer conditioned on demographics. The knockoff procedure is a recent breakthrough in statistics that, in theory, can identify truly correlated features while guaranteeing that the false discovery rate is limited. The idea is to create synthetic data (knockoffs) that capture correlations amongst the features. However, there are substantial computational and practical challenges to generating and using knockoffs. This paper makes several key advances that enable knockoff application to be more efficient and powerful. We develop an efficient algorithm to generate valid knockoffs from Bayesian Networks. Then we systematically evaluate knockoff test statistics and develop new statistics with improved power. The paper combines new mathematical guarantees with systematic experiments on real and synthetic data.
With the growing adoption of machine learning techniques, there is a surge of research interest towards making machine learning systems more transparent and interpretable. Various visualizations have been developed to help model developers understand, diagnose, and refine machine learning models. However, a large number of potential but neglected users are domain experts who have little knowledge of machine learning but are expected to work with machine learning systems. In this paper, we present an interactive visualization technique to help users with little expertise in machine learning to understand, explore and validate predictive models. By viewing the model as a black box, we extract a standardized rule-based knowledge representation from its input-output behavior. We design RuleMatrix, a matrix-based visualization of rules to help users navigate and verify the rules and the black-box model. We evaluate the effectiveness of RuleMatrix via two use cases and a usability study.
Recommender Systems have been widely used to help users in finding what they are looking for, thus tackling the information overload problem. After several years of research and industrial findings looking after better algorithms to improve accuracy and diversity metrics, explanation services for recommendation are gaining momentum as a tool to provide human-understandable feedback on results computed, in most of the cases, by black-box machine learning techniques. As a matter of fact, explanations may guarantee users' satisfaction, trust, and loyalty in a system. In this paper, we evaluate how different kinds of information encoded in a Knowledge Graph are perceived by users when adopted to show them an explanation. More precisely, we compare how the use of categorical information, factual information, or a mixture of the two in building explanations affects the explanatory criteria for a recommender system. Experimental results are validated through an A/B testing platform which uses a recommendation engine based on a Semantics-Aware Autoencoder to build user profiles, which are in turn exploited to compute recommendation lists and to provide an explanation.
Recent works in recommendation systems have focused on diversity in recommendations as an important aspect of recommendation quality. In this work we argue that the post-processing algorithms aimed at only improving diversity among recommendations lead to discrimination among the users. We introduce the notion of user fairness which has been overlooked in literature so far and propose measures to quantify it. Our experiments on two diversification algorithms show that an increase in aggregate diversity results in increased disparity among the users.
Many theories of deep learning have shown that a deep network can require dramatically fewer resources to represent a given function compared to a shallow network. But a question remains: can these efficient representations be learned using current deep learning techniques? In this work, we test whether standard deep learning methods can in fact find the efficient representations posited by several theories of deep representation. Specifically, we train deep neural networks to learn two simple functions with known efficient solutions: the parity function and the fast Fourier transform. We find that using gradient-based optimization, a deep network does not learn the parity function, unless initialized very close to a hand-coded exact solution. We also find that a deep linear neural network does not learn the fast Fourier transform, even in the best-case scenario of infinite training data, unless the weights are initialized very close to the exact hand-coded solution. Our results suggest that not every element of the class of compositional functions can be learned efficiently by a deep network, and further restrictions are necessary to understand what functions are both efficiently representable and learnable.
We design and study a Contextual Memory Tree (CMT), a learning memory controller that inserts new memories into an experience store of unbounded size. It is designed to efficiently query for memories from that store, supporting logarithmic time insertion and retrieval operations. Hence CMT can be integrated into existing statistical learning algorithms as an augmented memory unit without substantially increasing training and inference computation. We demonstrate the efficacy of CMT by augmenting existing multi-class and multi-label classification algorithms with CMT and observe statistical improvement. We also test CMT learning on several image-captioning tasks to demonstrate that it performs computationally better than a simple nearest neighbors memory system while benefitting from reward learning.
The proliferation of information disseminated by public/social media has made decision-making highly challenging due to the wide availability of noisy, uncertain, or unverified information. Although the issue of uncertainty in information has been studied for several decades, little work has investigated how noisy (or uncertain) or valuable (or credible) information can be formulated into people's opinions, modeling uncertainty both in the quantity and quality of evidence leading to a specific opinion. In this work, we model and analyze an opinion and information model by using Subjective Logic, where the initial set of evidence is mixed with different types of evidence (i.e., pro vs. con or noisy vs. valuable) which is incorporated into the opinions of original propagators, who propagate information over a network. With the help of an extensive simulation study, we examine how different ratios of information types, agents' prior beliefs, or topic competence affect the overall information diffusion. Based on our findings, agents' high uncertainty is not necessarily always bad for making a right decision, as long as they are at least competent enough not to be biased towards false information (e.g., neutral between two extremes).
In this paper, we propose an acceleration scheme for online memory-limited PCA methods. Our scheme converges to the first $k>1$ eigenvectors in a single data pass. We provide empirical convergence results of our scheme based on the spiked covariance model. Our scheme does not require any predefined parameters such as the eigengap and hence is well facilitated for streaming data scenarios. Furthermore, we apply our scheme to challenging time-varying systems where online PCA methods fail to converge. Specifically, we discuss a family of time-varying systems that are based on Molecular Dynamics simulations where batch PCA converges to the actual analytic solution of such systems.
This paper introduces Jensen, an easily extensible and scalable toolkit for production-level machine learning and convex optimization. Jensen implements a framework of convex (or loss) functions, convex optimization algorithms (including Gradient Descent, L-BFGS, Stochastic Gradient Descent, Conjugate Gradient, etc.), and a family of machine learning classifiers and regressors (Logistic Regression, SVMs, Least Square Regression, etc.). This framework makes it possible to deploy and train models with a few lines of code, and also extend and build upon this by integrating new loss functions and optimization algorithms.
In this paper, we compare different types of Recurrent Neural Network (RNN) encoder-decoders from an anomaly detection viewpoint. We focused on finding the model that can learn the same data more effectively. We compared multiple models under the same conditions, such as the number of parameters, optimizer, and learning rate. The difference between the models is whether they predict the future sequence or restore the current sequence. We constructed the dataset from simple vectors and used it for the experiment. Finally, we experimentally confirmed that a model performs better when it restores the current sequence, rather than predicting the future sequence.
In recent years, situation awareness has been recognised as a critical part of effective decision making, in particular for crisis management. One way to extract value and allow for better situation awareness is to develop a system capable of analysing a dataset of multiple posts, and clustering consistent posts into different views or stories (or, world views). However, this can be challenging as it requires an understanding of the data, including determining what is consistent data, and what data corroborates other data. Attempting to address these problems, this article proposes Subject-Verb-Object Semantic Suffix Tree Clustering (SVOSSTC) and a system to support it, with a special focus on Twitter content. The novelty and value of SVOSSTC is its emphasis on utilising the Subject-Verb-Object (SVO) typology in order to construct semantically consistent world views, in which individuals—particularly those involved in crisis response—might achieve an enhanced picture of a situation from social media data. To evaluate our system and its ability to provide enhanced situation awareness, we tested it against existing approaches, including human data analysis, using a variety of real-world scenarios. The results indicated a noteworthy degree of evidence (e.g., in cluster granularity and meaningfulness) to affirm the suitability and rigour of our approach. Moreover, these results highlight this article’s proposals as innovative and practical system contributions to the research field.

### Announcing Databricks Runtime 4.2!

We’re excited to announce Databricks Runtime 4.2, powered by Apache Spark™.  Version 4.2 includes updated Spark internals, new features, and major performance upgrades to Databricks Delta, as well as general quality improvements to the platform.  We are moving quickly toward the Databricks Delta general availability (GA) release and we recommend you upgrade to Databricks Runtime 4.2 to take advantage of these improvements.

I’d like to take a moment to highlight some of the work the team has done to continually improve Databricks Delta:

• Streaming Directly to Delta Tables: Streams can now be directly written to a Databricks Delta table registered in the Hive metastore using df.writeStream.table(…).
• Path Consistency for Delta Commands: All Databricks Delta commands and queries now support referring to a table using its path as an identifier (that is, ``delta.`/path/to/table` ``). Previously, OPTIMIZE and VACUUM required non-standard use of string literals (that is, '/path/to/table').

We’ve also included powerful new features to Structured Streaming:

• Robust Streaming Pipelines with Trigger.Once: Trigger.Once is now supported in Databricks Delta. Previously, rate limits (for example, maxOffsetsPerTrigger or maxFilesPerTrigger) specified as source options or defaults could result in partial execution of available data. These options are now ignored when Trigger.Once is used, allowing all currently available data to be processed. Documentation is available at: Trigger.Once in the Databricks Runtime 4.2 release notes.
• Flexible Streaming Sink to Many Storage Options with foreachBatch(): You can now define a function that processes the output of every micro-batch using DataFrame operations in Scala. Documentation is available at: foreachBatch(). This adds flexibility in several ways, but most importantly, foreachBatch() lets you write streaming output to a range of storage options even if they don't support streaming as a sink.
• Support for streaming foreach() in Python has also been added. Documentation is available at: foreach().
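To make the contract concrete: foreachBatch() invokes a user-supplied function once per micro-batch, passing the batch and its id, and that function can write wherever it likes. A toy stand-in in plain Python, showing only the control flow (this is not the Spark API; the names here are invented for illustration):

```python
def foreach_batch(micro_batches, write_fn):
    """Mimic the foreachBatch contract: call the sink function once
    per micro-batch with (batch, batch_id). Because the function is
    ordinary code, it can target any store, streaming-capable or not."""
    for batch_id, batch in enumerate(micro_batches):
        write_fn(batch, batch_id)

# A hypothetical key-value store that has no streaming sink of its own.
storage = {}

def write_to_storage(batch, batch_id):
    # Use the batch id as an idempotency key for the write.
    storage[batch_id] = list(batch)

foreach_batch([[1, 2], [3], [4, 5, 6]], write_to_storage)
```

The batch id is what lets real implementations make writes idempotent when a micro-batch is retried.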

We included support for the SQL Deny command for table access control enabled clusters. Users can now deny specific permissions in the same way they are granted. A denied permission will supersede a granted one.  Detailed technical documentation is available at: SQL DENY.

To read more about the above new features and to see the full list of improvements included in Databricks Runtime 4.2, please refer to the release notes in the following locations:

--

The post Announcing Databricks Runtime 4.2! appeared first on Databricks.

### Accelerating Conway’s Game of Life with SIMD instructions

Conway’s Game of Life is one of the simplest non-trivial simulations one can program. It simulates the emergence of life from chaos. Though the rules are simple, the Game of Life has been studied for the last five decades.

The rules are simple. You have a grid where each cell has a single bit of state: it is either alive or dead. During each iteration, we look at the 8 neighbours of each cell and count the number of live neighbours. If a cell is dead but has exactly three live neighbours, it comes alive. If a live cell has more than 3 or fewer than 2 live neighbours, it dies. That is all.

Implemented in a straight-forward manner, the main loop might look like this…

for (int i = 0; i < height; i++) {
  for (int j = 0; j < width; j++) {
    bool alive = states[i][j];
    // neighbour counts precomputed from the current grid, so
    // updating 'states' in place does not corrupt the counts
    int neighbours = count_live_neighbours[i][j];
    if (alive) {
      if (neighbours < 2 || neighbours > 3) {
        states[i][j] = false; // dies of under- or over-population
      }
    } else {
      if (neighbours == 3) {
        states[i][j] = true;  // a new cell is born
      }
    }
  }
}


However, if you implement it in this manner, it is hard for an optimizing compiler to generate clever code. For a 10,000 by 10,000 grid, my basic C implementation takes 0.5 seconds per iteration.

So I wondered whether I could rewrite the code in a vectorized manner, using the SIMD instructions available on our commodity processors. My first attempt brought this down to 0.02 seconds per iteration or about 25 times faster. My code is available.

| implementation         | time per iteration |
|------------------------|--------------------|
| scalar (C)             | 0.5 s              |
| vectorized (C + SIMD)  | 0.02 s             |

I use 32-byte AVX2 intrinsics. I did not profile my code or do any kind of hard work.

Thoughts:

1. At a glance, I would guess that the limiting factor is the number of “loads”. An x64 processor can, at best, load two registers from memory per cycle, and I have many loads. The arithmetic operations (additions, subtractions) probably come for free. My implementation uses 8 bits per cell whereas a single bit is sufficient; going to this more concise representation would reduce the number of loads by nearly an order of magnitude. My guess is that, on mainstream CPUs, I am probably between a factor of 5 and 10 away from the optimal implementation; I expect that I am at least a factor of two away from the optimal speed.
2. The game-of-life problem is very similar to an image-processing problem: it is a 3×3 moving/convolution filter, so tricks from image processing can be brought to bear. In particular, the problem is a good fit for GPU processing.
3. I did not look at existing game-of-life implementations. I was mostly trying to come up with the answer by myself as quickly as possible. My bet would be on GPU implementations beating my implementation by a wide margin (orders of magnitude).
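Point 2 can be made concrete: with the grid stored as an array, the neighbour count is just the sum of the eight shifted copies of the grid, and the update rule becomes a few elementwise operations that vectorize naturally. A NumPy sketch of this convolution view (an illustration of the idea, not the SIMD code from this post; edges wrap around torus-style):

```python
import numpy as np

def life_step(grid):
    """One Game of Life generation. `grid` is a 2-D array of 0/1.
    Neighbour counts are the sum of the eight shifted copies of
    the grid (np.roll wraps around at the edges)."""
    n = sum(np.roll(np.roll(grid, di, axis=0), dj, axis=1)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0))
    # birth on exactly 3 live neighbours; survival on 2 or 3
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(grid.dtype)
```

A GPU or SIMD implementation does essentially the same shifted-sum, just with explicit registers instead of whole-array operations.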

Update: John Regehr points me to Hashlife as a better high-speed reference.

### Highlights from the useR! 2018 conference in Brisbane

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The fourteenth annual worldwide R user conference, useR!2018, was held last week in Brisbane, Australia and it was an outstanding success. The conference attracted around 600 users from around the world and — as the first held in the Southern hemisphere — brought many first-time conference-goers to useR!. (There were also a number of beginning R users as well, judging from the attendance at the beginner's tutorial hosted by R-Ladies.) The program included 19 3-hour workshops, 6 keynote presentations, and more than 200 contributed talks, lightning talks, and posters on using, extending, and deploying R.

If you weren't able to make it to Brisbane, you can nonetheless relive the experience thanks to the recorded videos. Almost all of the tutorials, keynotes and talks are available to view for free, courtesy of the R Consortium. (A few remain to be posted, so keep an eye on the channel.) Here are a few of my personal highlights, based on talks I saw in Brisbane or have managed to catch online since then.

### Keynote talks

Steph de Silva, Beyond Syntax: on the power and potentiality of deep open source communities. A moving look at how open source communities, and especially R, grow and evolve.

Bill Venables, Adventures with R. It was wonderful to see the story and details behind an elegantly designed experiment investigating spoken language, and this example was used to great effect to contrast the definitions of "Statistics" and "Data Science". Bill also includes the best piece of advice to give anyone joining a specialized group: "Everyone here is smart; distinguish yourself by being kind".

Kelly O'Brian's short history of RStudio was an interesting look at the impact of RStudio (the IDE and the company) on the R ecosystem.

Thomas Lin Pedersen, The Grammar of Graphics. A really thought-provoking talk about the place of animations in the sphere of data visualization, and an introduction to the gganimate package which extends ggplot2 in a really elegant and powerful way.

Danielle Navarro, R for Psychological Science. A great case study in introducing statistical programming to social scientists.

Roger Peng, Teaching R to New Users. A fascinating history of the R project, and how changes in the user community have been reflected in changes in programming frameworks. The companion essay summarizes the talk clearly and concisely.

Jenny Bryan, Code Smells. This was an amazing talk with practical recommendations for better R coding practices. The video isn't online yet, but the slides are available to view online.

### Contributed talks

Bryan Galvin, Moving from Prototype to Production in R, a look inside the machine learning infrastructure at Netflix. Who says R doesn't scale?

Peter Dalgaard, What's in a Name? The secrets of the R build and release process, and the story behind their codenames.

Martin Maechler, Helping R to be (even more) Accurate. On R's near-obsessive attention to the details of computational accuracy.

Rob Hyndman, Tidy Forecasting in R. The next generation of time series forecasting methods in R.

Nicholas Tierney, Maxcovr: Find the best locations for facilities using the maximal covering location problem. Giftastic!

David Smith, Speeding up computations in R with parallel programming in the cloud. My talk on the doAzureParallel package.

David Smith, The Voice of the R Community. My talk for the R Consortium with the results of their community survey.

In addition, several of my colleagues from Microsoft were in attendance (Microsoft was a proud Platinum sponsor of useR!2018) and delivered talks of their own:

Angus Taylor, Deep Learning at Scale with Azure Batch AI

Miguel Fierro, Spark on Demand with AZTK

Overall, I thought useR!2018 was a wonderful conference. Great talks, friendly people, and impeccably organized. Kudos to all of the organizing committee, and particularly Di Cook, for putting together such a fantastic event. Next year's conference will be held in Toulouse, France and already has a great set of keynote speakers announced. But in the meantime, you can catch up on the talks from useR!2018 at the R Consortium YouTube channel linked below.


### Experiment design and modeling for long-term studies in ads

by HENNING HOHNHOLD, DEIRDRE O'BRIEN, and DIANE TANG

In this post we discuss the challenges in measuring and modeling the long-term effect of ads on user behavior. We describe experiment designs which have proven effective for us and discuss the subtleties of trying to generalize the results via modeling.

A/B testing is used widely in information technology companies to guide product development and improvements. For questions as disparate as website design and UI, prediction algorithms, or user flows within apps, live traffic tests help developers understand what works well for users and the business, and what doesn’t.

Nevertheless, A/B testing has challenges and blind spots, such as:
1. the difficulty of identifying suitable metrics that give "works well" a measurable meaning. This is essentially the same as finding a truly useful objective to optimize.
2. capturing long-term user behavior changes that develop over time periods exceeding the typical duration of A/B tests, say, over several months rather than a few days.
3. accounting for effects "orthogonal" to the randomization used in experimentation. For example in ads, experiments using cookies (users) as experimental units are not suited to capture the impact of a treatment on advertisers or publishers nor their reaction to it.
A small but persistent team of data scientists within Google’s Search Ads has been pursuing item #2 since about 2008, leading to a much improved understanding of the long-term user effects we miss when running typical short A/B tests. This work has also resulted in advances for item #1, as it helped us define more useful objectives for A/B tests in Search Ads which include long-term impact of experimental treatments.

Recently, we presented some basic insights from our effort to measure and predict long-term effects at KDD 2015 [1]. In this blog post, we summarize that paper and refer you to it for details. Since we work in Google’s Search Ads group, the long-term effects our studies focus on are ads blindness and sightedness, that is, changes in users’ propensity to interact with the ads on Google’s search results page. However, much of the methodology is not ads-specific and should help investigate other questions such as long-term changes in website visit patterns or UI feature usage.

### A/A tests and long-term effects

In our quest to measure long-term user effects, we found a neat and surprising use case for A/A tests, i.e., experiments where treatment and control units receive identical treatments. Typically, these are used at Google to diagnose problems in the experiment infrastructure or undesired biases between treatment and control cohorts for A/B tests. Amazingly, in the context of long-term studies such "undesirable biases" turn into the main object of interest!

To see this, imagine you want to study long-term effects in an A/B test. The first thing you’ll want to do is to run your test for a long time with fixed experimental units, in our case cookies. Doing this affords time for long-term effects to develop and manifest themselves. The principal challenge is now to isolate long-term effects from the primary impact of applying the A/B treatment. Unfortunately, this is difficult since long-term effects are often much more subtle than the primary A/B effects, and even small changes in the latter (due to seasonality, other serving system updates, etc.) can overshadow the long-term effects we are trying to measure. We have found it basically impossible to adjust for such changes in the primary A/B effects through modeling. To see why this is so difficult, note, first, that there is a large number of factors that could potentially affect the primary A/B effects. Second, even if we could account for all of them, it would still be difficult to predict which ones interact with the A/B treatment (most of them don’t), and what the effect would be. Last, even if we could pull a model together, it might lack sufficient credibility to justify business-critical decisions since conclusions would depend strongly on model assumptions (given the relatively small size of the long-term effects).

Better experimental design, rather than fancy modeling, turned out to be the key to progress on this question. An elegant way to circumvent the issue of changes in the primary A/B treatment over time is to include a "post-period" in the experimental setup. By this we mean an A/A test with the same experimental units as in our A/B test, run immediately after the A/B treatment. Figure 1 shows the post-period and also includes an A/A test pre-period to verify that our experiment setup and randomization work as intended.

Figure 1: Pre-periods and post-periods

The simple but powerful rationale behind post-periods is that during an A/A comparison there are no “primary effects” and hence any differences between the test and control cohorts during the post-period are due to the preceding extended application of the A/B treatment to the experimental units. For Google search ads, we have found that this method gives reliable and repeatable measurements of user behavior changes caused by a wide variety of treatments.
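To make the post-period arithmetic concrete, here is a minimal sketch (not from the paper; all counts are invented) that checks whether the two cohorts' click-through rates differ during the A/A post-period, using a standard two-proportion z-test:

```python
from math import sqrt, erf

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_b - p_a, z, p_value

# Hypothetical post-period counts: both cohorts now receive the identical
# (A) treatment, so any CTR gap reflects learning from the prior A/B period.
delta, z, p = two_proportion_z(clicks_a=5200, n_a=1_000_000,
                               clicks_b=5000, n_b=1_000_000)
```

A statistically significant gap here cannot come from the current, identical treatment, so it is attributed to learned behavior carried over from the preceding A/B period.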

An obvious downside of post-periods is that the measurement of long-term effects happens only after the end of the treatment period, which can last several months. In particular, no intermediate results become available, which is a real downer in practice.

This can be remedied by another addition to the experimental design, namely a second experiment serving the treatment B. The twist is that in this second experiment we re-randomize study participants daily. Since at Google the typical experimental unit is a cookie, we call this construct with daily re-randomization a cookie-day experiment (as opposed to the cookie experiments we’ve considered up to now, where the experimental units stay fixed across time). We take the cookies for our cookie-day experiment from a big pool of cookies that receive the control treatment whenever they are not randomized into our cookie-day experiment — which is almost always. Consequently, the longer-term aspects of their behavior are shaped by having experienced the control treatment.

On any given day of the treatment period, the cookie-day and cookie experiments serving B define a B/B test, which we call the cookie cookie-day (CCD) comparison. As in the post-period case, this allows us to attribute metric differences between the two groups to their previous differential exposure (A for the cookie-day experiment, and B for the cookie experiment).
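As a toy illustration (all parameters invented, not measurements from the paper), the following sketch shows what a CCD comparison delivers: a B/B delta that can be read off on every day of the treatment period, rather than only after it ends:

```python
import math

# Invented parameters: the fixed-cookie arm has served B since day 0, so its
# click propensity drifts toward a new equilibrium; the cookie-day arm is
# re-randomized daily from control-treated cookies, so it stays at baseline.
BASELINE_CTR = 0.0052     # control-shaped user behavior
EQUILIBRIUM_CTR = 0.0046  # where extended exposure to B ends up
LEARNING_RATE = 0.03      # per-day exponential learning rate

def cookie_arm_ctr(day):
    """Exponential approach of the fixed-cookie arm toward its equilibrium."""
    return (EQUILIBRIUM_CTR
            + (BASELINE_CTR - EQUILIBRIUM_CTR) * math.exp(-LEARNING_RATE * day))

# Daily B/B delta: cookie experiment minus cookie-day experiment.
ccd_deltas = [cookie_arm_ctr(day) - BASELINE_CTR for day in range(90)]
```

The delta starts near zero and widens as user learning accumulates, which is exactly the intermediate signal the post-period design cannot provide.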

A neat aspect of CCD is that it allows us to follow user behavior changes while they are happening. For example, Figure 2 shows the change in users’ propensity to click on Google search ads for 10 different system changes that vary ad load and ranking algorithms. Depending on the average ad quality the different cohorts are exposed to, their willingness to interact with our ads changes over time. We learn that the average user “attitude towards ads” in each of the 10 cohorts approaches a new equilibrium, and that the process can be approximated reasonably well by exponential curves with a common learning rate (dashed lines).
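The exponential approximation described above can be fit with a learning rate shared across cohorts. Here is one way to sketch it (synthetic, noiseless data; not the paper's actual fitting procedure): for a curve y_c(t) = A_c * (1 - exp(-k * t)), the per-cohort amplitude A_c has a closed-form least-squares solution given k, so a one-dimensional grid search over the common rate k suffices:

```python
import math

# Synthetic daily learning measurements for two hypothetical cohorts,
# generated from the exponential model itself (noiseless for simplicity).
days = list(range(60))
true_k = 0.05
cohorts = {
    "heavy_ad_load": [-8e-4 * (1 - math.exp(-true_k * t)) for t in days],
    "light_ad_load": [4e-4 * (1 - math.exp(-true_k * t)) for t in days],
}

def sse_for_rate(k):
    """Total squared error, with each cohort's amplitude solved in closed form."""
    x = [1 - math.exp(-k * t) for t in days]
    sse = 0.0
    for y in cohorts.values():
        amp = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
        sse += sum((yi - amp * xi) ** 2 for xi, yi in zip(x, y))
    return sse

# Grid-search the shared learning rate over k in (0, 0.2).
best_k = min((k / 1000 for k in range(1, 200)), key=sse_for_rate)
```

With real, noisy CCD data the same structure applies; only the residuals change.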

Figure 2: User learning as measured by CCD experiments

Note that all learning measurements here are taken at an aggregate (population) level, not on a per-user basis.

### Modeling long-term effects in ads

In addition to measuring long-term effects we’ve also made efforts to model them. This is attractive for many reasons, the most prominent being:
• running long-term studies is labor-intensive, expensive, and slow. Reliable models that predict these effects let us take them into account without slowing down development.
• interpretable models help us understand what drives user behavior. This knowledge has influenced our decision-making well beyond the concrete cases we studied in detail.
The most important insight from our modeling efforts is that users’ attitude towards Google’s search ads is, above all, shaped by the average quality of the ads they experienced previously. More precisely, we learned that both the relevance of the ads shown and the experience after users click substantially influence their future propensity to interact with ads. For more details see [1] Section 4. A scatterplot of observations vs. predictions for a model of this type is given in Figure 3:

Figure 3: Predicted vs. measured user learning.

The plot makes clear that our quality-based models can predict how users will react to a relatively large class of changes to Google’s ads system. (Note that UI manipulations are absent here; we have found these to be much harder to understand from a modeling perspective.) We use this knowledge to define objective functions to optimize our ads system with a view toward the long term. In other words, we have created a long-term focused OEC (Overall Evaluation Criterion [2]) for online experiment evaluation.

You’ve probably noticed just how few data points the scatter-plot contains. That’s because each observation here is a long-term study, usually with a treatment duration of about three months. Hence generating suitable training data is challenging, and as a consequence we ran into the curious situation of dealing with extremely small data at Google. Over the years, our modeling efforts have taught us that in such a situation
• cross-validation may not be sufficient to prevent overfitting when the data is sparse and the set of available covariates is large. Moreover, not all training data is created equal — in our case several observations come from conceptually similar treatments. This additional structure must be taken into account, at the very least in creating cross-validation folds. Otherwise cross-validation RMSEs might seriously overstate prediction accuracy on new test data.
• choosing interpretable models both appeals to humans and reduces the model space, which improves prediction accuracy on test data.
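As one way to honor that grouped structure (the study names and families below are hypothetical), cross-validation folds can be built so that all studies from one family of similar treatments land in the same fold:

```python
from collections import defaultdict

# Hypothetical long-term studies, tagged with their treatment family.
studies = [
    ("ad_load_plus_10", "ad_load"), ("ad_load_plus_20", "ad_load"),
    ("ad_load_minus_10", "ad_load"), ("ranking_tweak_v1", "ranking"),
    ("ranking_tweak_v2", "ranking"), ("quality_filter", "filtering"),
]

def group_folds(items, n_folds):
    """Assign whole groups to folds so related studies never straddle a split."""
    by_group = defaultdict(list)
    for name, group in items:
        by_group[group].append(name)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest remaining group into the smallest fold.
    for group in sorted(by_group, key=lambda g: -len(by_group[g])):
        min(folds, key=len).extend(by_group[group])
    return folds

folds = group_folds(studies, n_folds=3)
```

Held-out error estimated on such group-aware folds is less likely to overstate accuracy, because a model never gets to memorize one study from a family and be tested on its near-twin.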

Finally, the fact that our interpretable models say “quality makes users more engaged” also helps validate the overall measurement methodology. We are sometimes asked whether the long-term effects we measure might just be biases on long-lived cookies or a similar unnoticed failure of our experiment setup. This seems unlikely: why should quality reliably predict cookie biases? We have certainly performed many other checks, such as negative controls and meaningful dose-response relationships, but it is nice to have this simple result as validation.

### Conclusion

We’ve described methods to measure and predict changes in long-term user behavior. These methods have had lasting impact on ad serving at Google. For instance, in 2011 we altered the AdWords auction ranking function to account for the long-term impact of showing a given ad. The function determines which ads can show on the search results page (SERP) and in which order, and this adjustment places greater emphasis on user satisfaction after ad clicks. Long-term studies played a crucial role both in the motivation and the evaluation of this change (see [1] Section 5).