# My Data Science Blogs

## February 19, 2018

### R Packages worth a look

Easily Create Pretty Popup Messages (Modals) in ‘Shiny’ (shinyalert)
Easily create pretty popup messages (modals) in ‘Shiny’. A modal can contain text, images, OK/Cancel buttons, an input to get a response from the user, and many more customizable options.
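A minimal usage sketch, assuming the package's documented API (this example is not taken from the announcement itself):

```r
library(shiny)
library(shinyalert)

ui <- fluidPage(
  useShinyalert(),  # set up shinyalert in the UI
  actionButton("save", "Save")
)

server <- function(input, output, session) {
  observeEvent(input$save, {
    shinyalert(title = "Saved!", text = "Your changes have been stored.",
               type = "success")
  })
}

# shinyApp(ui, server)
```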

Compress and Decompress ‘Snappy’ Encoded Data (snappier)
Compression and decompression with ‘Snappy’.

Tools to Support Evidence Synthesis (revtools)
Researchers commonly need to summarize scientific information, a process known as ‘evidence synthesis’. The first stage of a synthesis process (such as a systematic review or meta-analysis) is to download a list of references from academic search engines such as ‘Web of Knowledge’ or ‘Scopus’. This information can be sorted manually (the traditional approach to systematic review), or the user can draw on tools from machine learning to help them visualise patterns in the corpus. ‘revtools’ uses topic models to render ordinations of text drawn from article titles, keywords and abstracts, and allows the user to interactively select or exclude individual references, words or topics. ‘revtools’ does not currently provide tools for analysis of data drawn from those references, features that are available in other packages such as ‘metagear’ or ‘metafor’.

### If you did not already know

Multi-Robot Transfer Learning
Multi-robot transfer learning allows a robot to use data generated by a second, similar robot to improve its own behavior. The potential advantages are reducing the time of training and the unavoidable risks that exist during the training phase. Transfer learning algorithms aim to find an optimal transfer map between different robots. In this paper, we investigate, through a theoretical study of single-input single-output (SISO) systems, the properties of such optimal transfer maps. We first show that the optimal transfer learning map is, in general, a dynamic system. The main contribution of the paper is to provide an algorithm for determining the properties of this optimal dynamic map including its order and regressors (i.e., the variables it depends on). The proposed algorithm does not require detailed knowledge of the robots’ dynamics, but relies on basic system properties easily obtainable through simple experimental tests. We validate the proposed algorithm experimentally through an example of transfer learning between two different quadrotor platforms. Experimental results show that an optimal dynamic map, with correct properties obtained from our proposed algorithm, achieves 60-70% reduction of transfer learning error compared to the cases when the data is directly transferred or transferred using an optimal static map. …

Tensorial Mixture Models
We introduce a generative model, we call Tensorial Mixture Models (TMMs) based on mixtures of basic component distributions over local structures (e.g. patches in an image) where the dependencies between the local-structures are represented by a ‘priors tensor’ holding the prior probabilities of assigning a component distribution to each local-structure. In their general form, TMMs are intractable as the prior tensor is typically of exponential size. However, when the priors tensor is decomposed it gives rise to an arithmetic circuit which in turn transforms the TMM into a Convolutional Arithmetic Circuit (ConvAC). A ConvAC corresponds to a shallow (single hidden layer) network when the priors tensor is decomposed by a CP (sum of rank-1) approach and corresponds to a deep network when the decomposition follows the Hierarchical Tucker (HT) model. The ConvAC representation of a TMM possesses several attractive properties. First, the inference is tractable and is implemented by a forward pass through a deep network. Second, the architectural design of the model follows the deep networks community design, i.e., the structure of TMMs is determined by just two easily understood factors: size of pooling windows and number of channels. Finally, we demonstrate the effectiveness of our model when tackling the problem of classification with missing data, leveraging TMMs unique ability of tractable marginalization which leads to optimal classifiers regardless of the missingness distribution. …

Graph-based Activity Regularization (GAR)
In this paper, we propose a novel graph-based approach for semi-supervised learning problems, which considers an adaptive adjacency of the examples throughout the unsupervised portion of the training. Adjacency of the examples is inferred using the predictions of a neural network model which is first initialized by a supervised pretraining. These predictions are then updated according to a novel unsupervised objective which regularizes another adjacency, now linking the output nodes. Regularizing the adjacency of the output nodes, inferred from the predictions of the network, creates an easier optimization problem and ultimately provides that the predictions of the network turn into the optimal embedding. Ultimately, the proposed framework provides an effective and scalable graph-based solution which is natural to the operational mechanism of deep neural networks. Our results show state-of-the-art performance within semi-supervised learning with the highest accuracies reported to date in the literature for SVHN and NORB datasets. …

## February 18, 2018

### SwissPost putting another nail in the coffin of Swiss sovereignty

(This article was first published on DanielPocock.com - r-project, and kindly contributed to R-bloggers)

A few people have recently asked me about the SwissID, as SwissPost has just been sending spam emails out to people telling them “Link your Swiss Post user account to SwissID”.

SwissID is not the only digital identity solution in Switzerland but as it is run by SwissPost and has a name similar to another service it is becoming very well known.

In 2010 they began offering a solution which they call SuisseID (notice the difference?) based on digital certificates and compliant with Swiss legislation. Public discussion focussed on the obscene cost with little comment about the privacy consequences and what this means for Switzerland as a nation.

Digital certificates often embed an email address in the certificate.

With SwissID, however, they have a web site that looks like little more than vaporware, giving no details at all about whether certificates are used. It appears they are basically promoting an app designed to harvest the email addresses and phone numbers of any Swiss people who install it, lulling them into that folly with a name that looks like their original SuisseID. If it looks like phishing, if it feels like phishing, and if it smells like phishing when any expert takes a brief sniff of their FAQ, then what else is it?

The thing is, the original SuisseID runs on a standalone smartcard, so it doesn’t need your mobile phone number, doesn’t need permission to access all the data on your phone, and isn’t limited to working in areas with mobile phone signal.

The emails currently being sent by SwissPost tell people they must “Please use a private e-mail address for this purpose” but they don’t give any information about the privacy consequences of creating such an account or what their app will do when it has access to read all the messages and contacts in your phone.

### The actions you can take that they didn’t tell you about

• You can post a registered letter to SwissPost and tell them that for privacy reasons, you are immediately retracting the email addresses and mobile phone numbers they currently hold on file and that you are exercising your right not to give an email address or mobile phone number to them in future.
• If you do decide you want a SwissID, create a unique email address for it and only use that email address with SwissPost so that it can’t be cross-referenced with other companies. This email address is also like a canary in a coal mine: if you start receiving spam on that email address then you know SwissPost/SwissID may have been hacked or the data has been leaked or sold.
• Don’t install their app and if you did, remove it and you may want to change your mobile phone number.

Oddly enough, none of these privacy-protecting ideas were suggested in the email from SwissPost. Whose side are they on?

### Why should people be concerned?

SwissPost, like every postal agency, has seen traditional revenues drop and so they seek to generate more revenue from direct marketing and they are constantly looking for ways to extract and profit from data about the public. They are also a huge company with many employees: when dealing with vast amounts of data in any computer system, it only takes one employee to compromise everything: just think of how Edward Snowden was able to act alone to extract many of the NSA’s most valuable secrets.

SwissPost is going to great lengths to get accurate data on every citizen and resident in Switzerland, including deploying an app to get your mobile phone number and demanding an email address when you use their web site. That also allows them to cross-reference with your IP addresses.

• When you call a company from your mobile phone and their system recognizes your phone number, it becomes easier for them to match it to your home address.
• If SwissPost and the SBB successfully convince a lot of people to use a SwissID, some other large web sites may refuse to allow access without getting you to link them to your SwissID and all the data behind it too. Think of how many websites already try to coerce you to give them your mobile phone number and birthday to “secure” your account, but worse.

### The Google factor

The creepiest thing is that, apparently, over seventy percent of people in Switzerland use Gmail addresses, and these will be a dependency of their registration for SwissID.

Given that SwissID is being promoted as a solution compliant with ZertES legislation that can act as an interface between citizens and the state, the intersection with such a powerful foreign actor as Gmail is extraordinary. For example, if people are registering to vote in Switzerland’s renowned referendums and their communication is under the surveillance of a foreign power like the US, that is a mockery of democracy and it makes the allegations of Russian election hacking look like child’s play.

Switzerland’s referendums, decentralized system of Government, part-time army and privacy regime are all features that maintain a balance between citizen and state: by centralizing power in the hands of SwissID and foreign IT companies, doesn’t it appear that the very name SwissID is a mockery of the Swiss identity?

No canaries were harmed in the production of this blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### dplyr, (mc)lapply, for-loop and speed

(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

I was at EdinbR on Friday evening and David Robinson’s talk prompted some excellent discussions, not least with Isla and Gergana. One topic was dplyr versus lapply. I started using R in 2012, just before dplyr came to prominence, and so I seem to have one foot in base R and the other in the tidyverse. Ambitiously aiming for the best of both worlds!

I often use lapply to wrap up my scripts which clean and process files, but Isla pointed out I could do this with dplyr. I can’t figure out (with the Sunday time available) how to use write_csv in a pipe and pass it an argument for a file name, without resorting to a loop/*apply (please tell me in the comments!).

This post is for all those folk who might be considering wrapping some tidyverse commands in a function to make their code cleaner and hopefully gain some speed advantages.

For this example I’ve created a dummy set of raw data on which we can perform an arbitrary selection and transformation before saving the result as clean. In this case I’ve created dates with some observations happening each day, labelled as ‘a’ to ‘c’. Alongside this I’ve duplicated these columns to give us something to drop. Dummy data created with:

library(tidyverse)

# Dummy files to process
dir.create("./temp/raw", recursive=T)
dir.create("./temp/clean")

lapply(1950:2017, function(i){
  date = seq.Date(as.Date(paste0(i, "-01-01")),
                  as.Date(paste0(i, "-12-31")),
                  by=1)
  a = rnorm(length(date))
  a1 = rnorm(length(date))
  a2 = rnorm(length(date))
  b = rpois(length(date), 10)
  b1 = rpois(length(date), 10)
  b2 = rpois(length(date), 10)
  c = rexp(length(date), 5)
  c1 = rexp(length(date), 5)
  c2 = rexp(length(date), 5)

  # Write out every column, including the duplicates we'll later drop
  write_csv(data.frame(date, a, a1, a2, b, b1, b2, c, c1, c2),
            paste0("./temp/raw/file_", i, ".csv"))
})


We’ve now got a directory with 68 csv files, each containing some fabricated daily data. To read a file into R, the first thing we need is its path, which we can get with list.files():

# Get a vector of file names
f = list.files("./temp/raw", pattern="file")


Now we have an object, f, containing all our file names, and we can write a process to get them ready for analysis. I’m illustrating this by selecting four columns (date, a, b and c) and converting them to a long tidy format. I’ve had a stab at writing this process in the tidyverse alone, but can’t figure out how to pass write_csv() a file name. I suspect the answer lies in turning f into a data frame with a column for input and a column for output. Seems pretty awkward to me. I welcome answers in the comments!
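(An editor’s aside: one hedged, untested tidyverse-only sketch along those lines pairs the input and output paths in a data frame and steps over both with purrr::walk2():)

```r
# Hedged sketch, not from the original post: pair input and output
# paths, then walk over both columns with purrr::walk2().
library(tidyverse)

paths = tibble(infile  = paste0("./temp/raw/", f),
               outfile = paste0("./temp/clean/", f))

walk2(paths$infile, paths$outfile, function(i, o){
  read_csv(i) %>%
    select(date, a, b, c) %>%
    gather(variable, value, -date) %>%
    write_csv(o)
})
```

walk2() ships with the tidyverse via purrr and calls the anonymous function once per pair of paths.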

# Interactive dplyr
# Who knows what ??? should take, I don't!
system.time(
  read_csv(paste0("./temp/raw/", f)) %>%
    select(date, a, b, c) %>%
    gather(variable, value, -date) %>%
    write_csv(???)
)


The above doesn’t work, but we can adapt it slightly to make it a function. We’re now able to pass our tidyverse code individual file names, here represented by i. Finally, the clean data are written out into the clean folder. In the real world we may also want to change the file name to reflect the clean status.

# As a function
read_clean_write = function(i){
  read_csv(paste0("./temp/raw/", i)) %>%
    select(date, a, b, c) %>%
    gather(variable, value, -date) %>%
    write_csv(paste0("./temp/clean/", i))
}


Finally, we can run the above as a loop (usually bad), lapply or something else:

# Loop
for (j in f){
  read_clean_write(j)
}

# lapply
lapply(f, read_clean_write)

But how fast were they? Can we get faster? Thankfully R provides system.time() for timing code execution. To get faster, it makes sense to use all the processing power our machines have. The ‘parallel’ library has some great tools to help us run our jobs in parallel and take advantage of multicore processing. My favourite is mclapply(), because it makes it very easy to take an lapply and make it multicore. Note that mclapply doesn’t work on Windows. The following script runs the read_clean_write() function in a for loop (boo, hiss), lapply and mclapply. I’ve stored the timings as list elements to make life easier later on.

library(parallel)

# Loop
times = list(
  loop = system.time(
    for (j in f){
      read_clean_write(j)
    }
  ),
  lapply = system.time(
    lapply(f, read_clean_write)
  ),
  mclapply = system.time(
    mclapply(f, read_clean_write, mc.cores=4)  # 4 cores, per the machine described below
  )
)


Next we can plot up these results. I’m using sapply to get only the elapsed time from the proc.time object, and then cleaning the elapsed part from the vector name.

Single run comparison of 3 loop methods in R.

x = sapply(times, function(i){i["elapsed"]})
names(x) = substr(names(x), 1, nchar(names(x)) - 8)

par(mar=c(5, 5, 4, 2) + 0.1)
barplot(x, names.arg=names(x),
        main="Elapsed time on an Intel i5 4460 with 4 cores at 3.2GHz",
        xlab="Seconds", horiz=T, las=1)
par(mar=c(5, 4, 4, 2) + 0.1)


Unsurprisingly mclapply is the clear winner. It spreads the work across four cores instead of one, so unless the job is very simple it will almost always be fastest!

Having run this code a few times, I noticed the results are not consistent. Because we’ve been working in code we can examine the variability. I’ve done this by running each method 100 times:

times = lapply(1:100, function(i){
  x = list(
    forloop = system.time(
      for (j in f){
        read_clean_write(j)
      }
    ),
    lapply = system.time(
      lapply(f, read_clean_write)
    ),
    mclapply = system.time(
      mclapply(f, read_clean_write, mc.cores=4)
    )
  )
  x = sapply(x, function(k){k["elapsed"]})
  names(x) = substr(names(x), 1, nchar(names(x)) - 8)
  x
})

# Tidy
x = lapply(seq_along(times), function(i){
  data.frame(run=i,
             forloop=times[[i]]["forloop"],
             lapply=times[[i]]["lapply"],
             mclapply=times[[i]]["mclapply"])
})
x = do.call("rbind.data.frame", x)


My poor computer! Now we can plot these results up. I’ve chosen violin plots to help us see the distribution of results:

Many runs comparing loop methods in R.

png("./temp/violin.png", height=500, width=1000)
x %>%
  gather(variable, value, -run) %>%
  ggplot(aes(variable, value)) +
  geom_violin(fill="grey70") +
  labs(title="100 runs comparing for-loop, lapply and mclapply",
       x="",
       y="Seconds") +
  coord_flip() +
  theme_bw() +
  theme(text = element_text(size=20),
        panel.grid=element_blank())
dev.off()


Then, we can pull out median values for each:

x %>%
  gather(variable, value, -run) %>%
  group_by(variable) %>%
  summarise(median(value))

| Method | Time (seconds) |
| --- | --- |
| forloop | 1.5255 |
| lapply | 1.5200 |
| mclapply | 0.4515 |


Finally, what do I recommend you use? Occasionally I need to use a for loop (less than once a year), because using lapply is too difficult. It’s nice to see I’m probably not incurring much penalty for this heresy. Generally, I believe lapply is an excellent solution, not least because if I need a speed boost I need only call library(parallel), tell mclapply how many cores to use, and I’m away. As it seems to be shout-out season (isn’t it always in the great R community?!), this book on efficient R programming by Colin and Robin is excellent!


### Document worth reading: “Neural Networks for Information Retrieval”

Machine learning plays a role in many aspects of modern IR systems, and deep learning is applied in all of them. The fast pace of modern-day research has given rise to many approaches to many IR problems. The amount of information available can be overwhelming both for junior students and for experienced researchers looking for new research topics and directions. The aim of this full-day tutorial is to give a clear overview of current tried-and-trusted neural methods in IR and how they benefit IR. Neural Networks for Information Retrieval

### Basics of Entity Resolution

Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism.

Unfortunately, the problems associated with entity resolution are equally big — as the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes increasingly difficult. Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further. When combined, such challenges amount to a substantial barrier to organizations’ ability to fully understand their data, let alone make effective use of predictive analytics to optimize targeting, thresholding, and resource management.

### Naming Your Problem

Let us first consider what an entity is. Much as the key step in machine learning is to determine what an instance is, the key step in entity resolution is to determine what an entity is. Let's define an entity as a unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.). That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites. If we want to ask questions about all the unique people, or businesses, or products in a dataset, we must find a method for producing an annotated version of that dataset that contains unique entities.

How can we tell that these multiple references point to the same entity? What if the attributes for each entity aren't the same across references? What happens when there are more than two or three or ten references to the same entity? Which one is the main (canonical) version? Do we just throw the duplicates away?

Each question points to a single problem, albeit one that frequently goes unnamed. Ironically, one of the problems in entity resolution is that even though it goes by a lot of different names, many people who struggle with entity resolution do not know the name of their problem.

The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

1. Deduplication: eliminating duplicate (exact) copies of repeated data.
2. Record linkage: identifying records that reference the same entity across different sources.
3. Canonicalization: converting data with more than one possible representation into a standard form.
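To make the three tasks concrete, here is a small illustrative sketch in plain Python (our own toy records, not Dedupe code):

```python
# Toy illustration of deduplication, record linkage, and canonicalization.
records = [
    {"name": "Jane Doe", "email": "jane@example.com"},
    {"name": "Jane Doe", "email": "jane@example.com"},  # exact duplicate
    {"name": "J. Doe",   "email": "jane@example.com"},  # same entity, new form
]

# 1. Deduplication: drop exact copies.
deduped = [dict(t) for t in {tuple(sorted(r.items())) for r in records}]

# 2. Record linkage: link records that share a strong identifier (email here).
linked = {}
for r in deduped:
    linked.setdefault(r["email"], []).append(r)

# 3. Canonicalization: pick one standard form per linked group,
#    e.g. the longest name observed.
canonical = [max(group, key=lambda r: len(r["name"]))
             for group in linked.values()]
# canonical now holds a single record for Jane Doe
```

Real data needs fuzzy matching rather than an exact shared key, which is exactly the gap Dedupe fills.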

Entity resolution is not a new problem, but thanks to Python and new machine learning libraries, it is an increasingly achievable objective. This post will explore some basic approaches to entity resolution using one of those tools, the Python Dedupe library. In this post, we will explore the basic functionalities of Dedupe, walk through how the library works under the hood, and perform a demonstration on two different datasets.

Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn't the only tool available in Python for entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as its primary task. In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets. Dedupe also scales fairly well: in this post we demonstrate using the library with a relatively small dataset of a few thousand records and a very large dataset of several million.

### How Dedupe Works

Effective deduplication relies largely on domain expertise. This is for two main reasons: first, because domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they've never seen it in practice. Second, domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look. As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.

### Testing Out Dedupe

Getting started with Dedupe is easy, and the developers have provided a convenient repo with examples that you can use and iterate on. Let's start by walking through the csv_example.py from the dedupe-examples. To get Dedupe running, we'll need to install unidecode, future, and dedupe.

In your terminal (we recommend doing so inside a virtual environment):

git clone https://github.com/DistrictDataLabs/dedupe-examples.git
cd dedupe-examples

pip install unidecode
pip install future
pip install dedupe


Then we'll run the csv_example.py file to see what dedupe can do:

python csv_example.py


### Blocking and Affine Gap Distance

Let's imagine we own an online retail business, and we are developing a new recommendation engine that mines our existing customer data to come up with good recommendations for products that our existing and new customers might like to buy. Our dataset is a purchase history log where customer information is represented by attributes like name, telephone number, address, and order history. The database we've been using to log purchases assigns a new unique ID for every customer interaction.

But it turns out we're a great business, so we have a lot of repeat customers! We'd like to be able to aggregate the order history information by customer so that we can build a good recommender system with the data we have. That aggregation is easy if every customer's information is duplicated exactly in every purchase log. But what if it looks something like the table below?

How can we aggregate the data so that it is unique to the customer rather than the purchase? Features in the data set like names, phone numbers, and addresses will probably be useful. What is notable is that there are numerous variations for those attributes, particularly in how names appear — sometimes as nicknames, sometimes even misspellings. What we need is an intelligent and mostly automated way to create a new dataset for our recommender system. Enter Dedupe.

When comparing records, rather than treating each record as a single long string, Dedupe cleverly exploits the structure of the input data to instead compare the records field by field. The advantage of this approach is more pronounced when certain feature vectors of records are much more likely to assist in identifying matches than other attributes. Dedupe lets the user nominate the features they believe will be most useful:

fields = [
    {'field' : 'Name', 'type': 'String'},
    {'field' : 'Phone', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Address', 'type': 'String', 'has missing' : True},
    {'field' : 'Purchases', 'type': 'String'},
]


Dedupe scans the data to create tuples of records that it will propose to the user to label as being either matches, not matches, or possible matches. These uncertainPairs are identified using a combination of blocking, affine gap distance, and active learning.

Blocking is used to reduce the number of overall record comparisons that need to be made. Dedupe's method of blocking involves engineering subsets of feature vectors (these are called 'predicates') that can be compared across records. In the case of our people dataset above, the predicates might be things like:

• the first three digits of the phone number
• the full name
• the first five characters of the name
• a random 4-gram within the city name

Records are then grouped, or blocked, by matching predicates so that only records with matching predicates will be compared to each other during the active learning phase. The blocks are developed by computing the edit distance between predicates across records. Dedupe uses a distance metric called affine gap distance, a variation on edit (Levenshtein) distance that makes runs of consecutive deletions or insertions cheaper.
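As a rough illustration of why an affine gap scheme charges less for runs of edits, here is a hedged Gotoh-style dynamic-programming sketch (the penalty values are arbitrary choices of ours, not Dedupe's):

```python
def affine_gap(a, b, mismatch=1, gap_open=2, gap_extend=1):
    """Minimal affine-gap edit distance: opening a gap costs gap_open,
    extending an existing gap costs only gap_extend."""
    INF = float("inf")
    n, m = len(a), len(b)
    # M: cost ending in a (mis)match; X/Y: cost ending in a gap in a/b.
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = sub + min(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1])
            X[i][j] = min(M[i-1][j] + gap_open, X[i-1][j] + gap_extend)
            Y[i][j] = min(M[i][j-1] + gap_open, Y[i][j-1] + gap_extend)
    return min(M[n][m], X[n][m], Y[n][m])

# Deleting "def" in one run costs 4 (one open + two extends),
# cheaper than three independently opened gaps.
```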

Therefore, we might have one blocking method that groups all of the records that have the same area code of the phone number. This would result in three predicate blocks: one with a 202 area code, one with a 334, and one with NULL. There would be two records in the 202 block (IDs 452 and 821), two records in the 334 block (IDs 233 and 699), and one record in the NULL area code block (ID 720).
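That area-code blocking can be sketched in a few lines of plain Python (toy code with invented phone numbers; the record IDs come from the example above, and this is not Dedupe's internal implementation):

```python
from collections import defaultdict

# Invented phone numbers for the record IDs discussed above.
phones = {452: "2025550101", 821: "2025550177",
          233: "3345550123", 699: "3345550188", 720: None}

# Block on the predicate "first three digits of the phone number".
blocks = defaultdict(set)
for rid, phone in phones.items():
    area = phone[:3] if phone else None
    blocks[area].add(rid)

# Only records sharing a block become candidate pairs for comparison.
pairs = sorted(tuple(sorted(p))
               for ids in blocks.values()
               for p in [(a, b) for a in ids for b in ids if a < b])
# pairs: [(233, 699), (452, 821)]
```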

The relative weight of these different feature vectors can be learned during the active learning process and expressed numerically to ensure that features that will be most predictive of matches will be heavier in the overall matching schema. As the user labels more and more tuples, Dedupe gradually relearns the weights, recalculates the edit distances between records, and updates its list of the most uncertain pairs to propose to the user for labeling.

Once the user has generated enough labels, the learned weights are used to calculate the probability that each pair of records within a block is a duplicate or not. In order to scale the pairwise matching up to larger tuples of matched records (in the case that entities may appear more than twice within a document), Dedupe uses hierarchical clustering with centroidal linkage. Records within some threshold distance of a centroid will be grouped together. The final result is an annotated version of the original dataset that now includes a centroid label for each record.
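A toy one-dimensional sketch of threshold-based centroid grouping (illustrative only; Dedupe's real hierarchical clustering differs):

```python
def cluster(scores, threshold):
    """Greedily group sorted values whose distance to the running
    cluster centroid stays within threshold."""
    clusters = []
    for s in sorted(scores):
        if clusters:
            centroid = sum(clusters[-1]) / len(clusters[-1])
            if abs(s - centroid) <= threshold:
                clusters[-1].append(s)
                continue
        clusters.append([s])
    return clusters

# cluster([0.1, 0.15, 0.9, 0.95], 0.2) -> [[0.1, 0.15], [0.9, 0.95]]
```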

### Active Learning

You can see that dedupe is a command line application that will prompt the user to engage in active learning by showing pairs of entities and asking if they are the same or different.

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


Active learning is the so-called special sauce behind Dedupe. As in most supervised machine learning tasks, the challenge is to get labeled data that the model can learn from. The active learning phase in Dedupe is essentially an extended user-labeling session, which can be short if you have a small dataset and can take longer if your dataset is large. You are presented with four options:

You can experiment with typing the y, n, and u keys to flag duplicates for active learning. When you are finished, enter f to quit.

• (y)es: confirms that the two references are to the same entity
• (n)o: labels the two references as not the same entity
• (u)nsure: does not label the two references as the same entity or as different entities
• (f)inished: ends the active learning session and triggers the supervised learning phase

As you can see in the example above, some comparison decisions are very easy. The first contains zero hits across all four attributes being examined, so the verdict is almost certainly a non-match. In the second, we have a 3/4 exact match, with the fourth being fuzzy in that one entity contains a piece of the matched entity: Ryerson vs. Chicago Public Schools Ryerson. A human would be able to discern these as two references to the same entity, and we can label them as such to enable the supervised learning that comes after the active learning.

The csv_example also includes an evaluation script that will enable you to determine how successfully you were able to resolve the entities. It's important to note that the blocking, active learning and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection. In the csv_example, the script nominates the following four attributes:

fields = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Phone', 'type': 'String', 'has missing' : True},
]


A different combination of attributes would result in a different blocking, a different set of uncertainPairs, a different set of features to use in the active learning phase, and almost certainly a different result. In other words, user experience and domain knowledge factor in heavily at multiple phases of the deduplication process.

### Something a Bit More Challenging

In order to try out Dedupe with a more challenging project, we decided to try out deduplicating the White House visitors' log. Our hypothesis was that it would be interesting to be able to answer questions such as "How many times has person X visited the White House during administration Y?" However, in order to do that, it would be necessary to generate a version of the list that contained unique entities. We guessed that there would be many cases where there were multiple references to a single entity, potentially with slight variations in how they appeared in the dataset. We also expected to find a lot of names that seemed similar but in fact referenced different entities. In other words, a good challenge!

The data set we used was pulled from the WhiteHouse.gov website, a part of the executive initiative to make federal data more open to the public. This particular set of data is a list of White House visitor record requests from 2006 through 2010. Here's a snapshot of what the data looks like via the White House API.

The dataset includes a lot of columns, and for most of the entries, the majority of these fields are blank:

| Database Field | Field Description |
|----------------|-------------------|
| NAMELAST | Last name of entity |
| NAMEFIRST | First name of entity |
| NAMEMID | Middle name of entity |
| UIN | Unique Identification Number |
| Type of Access | Access type to White House |
| TOA | Time of arrival |
| POA | Post on arrival |
| TOD | Time of departure |
| POD | Post on departure |
| APPT_MADE_DATE | When the appointment date was made |
| APPT_START_DATE | When the appointment date is scheduled to start |
| APPT_END_DATE | When the appointment date is scheduled to end |
| APPT_CANCEL_DATE | When the appointment date was canceled |
| Total_People | Total number of people scheduled to attend |
| LAST_UPDATEDBY | Who was the last person to update this event |
| POST | Classified as 'WIN' |
| LastEntryDate | When the last update to this instance was made |
| TERMINAL_SUFFIX | ID for terminal used to process visitor |
| visitee_namelast | The visitee's last name |
| visitee_namefirst | The visitee's first name |
| MEETING_LOC | The location of the meeting |
| MEETING_ROOM | The room number of the meeting |
| CALLER_NAME_LAST | The last name of the person authorizing the visitor |
| CALLER_NAME_FIRST | The first name of the person authorizing the visitor |
| CALLER_ROOM | The authorizing person's room for the visitor |
| Description | Description of the event or visit |
| RELEASE_DATE | The date this set of logs was released to the public |

Using the API, the White House Visitor Log Requests can be exported in a variety of formats, including .json, .csv, .xlsx, .pdf, .xml, and RSS. However, it's important to keep in mind that the dataset contains over 5 million rows. For this reason, we decided to use .csv and grabbed the data using requests:

import requests

def getData(url, fname):
    """Download the file at url and save it locally as fname."""
    response = requests.get(url)
    with open(fname, 'wb') as f:
        f.write(response.content)

# DATAURL is the .csv export URL obtained from the White House API
ORIGFILE = "fixtures/whitehouse-visitors.csv"

getData(DATAURL, ORIGFILE)


Once downloaded, we can clean it up and load it into a database for more secure and stable storage.

## Tailoring the Code

Next, we'll discuss what is needed to tailor a dedupe example to get the code to work for the White House visitors log dataset. The main challenge with this dataset is its sheer size. First, we'll need to import a few modules and connect to our database:

import csv
import psycopg2
from dateutil import parser
from datetime import datetime

conn = None

DATABASE = "your_db_name"
USER = "your_user_name"
HOST = "your_hostname"

try:
    conn = psycopg2.connect(database=DATABASE, user=USER, host=HOST)
    print("I've connected")
except psycopg2.Error:
    print("I am unable to connect to the database")

cur = conn.cursor()


The other challenge with our dataset is the number of missing values and the datetime formatting irregularities. We wanted to use the datetime strings to help with entity resolution, so we needed the formatting to be as consistent as possible. The following script handles both the datetime parsing and the missing values by combining Python's dateutil module with PostgreSQL's fairly forgiving 'varchar' type.

This function takes the csv data as input, parses the datetime fields we're interested in, and outputs a database table that retains the desired columns ('lastname', 'firstname', 'uin', 'apptmade', 'apptstart', 'apptend', 'meeting_loc'). Keep in mind this will take a while to run.

# Column indices of the datetime fields to parse (apptmade, apptstart, apptend)
DATEFIELDS = [10, 11, 12]

def dateParseSQL(nfile):
    cur.execute('''CREATE TABLE IF NOT EXISTS visitors_er
                  (visitor_id SERIAL PRIMARY KEY,
                   lastname    varchar,
                   firstname   varchar,
                   uin         varchar,
                   apptmade    varchar,
                   apptstart   varchar,
                   apptend     varchar,
                   meeting_loc varchar);''')
    conn.commit()
    with open(nfile, 'r') as infile:
        reader = csv.reader(infile)
        next(reader)  # skip the header row
        for row in reader:
            for field in DATEFIELDS:
                if row[field] != '':
                    try:
                        dt = parser.parse(row[field])
                        row[field] = dt.toordinal()  # We also tried dt.isoformat()
                    except ValueError:
                        continue
            sql = "INSERT INTO visitors_er(lastname,firstname,uin,apptmade,apptstart,apptend,meeting_loc) \
                   VALUES (%s,%s,%s,%s,%s,%s,%s)"
            cur.execute(sql, (row[0],row[1],row[3],row[10],row[11],row[12],row[21],))
            conn.commit()
    print("All done!")

dateParseSQL(ORIGFILE)


About 60 of our rows had non-ASCII characters, which we dropped using this SQL command:

delete from visitors where firstname ~ '[^[:ascii:]]' OR lastname ~ '[^[:ascii:]]';
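For intuition, the same non-ASCII test can be expressed in Python. This is just an illustrative sketch with made-up names, not part of the pipeline:

```python
import re

# Same character class as the SQL regex: anything outside the ASCII range
non_ascii = re.compile(r'[^\x00-\x7f]')

names = ['David Culp', 'José García', 'KIM ASKEW']
flagged = [name for name in names if non_ascii.search(name)]
print(flagged)  # ['José García']
```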


For our deduplication script, we modified the PostgreSQL example as well as Dan Chudnov's adaptation of the script for the OSHA dataset.

import tempfile
import argparse
import csv
import os

import dedupe
import psycopg2
from psycopg2.extras import DictCursor


Initially, we wanted to try to use the datetime fields to deduplicate the entities, but dedupe was not a big fan of the datetime fields, whether in isoformat or ordinal, so we ended up nominating the following fields:

KEY_FIELD = 'visitor_id'
SOURCE_TABLE = 'visitors'

FIELDS = [{'field': 'firstname', 'variable name': 'firstname',
           'type': 'String', 'has missing': True},
          {'field': 'lastname', 'variable name': 'lastname',
           'type': 'String', 'has missing': True},
          {'field': 'uin', 'variable name': 'uin',
           'type': 'String', 'has missing': True},
          {'field': 'meeting_loc', 'variable name': 'meeting_loc',
           'type': 'String', 'has missing': True}
          ]


We modified a function Dan wrote to generate the predicate blocks:

def candidates_gen(result_set):
    lset = set
    block_id = None
    records = []
    i = 0
    for row in result_set:
        if row['block_id'] != block_id:
            if records:
                yield records

            block_id = row['block_id']
            records = []
            i += 1

            if i % 10000 == 0:
                print('{} blocks'.format(i))

        smaller_ids = row['smaller_ids']
        if smaller_ids:
            smaller_ids = lset(smaller_ids.split(','))
        else:
            smaller_ids = lset([])

        records.append((row[KEY_FIELD], row, smaller_ids))

    if records:
        yield records


And we adapted the method from the dedupe-examples repo to handle the active learning, supervised learning, and clustering steps:

def find_dupes(args):
    deduper = dedupe.Dedupe(FIELDS)

    with psycopg2.connect(database=args.dbname,
                          host='localhost',
                          cursor_factory=DictCursor) as con:
        with con.cursor() as c:
            c.execute('SELECT COUNT(*) AS count FROM %s' % SOURCE_TABLE)
            row = c.fetchone()
            count = row['count']
            sample_size = int(count * args.sample)

            print('Generating sample of {} records'.format(sample_size))
            with con.cursor('deduper') as c_deduper:
                c_deduper.execute('SELECT visitor_id,lastname,firstname,uin,meeting_loc FROM %s' % SOURCE_TABLE)
                temp_d = dict((i, row) for i, row in enumerate(c_deduper))
                deduper.sample(temp_d, sample_size)
                del temp_d

            if os.path.exists(args.training):
                print('Loading training file from {}'.format(args.training))
                with open(args.training) as tf:
                    deduper.readTraining(tf)

            print('Starting active learning')
            dedupe.convenience.consoleLabel(deduper)

            print('Starting training')
            deduper.train(ppc=0.001, uncovered_dupes=5)

            print('Saving new training file to {}'.format(args.training))
            with open(args.training, 'w') as training_file:
                deduper.writeTraining(training_file)

            deduper.cleanupTraining()

            print('Creating blocking_map table')
            c.execute("""
                DROP TABLE IF EXISTS blocking_map
                """)
            c.execute("""
                CREATE TABLE blocking_map
                (block_key VARCHAR(200), %s INTEGER)
                """ % KEY_FIELD)

            for field in deduper.blocker.index_fields:
                print('Selecting distinct values for "{}"'.format(field))
                c_index = con.cursor('index')
                c_index.execute("""
                    SELECT DISTINCT %s FROM %s
                    """ % (field, SOURCE_TABLE))
                field_data = (row[field] for row in c_index)
                deduper.blocker.index(field_data, field)
                c_index.close()

            print('Generating blocking map')
            c_block = con.cursor('block')
            c_block.execute("""
                SELECT * FROM %s
                """ % SOURCE_TABLE)
            full_data = ((row[KEY_FIELD], row) for row in c_block)
            b_data = deduper.blocker(full_data)

            print('Inserting blocks into blocking_map')
            csv_file = tempfile.NamedTemporaryFile(prefix='blocks_', mode='w',
                                                   delete=False)
            csv_writer = csv.writer(csv_file)
            csv_writer.writerows(b_data)
            csv_file.close()

            f = open(csv_file.name, 'r')
            c.copy_expert("COPY blocking_map FROM STDIN CSV", f)
            f.close()

            os.remove(csv_file.name)

            con.commit()

            print('Indexing blocks')
            c.execute("""
                CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)
                """)
            c.execute("DROP TABLE IF EXISTS plural_key")
            c.execute("DROP TABLE IF EXISTS plural_block")
            c.execute("DROP TABLE IF EXISTS covered_blocks")
            c.execute("DROP TABLE IF EXISTS smaller_coverage")

            print('Calculating plural_key')
            c.execute("""
                CREATE TABLE plural_key
                (block_key VARCHAR(200),
                 block_id SERIAL PRIMARY KEY)
                """)
            c.execute("""
                INSERT INTO plural_key (block_key)
                SELECT block_key FROM blocking_map
                GROUP BY block_key HAVING COUNT(*) > 1
                """)

            print('Indexing block_key')
            c.execute("""
                CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)
                """)

            print('Calculating plural_block')
            c.execute("""
                CREATE TABLE plural_block
                AS (SELECT block_id, %s
                    FROM blocking_map INNER JOIN plural_key
                    USING (block_key))
                """ % KEY_FIELD)

            print('Adding {} index'.format(KEY_FIELD))
            c.execute("""
                CREATE INDEX plural_block_%s_idx
                ON plural_block (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            c.execute("""
                CREATE UNIQUE INDEX plural_block_block_id_%s_uniq
                ON plural_block (block_id, %s)
                """ % (KEY_FIELD, KEY_FIELD))

            print('Creating covered_blocks')
            c.execute("""
                CREATE TABLE covered_blocks AS
                (SELECT %s,
                        string_agg(CAST(block_id AS TEXT), ','
                                   ORDER BY block_id) AS sorted_ids
                 FROM plural_block
                 GROUP BY %s)
                """ % (KEY_FIELD, KEY_FIELD))

            print('Indexing covered_blocks')
            c.execute("""
                CREATE UNIQUE INDEX covered_blocks_%s_idx
                ON covered_blocks (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            print('Committing')

            print('Creating smaller_coverage')
            c.execute("""
                CREATE TABLE smaller_coverage AS
                (SELECT %s, block_id,
                        TRIM(',' FROM split_part(sorted_ids,
                                                 CAST(block_id AS TEXT), 1))
                        AS smaller_ids
                 FROM plural_block
                 INNER JOIN covered_blocks
                 USING (%s))
                """ % (KEY_FIELD, KEY_FIELD))
            con.commit()

            print('Clustering...')
            c_cluster = con.cursor('cluster')
            c_cluster.execute("""
                SELECT *
                FROM smaller_coverage
                INNER JOIN %s
                USING (%s)
                ORDER BY (block_id)
                """ % (SOURCE_TABLE, KEY_FIELD))
            clustered_dupes = deduper.matchBlocks(
                candidates_gen(c_cluster), threshold=0.5)

            print('Creating entity_map table')
            c.execute("DROP TABLE IF EXISTS entity_map")
            c.execute("""
                CREATE TABLE entity_map (
                    %s INTEGER,
                    canon_id INTEGER,
                    cluster_score FLOAT,
                    PRIMARY KEY(%s)
                )""" % (KEY_FIELD, KEY_FIELD))

            print('Inserting entities into entity_map')
            for cluster, scores in clustered_dupes:
                cluster_id = cluster[0]
                for key_field, score in zip(cluster, scores):
                    c.execute("""
                        INSERT INTO entity_map
                        (%s, canon_id, cluster_score)
                        VALUES (%s, %s, %s)
                        """ % (KEY_FIELD, key_field, cluster_id, score))

            c_cluster.close()
            c.execute("CREATE INDEX head_index ON entity_map (canon_id)")
            con.commit()

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--dbname', dest='dbname', default='whitehouse', help='database name')
    parser.add_argument('-s', '--sample', default=0.10, type=float, help='sample size (percentage, default 0.10)')
    parser.add_argument('-t', '--training', default='training.json', help='name of training file')
    args = parser.parse_args()
    find_dupes(args)


## Active Learning Observations

We ran multiple experiments:

• Test 1: lastname, firstname, meeting_loc => 447 (15 minutes of training)
• Test 2: lastname, firstname, uin, meeting_loc => 3385 (5 minutes of training) - one instance that had 168 duplicates

We observed a lot of uncertainty during the active learning phase, mostly because of how enormous the dataset is. This was particularly pronounced with names that seemed more common to us and that sounded more domestic since those are much more commonly occurring in this dataset. For example, are two records containing the name Michael Grant the same entity?

Additionally, we noticed that there were a lot of variations in the way that middle names were captured. Sometimes they were concatenated with the first name, other times with the last name. We also observed many of what seemed to be nicknames, though these could also have been references to separate entities: KIM ASKEW vs. KIMBERLEY ASKEW and Kathy Edwards vs. Katherine Edwards (and yes, dedupe does preserve variations in case). On the other hand, since nicknames generally appear only in people's first names, when we saw a short version of a first name paired with an unusual or rare last name, we were more confident in labeling those as a match.

Other things that made the labeling easier were clearly gendered names (e.g. Brian Murphy vs. Briana Murphy), which helped us to identify separate entities in spite of very small differences in the strings. Some names appeared to be clear misspellings, which also made us more confident in our labeling two references as matches for a single entity (Davifd Culp vs. David Culp). There were also a few potential easter eggs in the dataset, which we suspect might actually be aliases (Jon Doe and Ben Jealous).

One of the things we discovered over multiple runs of the active learning process is that the number of fields the user nominates to Dedupe has a great impact on the kinds of predicate blocks that are generated during the initial blocking phase, and thus on the comparisons that are presented to the trainer during the active learning phase. In one of our runs, we used only the last name, first name, and meeting location fields. Some of the comparisons were easy:

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


Some were hard:

lastname : Desimone
firstname : Daniel
meeting_loc : OEOB

lastname : DeSimone
firstname : Daniel
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


## Results

What we realized from this is that there are two different kinds of duplicates that appear in our dataset. The first kind of duplicate is one that is generated via (likely mistaken) duplicate visitor request forms. We noticed that these duplicate entries tended to be proximal to each other in terms of visitor_id number, and to have the same meeting location and the same uin (which, confusingly, is not a unique guest identifier but appears to be assigned to every visitor within a unique tour group). The second kind of duplicate is what we think of as the frequent flier: people who seem to spend a lot of time at the White House, like staffers and other political appointees.

During the dedupe process, we computed that there were 332,606 potential duplicates within the data set of 1,048,576 entities. For this particular data set, we would expect figures of this magnitude, knowing that people visit for repeat business or social functions.
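For reference, a duplicate count like this can be tallied from the clustered_dupes structure that matchBlocks returns: each cluster of n records contributes n - 1 redundant rows. A minimal sketch, with cluster IDs and scores made up for illustration:

```python
# clustered_dupes is a list of (record_ids, scores) pairs, as returned by
# dedupe's matchBlocks; the values below are made up for illustration.
clustered_dupes = [
    ((101, 102, 103), (0.95, 0.93, 0.90)),  # three records resolving to one entity
    ((204, 205), (0.88, 0.88)),             # two records resolving to one entity
]

# A cluster of n records implies n - 1 potential duplicates.
num_duplicates = sum(len(ids) - 1 for ids, scores in clustered_dupes)
print(num_duplicates)  # 3
```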

### Within-Visit Duplicates

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671


### Across-Visit Duplicates (Frequent Fliers)

lastname : TANGHERLINI
meeting_loc : OEOB
firstname : DANIEL
uin : U02692

lastname : TANGHERLINI
meeting_loc : NEOB
firstname : DANIEL
uin : U73085

lastname : ARCHULETA
meeting_loc : WH
firstname : KATHERINE
uin : U68121

lastname : ARCHULETA
meeting_loc : OEOB
firstname : KATHERINE
uin : U76331


## Conclusion

In this beginner's guide to Entity Resolution, we learned what it means to identify entities and their possible duplicates within and across records. To further examine this data beyond the scope of this blog post, we would like to determine which records are true duplicates. This would require additional information to canonicalize these entities, thus allowing for potential indexing of entities for future assessments. Ultimately, we discovered the importance of entity resolution across a variety of domains, such as counter-terrorism, customer databases, and voter registration.

Please return to the District Data Labs blog for upcoming posts on entity resolution and discussions of a number of other topics important to the data science community. Upcoming post topics from our research group include string matching algorithms, data preparation, and entity identification!

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

### Data Exploration with Python, Part 2

This is the second post in our Data Exploration with Python series. Before reading this post, make sure to check out Data Exploration with Python, Part 1!

Mise en place (noun): In a professional kitchen, the disciplined organization and preparation of equipment and food before service begins.

When performing exploratory data analysis (EDA), it is important to not only prepare yourself (the analyst) but to prepare your data as well. As we discussed in the previous post, a small amount of preparation will often save you a significant amount of time later on. So let's review where we should be at this point and then continue our exploration process with data preparation.

In Part 1 of this series, we were introduced to the data exploration framework we will be using. As a reminder, here is what that framework looks like.

We also introduced the example data set we are going to be using to illustrate the different phases and stages of the framework. Here is what that looks like.

We then familiarized ourselves with our data set by identifying the types of information and entities encoded within it. We also reviewed several data transformation and visualization methods that we will use later to explore and analyze it. Now we are at the last stage of the framework's Prep Phase, the Create stage, where our goal will be to create additional categorical fields that will make our data easier to explore and allow us to view it from new perspectives.

## Renaming Columns to be More Intuitive

Before we dive in and start creating categories, however, we have an opportunity to improve our categorization efforts by examining the columns in our data and making sure their labels intuitively convey what they represent. Just as with the other aspects of preparation, changing them now will save us from having to remember what displ or co2TailpipeGpm mean when they show up on a chart later. In my experience, these small, detail-oriented enhancements to the beginning of your process usually compound and preserve cognitive cycles that you can later apply to extracting insights.

We can use the code below to rename the columns in our vehicles data frame.

vehicles.columns = ['Make','Model','Year','Engine Displacement','Cylinders',
                    'Transmission','Drivetrain','Vehicle Class','Fuel Type',
                    'Fuel Barrels/Year','City MPG','Highway MPG','Combined MPG',
                    'CO2 Emission Grams/Mile','Fuel Cost/Year']


## Thinking About Categorization

Now that we have changed our column names to be more intuitive, let's take a moment to think about what categorization is and examine the categories that currently exist in our data set. At the most basic level, categorization is just a way that humans structure information — how we hierarchically create order out of complexity. Categories are formed based on attributes that entities have in common, and they present us with different perspectives from which we can view and think about our data.

Our primary objective in this stage is to create additional categories that will help us further organize our data. This will prove beneficial not only for the exploratory analysis we will conduct but also for any supervised machine learning or modeling that may happen further down the data science pipeline. Seasoned data scientists know that the better your data is organized, the better downstream analyses you will be able to perform and the more informative features you will have to feed into your machine learning models.

In this stage of the framework, we are going to create additional categories in 3 distinct ways:

• Category Aggregations
• Binning Continuous Variables
• Clustering

Now that we have a better idea of what we are doing and why, let's get started.

### Aggregating to Higher-Level Categories

The first way we are going to create additional categories is by identifying opportunities to create higher-level categories out of the variables we already have in our data set. In order to do this, we need to get a sense of what categories currently exist in the data. We can do this by iterating through our columns and printing out the name, the number of unique values, and the data type for each.

def unique_col_values(df):
    for column in df:
        print("{} | {} | {}".format(
            df[column].name, len(df[column].unique()), df[column].dtype
        ))

unique_col_values(vehicles)

Make | 126 | object
Model | 3491 | object
Year | 33 | int64
Engine Displacement | 65 | float64
Cylinders | 9 | float64
Transmission | 43 | object
Drivetrain | 7 | object
Vehicle Class | 34 | object
Fuel Type | 13 | object
Fuel Barrels/Year | 116 | float64
City MPG | 48 | int64
Highway MPG | 49 | int64
Combined MPG | 45 | int64
CO2 Emission Grams/Mile | 550 | float64
Fuel Cost/Year | 58 | int64


From looking at the output, it is clear that we have some numeric columns (int64 and float64) and some categorical columns (object). For now, let's focus on the six categorical columns in our data set.

• Make: 126 unique values
• Model: 3,491 unique values
• Transmission: 43 unique values
• Drivetrain: 7 unique values
• Vehicle Class: 34 unique values
• Fuel Type: 13 unique values

When aggregating and summarizing data, having too many categories can be problematic. The average human is said to have the ability to hold 7 objects at a time in their short-term working memory. Accordingly, I have noticed that once you exceed 8-10 discrete values in a category, it becomes increasingly difficult to get a holistic picture of how the entire data set is divided up.

What we want to do is examine the values in each of our categorical variables to determine where opportunities exist to aggregate them into higher-level categories. The way this is typically done is by using a combination of clues from the current categories and any domain knowledge you may have (or be able to acquire).

For example, imagine aggregating by Transmission, which has 43 discrete values in our data set. It is going to be difficult to derive insights due to the fact that any aggregated metrics are going to be distributed across more categories than you can hold in short-term memory. However, if we examine the different transmission categories with the goal of finding common features that we can group on, we would find that all 43 values fall into one of two transmission types, Automatic or Manual.

Let's create a new Transmission Type column in our data frame and, with the help of the loc method in pandas, assign it a value of Automatic where the first character of Transmission is the letter A and a value of Manual where the first character is the letter M.

AUTOMATIC = "Automatic"
MANUAL = "Manual"

vehicles.loc[vehicles['Transmission'].str.startswith('A'),
             'Transmission Type'] = AUTOMATIC

vehicles.loc[vehicles['Transmission'].str.startswith('M'),
             'Transmission Type'] = MANUAL


We can apply the same logic to the Vehicle Class field. We originally have 34 vehicle classes, but we can distill those down into 8 vehicle categories, which are much easier to remember.

small = ['Compact Cars','Subcompact Cars','Two Seaters','Minicompact Cars']
midsize = ['Midsize Cars']
large = ['Large Cars']

vehicles.loc[vehicles['Vehicle Class'].isin(small),
             'Vehicle Category'] = 'Small Cars'

vehicles.loc[vehicles['Vehicle Class'].isin(midsize),
             'Vehicle Category'] = 'Midsize Cars'

vehicles.loc[vehicles['Vehicle Class'].isin(large),
             'Vehicle Category'] = 'Large Cars'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Station'),
             'Vehicle Category'] = 'Station Wagons'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Truck'),
             'Vehicle Category'] = 'Pickup Trucks'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Special Purpose'),
             'Vehicle Category'] = 'Special Purpose'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Sport Utility'),
             'Vehicle Category'] = 'Sport Utility'

vehicles.loc[(vehicles['Vehicle Class'].str.lower().str.contains('van')),
             'Vehicle Category'] = 'Vans & Minivans'


Next, let's look at the Make and Model fields, which have 126 and 3,491 unique values respectively. While I can't think of a way to get either of those down to 8-10 categories, we can create another potentially informative field by concatenating Make and the first word of the Model field together into a new Model Type field. This would allow us to, for example, categorize all Chevrolet Suburban C1500 2WD vehicles and all Chevrolet Suburban K1500 4WD vehicles as simply Chevrolet Suburbans.

vehicles['Model Type'] = (vehicles['Make'] + " " +
                          vehicles['Model'].str.split().str.get(0))
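Applied to the Suburban example above, the concatenation behaves like this (a toy frame standing in for the real data):

```python
import pandas as pd

df = pd.DataFrame({
    'Make': ['Chevrolet', 'Chevrolet'],
    'Model': ['Suburban C1500 2WD', 'Suburban K1500 4WD'],
})

# Keep the make plus only the first word of the model
df['Model Type'] = df['Make'] + " " + df['Model'].str.split().str.get(0)
print(df['Model Type'].tolist())  # ['Chevrolet Suburban', 'Chevrolet Suburban']
```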


Finally, let's look at the Fuel Type field, which has 13 unique values. On the surface, that doesn't seem too bad, but upon further inspection, you'll notice some complexity embedded in the categories that could probably be organized more intuitively.

vehicles['Fuel Type'].unique()

array(['Regular', 'Premium', 'Diesel', 'Premium and Electricity',
'Premium or E85', 'Premium Gas or Electricity', 'Gasoline or E85',
'Gasoline or natural gas', 'CNG', 'Regular Gas or Electricity',
'Midgrade', 'Regular Gas and Electricity', 'Gasoline or propane'],
dtype=object)


This is interesting and a little tricky because there are some categories that contain a single fuel type and others that contain multiple fuel types. In order to organize this better, we will create two sets of categories from these fuel types. The first will be a set of columns that will be able to represent the different combinations, while still preserving the individual fuel types.

vehicles['Gas'] = 0
vehicles['Ethanol'] = 0
vehicles['Electric'] = 0
vehicles['Propane'] = 0
vehicles['Natural Gas'] = 0

# Pattern reconstructed: matches every listed fuel type except 'CNG'
vehicles.loc[vehicles['Fuel Type'].str.contains(
    'Regular|Midgrade|Premium|Diesel|Gasoline'),'Gas'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('E85'),'Ethanol'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('Electricity'),'Electric'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('propane'),'Propane'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('natural|CNG'),'Natural Gas'] = 1


As it turns out, 99% of the vehicles in our database have gas as a fuel type, either by itself or combined with another fuel type. Since that is the case, let's create a second set of categories - specifically, a new Gas Type field that extracts the type of gas (Regular, Midgrade, Premium, Diesel, or Natural) each vehicle accepts.

vehicles.loc[vehicles['Fuel Type'].str.contains(
    'Regular|Gasoline'),'Gas Type'] = 'Regular'

vehicles.loc[vehicles['Fuel Type'] == 'Midgrade',
             'Gas Type'] = 'Midgrade'

vehicles.loc[vehicles['Fuel Type'].str.contains('Premium'),
             'Gas Type'] = 'Premium'

vehicles.loc[vehicles['Fuel Type'] == 'Diesel',
             'Gas Type'] = 'Diesel'

vehicles.loc[vehicles['Fuel Type'].str.contains('natural|CNG'),
             'Gas Type'] = 'Natural'


An important thing to note about what we have done with all of the categorical fields in this section is that, while we created new categories, we did not overwrite the original ones. We created additional fields that will allow us to view the information contained within the data set at different (often higher) levels. If you need to drill down to the more granular original categories, you can always do that. However, now we have a choice whereas before we performed these category aggregations, we did not.
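To make that concrete, here is a small sketch (on a toy frame rather than the full vehicles data) of how the aggregated and granular columns coexist:

```python
import pandas as pd

# Toy stand-in: the original granular column and the new aggregate side by side
df = pd.DataFrame({
    'Vehicle Class': ['Compact Cars', 'Subcompact Cars', 'Midsize Cars'],
    'Vehicle Category': ['Small Cars', 'Small Cars', 'Midsize Cars'],
})

# High-level view for summaries and charts
print(df['Vehicle Category'].value_counts())

# ...and we can still drill down to the original categories when needed
small = df.loc[df['Vehicle Category'] == 'Small Cars', 'Vehicle Class']
print(small.tolist())  # ['Compact Cars', 'Subcompact Cars']
```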

### Creating Categories from Continuous Variables

The next way we can create additional categories in our data is by binning some of our continuous variables - breaking them up into different categories based on a threshold or distribution. There are multiple ways you can do this, but I like to use quintiles because it gives me one middle category, two categories outside of that which are moderately higher and lower, and then two extreme categories at the ends. I find that this is a very intuitive way to break things up and provides some consistency across categories. In our data set, I've identified 4 fields that we can bin this way.

Binning essentially looks at how the data is distributed, creates the necessary number of bins by splitting up the range of values (either equally or based on explicit boundaries), and then categorizes records into the appropriate bin that their continuous value falls into. Pandas has a qcut method that makes binning extremely easy, so let's use that to create our quintiles for each of the continuous variables we identified.

efficiency_categories = ['Very Low Efficiency', 'Low Efficiency',
                         'Moderate Efficiency', 'High Efficiency',
                         'Very High Efficiency']

vehicles['Fuel Efficiency'] = pd.qcut(vehicles['Combined MPG'],
                                      5, efficiency_categories)

engine_categories = ['Very Small Engine', 'Small Engine', 'Moderate Engine',
                     'Large Engine', 'Very Large Engine']

vehicles['Engine Size'] = pd.qcut(vehicles['Engine Displacement'],
                                  5, engine_categories)

emission_categories = ['Very Low Emissions', 'Low Emissions',
                       'Moderate Emissions', 'High Emissions',
                       'Very High Emissions']

vehicles['Emissions'] = pd.qcut(vehicles['CO2 Emission Grams/Mile'],
                                5, emission_categories)

fuelcost_categories = ['Very Low Fuel Cost', 'Low Fuel Cost',
                       'Moderate Fuel Cost', 'High Fuel Cost',
                       'Very High Fuel Cost']

vehicles['Fuel Cost'] = pd.qcut(vehicles['Fuel Cost/Year'],
                                5, fuelcost_categories)


### Clustering to Create Additional Categories

The final way we are going to prepare our data is by clustering to create additional categories. There are a few reasons why I like to use clustering for this. First, it takes multiple fields into consideration together at the same time, whereas the other categorization methods only consider one field at a time. This will allow you to categorize together entities that are similar across a variety of attributes, but might not be close enough in each individual attribute to get grouped together.

Clustering also creates new categories for you automatically, which takes much less time than having to comb through the data yourself identifying patterns across attributes that you can form categories on. It will automatically group similar items together for you.

The third reason I like to use clustering is because it will sometimes group things in ways that you, as a human, may not have thought of. I'm a big fan of humans and machines working together to optimize analytical processes, and this is a good example of value that machines bring to the table that can be helpful to humans. I'll write more about my thoughts on that in future posts, but for now, let's move on to clustering our data.

The first thing we are going to do is isolate the columns we want to use for clustering. These are going to be columns with numeric values, as the clustering algorithm will need to compute distances in order to group similar vehicles together.

cluster_columns = ['Engine Displacement','Cylinders','Fuel Barrels/Year',
                   'City MPG','Highway MPG','Combined MPG',
                   'CO2 Emission Grams/Mile','Fuel Cost/Year']


Next, we want to scale the features we are going to cluster on. There are a variety of ways to normalize and scale variables, but I'm going to keep things relatively simple and just use Scikit-Learn's MaxAbsScaler, which will divide each value by the max absolute value for that feature. This will preserve the distributions in the data and convert the values in each field to a number between 0 and 1 (technically -1 and 1, but we don't have any negatives).

from sklearn import preprocessing
scaler = preprocessing.MaxAbsScaler()

vehicle_clusters = scaler.fit_transform(vehicles[cluster_columns])
vehicle_clusters = pd.DataFrame(vehicle_clusters, columns=cluster_columns)


Now that our features are scaled, let's write a couple of functions. The first function we are going to write is a kmeans_cluster function that will k-means cluster a given data frame into a specified number of clusters. It will then return a copy of the original data frame with those clusters appended in a column named Cluster.

from sklearn.cluster import KMeans

def kmeans_cluster(df, n_clusters=2):
    model = KMeans(n_clusters=n_clusters, random_state=1)
    clusters = model.fit_predict(df)
    cluster_results = df.copy()
    cluster_results['Cluster'] = clusters
    return cluster_results


Our second function, called summarize_clustering, is going to count the number of vehicles that fall into each cluster and calculate the cluster means for each feature. It is going to merge the counts and means into a single data frame and then return that summary to us.

def summarize_clustering(results):
    cluster_size = results.groupby(['Cluster']).size().reset_index()
    cluster_size.columns = ['Cluster', 'Count']
    cluster_means = results.groupby(['Cluster'], as_index=False).mean()
    cluster_summary = pd.merge(cluster_size, cluster_means, on='Cluster')
    return cluster_summary


We now have functions for what we need to do, so the next step is to actually cluster our data. But wait, our kmeans_cluster function is supposed to accept a number of clusters. How do we determine how many clusters we want?

There are a number of approaches for figuring this out, but for the sake of simplicity, we are just going to plug in a couple of numbers and visualize the results to arrive at a good enough estimate. Remember earlier in this post where we were trying to aggregate our categorical variables into fewer than 8-10 discrete values? We are going to apply the same logic here. Let's start out with 8 clusters and see what kind of results we get.
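As an aside, one of the more principled approaches mentioned above is the "elbow" heuristic: fit k-means for a range of cluster counts and look for the k where the within-cluster sum of squares (inertia) stops dropping sharply. Here is a minimal, self-contained sketch on synthetic data standing in for our scaled vehicle features (the blob centers and sizes are made up for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(1)
# Synthetic stand-in for the scaled features: three tight blobs in [0, 1]
data = np.vstack([rng.normal(loc=c, scale=0.05, size=(50, 2))
                  for c in (0.1, 0.5, 0.9)])

# Fit k-means for a range of cluster counts and record the inertia
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(data)
    inertias[k] = model.inertia_

# Plotting inertias against k (e.g. with matplotlib) shows a sharp drop
# up to the true number of clusters and only marginal gains after it
```

On this toy data, the drop in inertia from k=2 to k=3 dwarfs the drop from k=3 to k=4, which is exactly the "elbow" we would read off the plot.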

cluster_results = kmeans_cluster(vehicle_clusters, 8)
cluster_summary = summarize_clustering(cluster_results)


After running the couple of lines of code above, your cluster_summary should look similar to the following.

By looking at the Count column, you can tell that there are some clusters that have significantly more records in them (ex. Cluster 7) and others that have significantly fewer (ex. Cluster 3). Other than that, though, it is difficult to notice anything informative about the summary. I don't know about you, but to me, the rest of the summary just looks like a bunch of decimals in a table.

This is a prime opportunity to use a visualization to discover insights faster. With just a couple import statements and a single line of code, we can light this summary up in a heatmap so that we can see the contrast between all those decimals and between the different clusters.

import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(cluster_summary[cluster_columns].transpose(), annot=True)


In this heatmap, the rows represent the features and the columns represent the clusters, so we can compare how similar or different the clusters look. Our goal for clustering these features is to ultimately create meaningful categories out of the clusters, so we want to get to the point where we can clearly distinguish one from the others. This heatmap allows us to do this quickly and visually.

With this goal in mind, it is apparent that we probably have too many clusters because:

• Clusters 3, 4, and 7 look pretty similar
• Clusters 2 and 5 look similar as well
• Clusters 0 and 6 are also a little close for comfort

From the way our heatmap currently looks, I'm willing to bet that we can cut the number of clusters in half and get clearer boundaries. Let's re-run the clustering, summary, and heatmap code for 4 clusters and see what kind of results we get.

cluster_results = kmeans_cluster(vehicle_clusters, 4)
cluster_summary = summarize_clustering(cluster_results)

sns.heatmap(cluster_summary[cluster_columns].transpose(), annot=True)


These clusters look more distinct, don't they? Clusters 1 and 3 look like they are polar opposites of each other, cluster 0 looks like it’s pretty well balanced across all the features, and cluster 2 looks like it’s about half-way between Cluster 0 and Cluster 1.

We now have a good number of clusters, but we still have a problem. It is difficult to remember what clusters 0, 1, 2, and 3 mean, so as a next step, I like to assign descriptive names to the clusters based on their properties. In order to do this, we need to look at the levels of each feature for each cluster and come up with intuitive natural language descriptions for them. You can have some fun and can get as creative as you want here, but just keep in mind that the objective is for you to be able to remember the characteristics of whatever label you assign to the clusters.

• Cluster 1 vehicles seem to have large engines that consume a lot of fuel, process it inefficiently, produce a lot of emissions, and cost a lot to fill up. I'm going to label them Large Inefficient.
• Cluster 3 vehicles have small, fuel efficient engines that don't produce a lot of emissions and are relatively inexpensive to fill up. I'm going to label them Small Very Efficient.
• Cluster 0 vehicles are fairly balanced across every category, so I'm going to label them Midsized Balanced.
• Cluster 2 vehicles have large engines but are more moderately efficient than the vehicles in Cluster 1, so I'm going to label them Large Moderately Efficient.

Now that we have come up with these descriptive names for our clusters, let's add a Cluster Name column to our cluster_results data frame, and then copy the cluster names over to our original vehicles data frame.

cluster_results['Cluster Name'] = ''
cluster_results.loc[cluster_results['Cluster']==0, 'Cluster Name'] = 'Midsized Balanced'
cluster_results.loc[cluster_results['Cluster']==1, 'Cluster Name'] = 'Large Inefficient'
cluster_results.loc[cluster_results['Cluster']==2, 'Cluster Name'] = 'Large Moderately Efficient'
cluster_results.loc[cluster_results['Cluster']==3, 'Cluster Name'] = 'Small Very Efficient'

vehicles = vehicles.reset_index(drop=True)
vehicles['Cluster Name'] = cluster_results['Cluster Name']


## Conclusion

In this post, we examined several ways to prepare a data set for exploratory analysis. First, we looked at the categorical variables we had and attempted to find opportunities to roll them up into higher-level categories. After that, we converted some of our continuous variables into categorical ones by binning them into quintiles based on how relatively high or low their values were. Finally, we used clustering to efficiently create categories that automatically take multiple fields into consideration. The result of all this preparation is that we now have several columns containing meaningful categories that will provide different perspectives of our data and allow us to acquire as many insights as possible.

Now that we have these meaningful categories, our data set is in really good shape, which means that we can move on to the next phase of our data exploration framework. In the next post, we will cover the first two stages of the Explore Phase and demonstrate various ways to visually aggregate, pivot, and identify relationships between fields in our data. Make sure to subscribe to the DDL blog so that you get notified when we publish it!

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

### Data Exploration with Python, Part 3

This is the third post in our Data Exploration with Python series. Before reading this post, make sure to check out Part 1 and Part 2!

Preparing yourself and your data like we have done thus far in this series is essential to analyzing your data well. However, the most exciting part of Exploratory Data Analysis (EDA) is actually getting in there, exploring the data, and discovering insights. That's exactly what we are going to start doing in this post.

We will begin with the cleaned and prepped vehicle fuel economy data set that we ended up with at the end of the last post. This version of the data set contains:

• The higher-level categories we created via category aggregations.
• The quintiles we created by binning our continuous variables.
• The clusters we generated via k-means clustering based on numeric variables.

Now, without further ado, let's embark on our insight-finding mission!

## Making Our Data Smaller: Filter + Aggregate

One of the fundamental ways to extract insights from a data set is to reduce the size of the data so that you can look at just a piece of it at a time. There are two ways to do this: filtering and aggregating. With filtering, you are essentially removing either rows or columns (or both rows and columns) in order to focus on a subset of the data that interests you. With aggregation, the objective is to group records in your data set that have similar categorical attributes and then perform some calculation (count, sum, mean, etc.) on one or more numerical fields so that you can observe and identify differences between records that fall into each group.

To begin filtering and aggregating our data set, we could write a function like the one below to aggregate based on a group_field that we provide, counting the number of rows in each group. To make things more intuitive and easier to interpret, we will also sort the data from most frequent to least and format it in a pandas data frame with appropriate column names.

def agg_count(df, group_field):
    grouped = df.groupby(group_field).size()
    grouped = grouped.sort_values(ascending=False)

    grouped = grouped.reset_index()
    grouped.columns = [group_field, 'Count']
    return grouped


Now that we have this function in our toolkit, let's use it. Suppose we were looking at the Vehicle Category field in our data set and were curious about the number of vehicles in each category that were manufactured last year (2016). Here is how we would filter the data and use the agg_count function to transform it to show what we wanted to know.

vehicles_2016 = vehicles[vehicles['Year']==2016]
category_counts = agg_count(vehicles_2016, 'Vehicle Category')


This gives us what we want in tabular form, but we could take it a step further and visualize it with a horizontal bar chart.

ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Vehicles Manufactured by Category (2016) \n')


Now that we know how to do this, we can filter, aggregate, and plot just about anything in our data set with just a few lines of code. For example, here is the same metric but filtered for a different year (1985).

vehicles_1985 = vehicles[vehicles['Year']==1985]
category_counts = agg_count(vehicles_1985, 'Vehicle Category')

ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Vehicles Manufactured by Category (1985) \n')


If we wanted to stick with the year 2016 but drill down to the more granular Vehicle Class, we could do that as well.

class_counts = agg_count(vehicles_2016, 'Vehicle Class')

ax = sns.barplot(data=class_counts, x='Count', y='Vehicle Class')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Vehicles Manufactured by Class (2016) \n')


We could also look at vehicle counts by manufacturer.

make_counts = agg_count(vehicles_2016, 'Make')

ax = sns.barplot(data=make_counts, x='Count', y='Make')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Vehicles Manufactured by Make (2016) \n')


What if we wanted to filter by something other than the year? We could do that by simply creating a different filtered data frame and passing that to our agg_count function. Below, instead of filtering by Year, I've filtered on the Fuel Efficiency field, which contains the fuel efficiency quintiles we generated in the last post. Let's choose the Very High Efficiency value so that we can see how many very efficient vehicles each manufacturer has made.

very_efficient = vehicles[vehicles['Fuel Efficiency']=='Very High Efficiency']
make_counts = agg_count(very_efficient, 'Make')

ax = sns.barplot(data=make_counts, x='Count', y='Make')
ax.set(xlabel='\n Number of Vehicles Manufactured')
plt.title('Very Fuel Efficient Vehicles by Make \n')


What if we wanted to perform some other calculation, such as averaging, instead of counting the number of records that fall into each group? We can just create a new function called agg_avg that calculates the mean of a designated numerical field.

def agg_avg(df, group_field, calc_field):
    grouped = df.groupby(group_field, as_index=False)[calc_field].mean()
    grouped = grouped.sort_values(calc_field, ascending=False)
    grouped.columns = [group_field, 'Avg ' + str(calc_field)]
    return grouped


We can then simply swap out the agg_count function with our new agg_avg function and indicate what field we would like to use for our calculation. Below is an example showing the average fuel efficiency, represented by the Combined MPG field, by vehicle category.

category_avg_mpg = agg_avg(vehicles_2016, 'Vehicle Category', 'Combined MPG')

ax = sns.barplot(data=category_avg_mpg, x='Avg Combined MPG', y='Vehicle Category')
ax.set(xlabel='\n Average Combined MPG')
plt.title('Average Combined MPG by Category (2016) \n')


## Pivoting the Data for More Detail

Up until this point, we've been looking at our data at a pretty high level, aggregating up by a single variable. Sure, we were able to drill down from Vehicle Category to Vehicle Class to get a more granular view, but we only looked at the data one hierarchical level at a time. Next, we're going to go into further detail by taking a look at two or three variables at a time. The way we are going to do this is via pivot tables and their visual equivalents, pivot heatmaps.

First, we will create a pivot_count function, similar to the agg_count function we created earlier, that will transform whatever data frame we feed it into a pivot table with the rows, columns, and calculated field we specify.

import numpy as np

def pivot_count(df, rows, columns, calc_field):
    df_pivot = df.pivot_table(values=calc_field,
                              index=rows,
                              columns=columns,
                              aggfunc=np.size
                              ).dropna(axis=0, how='all')
    return df_pivot


We will then use this function on our vehicles_2016 data frame and pivot it out with the Fuel Efficiency quintiles we created in the last post representing the rows, the Engine Size quintiles representing the columns, and then counting the number of vehicles that had a Combined MPG value.

effic_size_pivot = pivot_count(vehicles_2016, 'Fuel Efficiency',
                               'Engine Size', 'Combined MPG')


This is OK, but it would be faster to analyze visually. Let's create a heatmap that will color the magnitude of the counts and present us with a more intuitive view.

fig, ax = plt.subplots(figsize=(15,8))
sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Engine Size')
plt.title('Fuel Efficiency vs. Engine Size (2016) \n')


Just like we did earlier with our horizontal bar charts, we can easily filter by a different year and get a different perspective. For example, here's what this heatmap looks like for 1985.

effic_size_pivot = pivot_count(vehicles_1985, 'Fuel Efficiency',
                               'Engine Size', 'Combined MPG')

fig, ax = plt.subplots(figsize=(15,8))
sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Engine Size')
plt.title('Fuel Efficiency vs. Engine Size (1985) \n')


With these pivot heatmaps, we are not limited to just two variables. We can pass a list of variables for any of the axes (rows or columns), and it will display all the different combinations of values for those variables.

effic_size_category = pivot_count(vehicles_2016,
                                  ['Engine Size','Fuel Efficiency'],
                                  'Vehicle Category', 'Combined MPG')

fig, ax = plt.subplots(figsize=(20,10))
sns.heatmap(effic_size_category, annot=True, fmt='g')
ax.set(xlabel='\n Vehicle Category')
plt.title('Fuel Efficiency + Engine Size vs. Vehicle Category (2016) \n')


In this heatmap, we have Engine Size and Fuel Efficiency combinations represented by the rows, and we've added a third variable (the Vehicle Category) across the columns. So now we can see a finer level of detail about what types of cars had what size engines and what level of fuel efficiency last year.

As a final example for this section, let's create a pivot heatmap that plots Make against Vehicle Category for 2016. We saw earlier, in the bar chart that counted vehicles by manufacturer, that BMW made the largest number of specific models last year. This pivot heatmap will let us see how those counts are distributed across vehicle categories, giving us a better sense of each auto company's current offerings in terms of the breadth vs. depth of vehicle types they make.

effic_size_pivot = pivot_count(vehicles_2016, 'Make',
                               'Vehicle Category', 'Combined MPG')

fig, ax = plt.subplots(figsize=(20,20))
sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Vehicle Category')
plt.title('Make vs. Vehicle Category (2016) \n')


## Visualizing Changes Over Time

So far in this post, we've been looking at the data at given points in time. The next step is to take a look at how the data has changed over time. We can do this relatively easily by creating a multi_line function that accepts a data frame and x/y fields and then plots them on a multiline chart.

def multi_line(df, x, y):
    ax = df.groupby([x, y]).size().unstack(y).plot(figsize=(15,8), cmap="Set2")
    return ax


Let's use this function to visualize our vehicle categories over time. The resulting chart shows the number of vehicles in each category that were manufactured each year.

multi_line(vehicles, 'Year', 'Vehicle Category')
plt.xlabel('\n Year')
plt.title('Vehicle Categories Over Time \n')


We can see from the chart that Small Cars have generally dominated across the board and that there was a small decline in the late 90s that then started to pick up again in the early 2000s. We can also see the introduction and increase in popularity of SUVs starting in the late 90s, and the decline in popularity of trucks in recent years.

If we wanted to, we could zoom in and filter for specific manufacturers to see how their offerings have changed over the years. Since BMW had the largest number of vehicles last year and we saw in the pivot heatmap that those were mostly small cars, let's filter for just their vehicles to see whether they have always made a lot of small cars or if this is more of a recent phenomenon.

bmw = vehicles[vehicles['Make'] == 'BMW']

multi_line(bmw, 'Year', 'Vehicle Category')
plt.xlabel('\n Year')
plt.title('BMW Vehicle Categories Over Time \n')


We can see in the chart above that they started off making a reasonable number of small cars, and then seemed to ramp up production of those types of vehicles in the late 90s. We can contrast this with a company like Toyota, who started out making a lot of small cars back in the 1980s and then seemingly made a decision to gradually manufacture fewer of them over the years, focusing instead on SUVs, pickup trucks, and midsize cars.

toyota = vehicles[vehicles['Make'] == 'Toyota']

multi_line(toyota, 'Year', 'Vehicle Category')
plt.xlabel('\n Year')
plt.title('Toyota Vehicle Categories Over Time \n')


## Examining Relationships Between Variables

The final way we are going to explore our data in this post is by examining the relationships between numerical variables in our data. Doing this will provide us with better insight into which fields are highly correlated, what the nature of those correlations is, what typical combinations of numerical values exist in our data, and which combinations are anomalies.

For looking at relationships between variables, I often like to start with a scatter matrix because it gives me a bird's eye view of the relationships between all the numerical fields in my data set. With just a couple lines of code, we can not only create a scatter matrix, but we can also factor in a layer of color that can represent, for example, the clusters we generated at the end of the last post.

select_columns = ['Engine Displacement','Cylinders','Fuel Barrels/Year',
                  'City MPG','Highway MPG','Combined MPG',
                  'CO2 Emission Grams/Mile', 'Fuel Cost/Year', 'Cluster Name']

sns.pairplot(vehicles[select_columns], hue='Cluster Name', height=3)


From here, we can see that there are some strong positive linear relationships in our data, such as the correlations between the MPG fields, and also among the fuel cost, barrels, and CO2 emissions fields. There are also some hyperbolic relationships, particularly between the MPG fields and engine displacement, fuel cost, barrels, and emissions. Additionally, we can get a sense of the size of our clusters, how they are distributed, and the level of overlap we have between them.
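The visual impressions from the scatter matrix can also be double-checked numerically with a correlation matrix via pandas' DataFrame.corr. Here is a minimal sketch on synthetic data with the same kind of structure (the column names mirror our data set, but the values are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
city = rng.uniform(10, 40, 500)

# Highway MPG tracks City MPG linearly; fuel consumed per year is
# (roughly) inversely proportional to MPG, hence the hyperbolic shape
df = pd.DataFrame({
    'City MPG': city,
    'Highway MPG': city * 1.3 + rng.normal(0, 2, 500),
    'Fuel Barrels/Year': 500 / city + rng.normal(0, 1, 500),
})

# Pairwise Pearson correlations: strongly positive between the MPG
# columns, strongly negative between MPG and fuel consumption
corr = df.corr()
```

Note that Pearson correlation only measures linear association, so a hyperbolic relationship shows up as a strong but imperfect negative correlation.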

Once we have this high-level overview, we can zoom in on anything that we think looks interesting. For example, let's take a closer look at Engine Displacement plotted against Combined MPG.

sns.lmplot(x='Engine Displacement', y='Combined MPG', data=vehicles,
           hue='Cluster Name', height=8, fit_reg=False)


In addition to being able to see that there is a hyperbolic correlation between these two variables, we can see that our Small Very Efficient cluster resides in the upper left, followed by our Midsized Balanced cluster that looks smaller and more compact than the others. After that, we have our Large Moderately Efficient cluster and finally our Large Inefficient cluster on the bottom right.

We can also see that there are a few red points at the very top left and a few purple points at the very bottom right that we may want to investigate further to get a sense of what types of vehicles we are likely to see at the extremes. Try identifying some of those on your own by filtering the data set like we did earlier in the post. While you're at it, try creating additional scatter plots that zoom in on other numerical field combinations from the scatter matrix above. There are a bunch of other insights to be found in this data set, and all it takes is a little exploration!

## Conclusion

We have covered quite a bit in this post, and I hope I've provided you with some good examples of how, with just a few tools in your arsenal, you can embark on a robust insight-finding expedition and discover truly interesting things about your data. Now that you have some structure in your process and some tools for exploring data, you can let your creativity run wild a little and come up with filter, aggregate, pivot, and scatter combinations that are most interesting to you. Feel free to experiment and post any interesting insights you're able to find in the comments!

Also, make sure to stay tuned because in the next (and final) post of this series, I'm going to cover how to identify and think about the different networks that are present in your data and how to explore them using graph analytics. Click the subscribe button below and enter your email so that you receive a notification as soon as it's published!


### Book Memo: “Statistical Modeling for Degradation Data”

 This book focuses on the statistical aspects of the analysis of degradation data. In recent years, degradation data analysis has come to play an increasingly important role in different disciplines such as reliability, public health sciences, and finance. For example, information on products’ reliability can be obtained by analyzing degradation data. In addition, statistical modeling and inference techniques have been developed on the basis of different degradation measures. The book brings together experts engaged in statistical modeling and inference, presenting and discussing important recent advances in degradation data analysis and related applications. The topics covered are timely and have considerable potential to impact both statistics and reliability engineering.

### Sunday Morning Insight: LightOn Cloud: Light Based Technology for AI on the Cloud

As some of you may know, part of the reason I am a little less active on Nuit Blanche these days stems from being involved with LightOn. At LightOn, we build hardware that uses light to perform computations of interest to Machine Learning; in short, we bring light to AI.

Quite simply, we are building a hardware product that does random projections... for now. If you are a student of history or if you know the history of how technologies begin and thrive, you know it is essential for a technology to meet its eventual end users very early on.

At LightOn, we want to get as much feedback as possible from the Machine Learning community as early as possible. And so for the past year, we have been working on integrating our technology so that it can be accessible on the web.

Thanks to the OVH Labs program, we got one of our prototypes to run in a nearby data center. On December 20th, we had our first light and it was beautiful.

Since then we have been going through our Verification and Validation (V&V) program and started to run some algorithms on it. On Friday, we issued a press release on opening up our cloud to the Machine Learning community. If you want to be a beta user on our cluster, please register your interest here: https://goo.gl/6KDc26

Forward we go!

How to find us on the web?

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there!
Liked this entry? Subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

### Low power and the replication crisis: What have we learned since 2004 (or 1984, or 1964)?

I happened to run across this article from 2004, “The Persistence of Underpowered Studies in Psychological Research: Causes, Consequences, and Remedies,” by Scott Maxwell and published in the journal Psychological Methods.

In this article, Maxwell covers a lot of the material later discussed in the paper Power Failure by Button et al. (2013), and the 2014 paper on Type M and Type S errors by John Carlin and myself. Maxwell also points out that these alarms were raised repeatedly by earlier writers such as Cohen, Meehl, and Rozeboom, from the 1960s onwards.

In this post, I’ll first pull out some quotes from that 2004 paper that presage many of the non-replication issues we still wrestle with today. Then I’ll discuss what’s been happening since 2004: what’s new in our thinking in the past fifteen years.

I’ll argue that, yes, everyone should’ve been listening to Cohen, Meehl, Rozeboom, Maxwell, etc., all along; and also that we have been making some progress, that we have some new ideas that might help us move forward.

Part 1: They said it all before

Here’s a key quote from Maxwell (2004):

When power is low for any specific hypothesis but high for the collection of tests, researchers will usually be able to obtain statistically significant results, but which specific effects are statistically significant will tend to vary greatly from one sample to another, producing a pattern of apparent contradictions in the published literature.

I like this quote, as it goes beyond the usual framing in terms of “false positives” etc., to address the larger goals of a scientific research program.

Maxwell continues:

A researcher adopting such a strategy [focusing on statistically significant patterns in observed data] may have a reasonable probability of discovering apparent justification for recentering his or her article around a new finding. Unfortunately, however, this recentering may simply reflect sampling error . . . this strategy will inevitably produce positively biased estimates of effect sizes, accompanied by apparent 95% confidence intervals whose lower limit may fail to contain the value of the true population parameter 10% to 20% of the time.

He also slams deterministic reasoning:

The presence or absence of asterisks [indicating p-value thresholds] tends to convey an air of finality that an effect exists or does not exist . . .

And he mentions the “decline effect”:

Even a literal replication in a situation such as this would be expected to reveal smaller effect sizes than those originally reported. . . . the magnitude of effect sizes found in attempts to replicate can be much smaller than those originally reported, especially when the original research is based on small samples. . . . these smaller effect sizes might not even appear in the literature because attempts to replicate may result in nonsignificant results.

Classical multiple comparisons corrections won’t save you:

Some traditionalists might suggest that part of the problem . . . reflects capitalization on chance that could be reduced or even eliminated by requiring a statistically significant multivariate test. Figure 3 shows the result of adding this requirement. Although fewer studies will meet this additional criterion, the smaller subset of studies that would now presumably appear in the literature are even more biased . . .

This was a point raised a few years later by Vul et al. in their classic voodoo correlations paper.

Maxwell points out that meta-analysis of published summaries won’t solve the problem:

Including underpowered studies in meta-analyses leads to biased estimates of effect size whenever accessibility of studies depends at least in part on the presence of statistically significant results.
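To see the quoted point concretely, here is a small simulation (all numbers are made up for illustration): many underpowered studies estimate the same small effect, but only the statistically significant ones reach the literature, so a naive meta-analytic average of the published estimates is biased upward.

```python
import numpy as np

rng = np.random.RandomState(7)
true_effect, n_per_group, n_studies = 0.2, 20, 2000
se = np.sqrt(2.0 / n_per_group)   # std. error of a difference in means (sd = 1)

# Each study's effect estimate; only significant ones get published
estimates = rng.normal(true_effect, se, n_studies)
published = estimates[estimates / se > 1.96]

# The naive pooled mean of published estimates lands far above true_effect
naive_meta_mean = published.mean()
```

With these made-up numbers, the published-only average comes out several times the true effect, which is exactly the accessibility bias Maxwell describes.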

And this:

Unless psychologists begin to incorporate methods for increasing the power of their studies, the published literature is likely to contain a mixture of apparent results buzzing with confusion.

And the incentives:

Not only do underpowered studies lead to a confusing literature but they also create a literature that contains biased estimates of effect sizes. Furthermore . . . researchers may have felt little pressure to increase the power of their studies, because by testing multiple hypotheses, they often assured themselves of a reasonable probability of achieving a goal of obtaining at least one statistically significant result.

And he makes a point that I echoed many years later, regarding the importance of measurement and the naivety of researchers who think that the answer to all problems is to crank up the sample size:

Fortunately, an assumption that the only way to increase power is to increase sample size is almost always wrong. Psychologists are encouraged to familiarize themselves with additional methods for increasing power.

Part 2: (Some of) what’s new

So, Maxwell covered most of the ground in 2004. Here are a few things that I would add, from my standpoint nearly fifteen years later:

1. I think the concept of “statistical power” itself is a problem in that it implicitly treats the attainment of statistical significance as a goal. As Button et al. and others have discussed, low-power studies have a winner’s curse aspect, in that if you do a “power = 0.06” study and get lucky and find a statistically significant result, your estimate will be horribly exaggerated and likely in the wrong direction.

To put it another way, I fear that a typical well-intentioned researcher will want to avoid low-power studies—and, indeed, it’s trivial to talk yourself into thinking your study has high power, by just performing the power analysis using an overestimated effect size from the published literature—but will also think that a low-power study is essentially a roll of the dice. The implicit attitude is that in a study with, say, 10% power, you have a 10% chance of winning. But in such cases, a win is really a loss.
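A quick simulation makes the "a win is really a loss" point concrete (the effect size, sample size, and significance rule here are illustrative assumptions, not numbers from Maxwell): conditioning on statistical significance in a low-power study yields estimates several times larger than the true effect.

```python
import numpy as np

rng = np.random.RandomState(42)
true_effect, n, n_sims = 0.1, 30, 5000
se = np.sqrt(2.0 / n)   # std. error of the difference in means (sd = 1)

sig_estimates = []
for _ in range(n_sims):
    # Simulate a two-group study with a small true effect
    diff = rng.normal(true_effect, 1, n).mean() - rng.normal(0, 1, n).mean()
    if diff / se > 1.96:              # significant and in the right direction
        sig_estimates.append(diff)

# Among "winning" studies, the average estimate is several times too big
exaggeration = np.mean(sig_estimates) / true_effect
```

This is the Type M (magnitude) error in action: the rare significant results are precisely the ones whose estimates are grossly inflated.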

2. Variation in effects and context dependence. It’s not about identifying whether an effect is “true” or a “false positive.” Rather, let’s accept that in the human sciences there are no true zeroes, and relevant questions include the magnitude of effects, and how and where they vary. What I’m saying is: less “discovery,” more exploration and measurement.

3. Forking paths. If I were to rewrite Maxwell’s article today, I’d emphasize that the concern is not just multiple comparisons that have been performed, but also multiple potential comparisons. A researcher can walk through his or her data and only perform one or two analyses, but these analyses will be contingent on data, so that had the data been different, they would’ve been summarized differently. This allows the probability of finding statistical significance to approach 1, given just about any data (see, most notoriously, this story). In addition, I would emphasize that “researcher degrees of freedom” (in the words of Simmons, Nelson, and Simonsohn, 2011) arise not just in the choice of which of multiple coefficients to test in a regression, but also in which variables and interactions to include in the model, how to code data, and which data to exclude (see my above-linked paper with Loken for several examples).

4. Related to point 2 above is that some effects are really really small. We all know about ESP, but there are also other tiny effects being studied. An extreme example is the literature on sex ratios. At one point in his article, referring to a proposal that psychologists gather data on a sample of a million people, Maxwell writes, “Thankfully, samples this large are unnecessary even to detect minuscule effect sizes.” Actually, if you’re studying variation in the human sex ratio, that’s about the size of sample you’d actually need! For the calculation, see pages 645-646 of this paper.
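For a sense of the magnitudes involved, here is a back-of-the-envelope version of that calculation (my own illustrative numbers, using the standard normal-approximation sample-size formula for comparing two proportions, not the exact calculation in the cited paper):

```python
from math import ceil

def n_per_group(p, d, z_alpha=1.96, z_beta=0.8416):
    """Sample size per group to detect a difference d between two proportions
    around base rate p, at 5% significance with 80% power
    (normal approximation: n = (z_a + z_b)^2 * 2p(1-p) / d^2)."""
    return ceil((z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / d ** 2)

p = 0.485  # approximate proportion of girl births
for d in (0.01, 0.005, 0.003):
    n = n_per_group(p, d)
    print(f"difference of {d:.3f}: ~{n:,} births per group ({2 * n:,} total)")
```

For sex-ratio differences of a few tenths of a percentage point—the size actually seen in that literature—the totals do indeed run into the hundreds of thousands to about a million births.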

5. Flexible theories: The “goal of obtaining at least one statistically significant result” is only relevant because theories are so flexible that just about any comparison can be taken to be consistent with theory. Remember sociologist Jeremy Freese’s characterization of some hypotheses as “more vampirical than empirical—unable to be killed by mere evidence.”

6. Maxwell writes, “it would seem advisable to require that a priori power calculations be performed and reported routinely in empirical research.” Fine, but we can also do design analysis (our preferred replacement term for “power calculations”) after the data have come in and the analysis has been published. The purpose of a design calculation is not just to decide whether to do a study or to choose a sample size. It’s also to aid in interpretation of published results.
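A minimal sketch of such a design calculation, in the spirit of the Gelman and Carlin “retrodesign” idea (the numbers below are hypothetical):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def design_analysis(true_effect, se, z=1.96):
    """Given a plausible true effect and the standard error of an estimate,
    return the probability of statistical significance (power) and the
    probability that a significant estimate has the wrong sign (Type S)."""
    lam = true_effect / se
    p_right = 1 - phi(z - lam)  # significant, correct sign
    p_wrong = phi(-z - lam)     # significant, wrong sign
    power = p_right + p_wrong
    return power, p_wrong / power

# Hypothetical published result with se = 8.1, where external evidence
# suggests a true effect of only about 2:
power, type_s = design_analysis(true_effect=2, se=8.1)
print(f"power ~ {power:.3f}, Type S ~ {type_s:.3f}")
```

The point is that this calculation needs only the published standard error plus an externally justified effect size, so it can be run on a study after publication, as an interpretation aid.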

7. Measurement.

## Privacy

• In Germany, he points out, it starts out with the universal credit rating system known as a Schufa. Very much like its US counterpart FICO, Schufa is a private company that assesses the creditworthiness of about three-quarters of all Germans and over 5 million companies in the country. Anyone wanting to rent a house or take out a loan is required to produce their Schufa rating in Germany – or their FICO score in the US. Additionally, factors like “geo-scoring” can also lower your overall grade if you happen to live in a low-rent neighborhood, or even if a lot of your neighbors have bad credit ratings.

In other areas, German health insurers will offer you lower premiums if you don’t get sick as much. They may offer you even better premiums if you share data from your fitness-tracking device to show you’re doing your part to stay healthy. Anyone using websites like Amazon, eBay or Airbnb is asked to rate others and is rated themselves. Those who try to avoid being rated are looked at askance. An increasing number of consumers will be denied certain services or, say, mortgages if they don’t present some kind of rating.

## Tech

• We crack the Schufa! When closed algorithms affect your life, one solution is to reverse-engineer them!

When applying for a loan, a mobile phone contract, or even trying to rent an apartment in Germany, the Schufa score - Germany’s credit rating - is decisive. If you have a few “points” too few, your application is refused. (Computer says “no” to your new smartphone or apartment.) However, the calculation of these credit scores, done by the private Schufa company, is entirely opaque. The formula is a trade secret, and as such not open to the public.

We want to change this opacity with the OpenSCHUFA project. Together with AlgorithmWatch, we want to reconstruct the Schufa algorithm through reverse engineering.
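What “reconstructing” a closed scoring formula can look like in practice: collect donated (attributes, score) pairs from volunteers and fit an interpretable model to them. The sketch below invents both the features and the hidden weights purely for illustration; it is not the actual OpenSCHUFA methodology, and the real Schufa formula is unknown.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical donated attributes (all invented for this sketch)
X = np.column_stack([
    rng.integers(0, 2, n),   # recent missed payment? (0/1)
    rng.integers(1, 10, n),  # number of credit accounts
    rng.uniform(0, 1, n),    # neighborhood "geo-score"
])

# A hidden "formula" we pretend not to know, plus reporting noise
true_w = np.array([-120.0, -5.0, 80.0])
score = 600 + X @ true_w + rng.normal(0, 10, n)

# Ordinary least squares on the donated data recovers the hidden weights
design = np.column_stack([np.ones(n), X])
w, *_ = np.linalg.lstsq(design, score, rcond=None)
print("estimated intercept and weights:", np.round(w, 1))
```

With enough donated records, even simple regression exposes which attributes move the score and by roughly how much, which is the practical goal of a crowdsourced reverse-engineering campaign.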

• Facial Recognition Is Accurate, if You’re a White Guy. The newest iteration of this well-known problem.

Facial recognition technology is improving by leaps and bounds. Some commercial software can now tell the gender of a person in a photograph.

When the person in the photo is a white man, the software is right 99 percent of the time.

But the darker the skin, the more errors arise — up to nearly 35 percent for images of darker skinned women, according to a new study that breaks fresh ground by measuring how the technology works on people of different races and gender.

• Tired of texting? Google tests robot to chat with friends for you. At some point we'll just set our phones on autopilot mode and leave them on the table. Which is not necessarily bad.

Are you tired of the constant need to tap on a glass keyboard just to keep up with your friends? Do you wish a robot could free you of your constant communication obligations via WhatsApp, Facebook or text messages? Google is working on an AI-based auto-reply system to do just that.

• Someone Is Sending Amazon Sex Toys to Strangers. Amazon Has No Idea How to Stop It. Another example of the pattern: here is an algorithm, now let's fool it by feeding it input we control. Two relevant paragraphs:

Nikki’s story is part of a broiling internal mystery that is flummoxing Amazon, according to a source at the company: Someone is shipping out unsolicited products, frequently sex toys, to seemingly random customers, and the company does not yet know why they’re being purchased, and why they’re being shipped to people like Nikki.

[...]

Sources both in and out of Amazon have one theory. It’s called, in Amazon-speak, verified review hacking.

Amazon uses a review system that heavily weights “verified purchases”—reviews by users who have purchased a specific product through Amazon—over other reviews.

This could give sellers incentive to buy and ship their own products to strangers from dummy accounts. Those dummy accounts could then give the product a 5-star review and, in turn, help it surface higher in Amazon and Google searches.

Solutions to this have been proposed, however. On Dave Farber's IP list, this post appeared a few days ago:

The 'fix' is for Amazon to inject itself into the pipe in a way that the seller and vendor are unable to defeat. That is the 'card in the box.' Every order has an order number and Amazon can generate a URL through an Amazon URL shortening service that would indicate the order was fraudulent. If the recipient visits the URL you look for the review, if you find it you change it to "Fraudulent Reviewer" (you could remove it but public seller shaming is even better)
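A minimal sketch of how such an unforgeable per-order URL could work (hypothetical domain and scheme; an HMAC over the order number stands in for the proposed Amazon URL-shortening service):

```python
import hmac
import hashlib

SECRET = b"server-side secret"  # hypothetical key held only by the marketplace

def report_url(order_id: str) -> str:
    """Mint the 'card in the box' URL for one order. Because the token is an
    HMAC of the order id under a server-side key, the seller can neither
    forge nor predict a valid reporting URL."""
    token = hmac.new(SECRET, order_id.encode(), hashlib.sha256).hexdigest()[:16]
    return f"https://example.com/report/{order_id}/{token}"

def verify_report(order_id: str, token: str) -> bool:
    """Check a visited reporting URL; on success, the matching review
    could then be flagged as fraudulent."""
    expected = hmac.new(SECRET, order_id.encode(), hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(token, expected)

# Illustrative order id (made up):
url = report_url("112-0000000-0000000")
print(url)
```

The seller and the dummy buyer are the same party here, but neither can tamper with the insert, since only the marketplace holds the signing key.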

This is another example of how the information economy influences the goods economy. There, the marginal value of the bogus review can be computed in terms of the lifetime product sales it affects.

• AI-Moderators Fighting AI-Generated Porn Is the Harbinger of the Fake News Apocalypse. It begins with porn and will continue with everything else. I was talking to some friends in Madrid a few months ago about this topic, before deepfake porn was a thing, and we concluded that we are walking towards a society in which the only news you'll believe is the news aligned with what you already believe. It's going to be fun for all the wrong reasons.

This sounds great in theory, but as Wired points out, there are a few scenarios where deepfakes will slip through the cracks. If someone makes a deepfake of a private citizen—think vindictive exes or harassers scraping someone’s private Facebook page—and no images or videos of them appear publicly online, these algorithms won’t be able to find videos, and will categorize it as the original.

Data Links is a periodic blog post published on Sundays (specific time may vary) containing interesting links about data science, machine learning and related topics. You can subscribe to it using the general blog RSS feed or, if you are not interested in other things I might publish, this one, which only contains these articles.

Have you read an article you liked and would you like to suggest it for the next issue? Just contact me!

### Packages for Getting Started with Time Series Analysis in R

(This article was first published on R – Mathew Analytics, and kindly contributed to R-bloggers)

A. Motivation

During the recent RStudio Conference, an attendee asked the panel about the lack of support provided by the tidyverse for time series data. As someone who has spent the majority of their career on time series problems, I found this somewhat surprising because R already has a great suite of tools for visualizing, manipulating, and modeling time series data. I can understand the desire for a ‘tidyverse approved’ tool for time series analysis, but it seemed like perhaps the issue was a lack of familiarity with the available tooling. Therefore, I wanted to put together a list of the packages and tools that I use most frequently in my work. For those unfamiliar with time series analysis, this could be a good place to start investigating R’s current capabilities.

B. Background

Time series data refers to a sequence of measurements made over time at regular or irregular intervals, with each observation being a single dimension. An example of a low dimensional time series is the daily wind temperature from 01/01/2001 through 12/31/2005. A high dimensional time series is characterized by a larger number of observations, so an example could be the daily wind temperature from 01/01/1980 through 12/31/2010. In either case, the goal of the analysis could lead one to perform regression, clustering, forecasting, or even classification.

C. Data For Examples

To run the code in this post, you will need to download the following data through the Unix terminal. The command below fetches a CSV file from the City of Chicago website that contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to the present.

```shell
$ wget --no-check-certificate --progress=dot -O chicago_crime_data.csv "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"
```

Import the data into R and get the aggregate number of reported incidents of theft by day.

```r
library(data.table)
dat = fread("chicago_crime_data.csv")
colnames(dat) = gsub(" ", "_", tolower(colnames(dat)))
dat[, date2 := as.Date(date, format="%m/%d/%Y")]
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
mydat[1:6]
```

D. Data Representation

The first set of packages that one should be aware of relates to data storage. One could use data frames, tibbles, or data tables, but there are already a number of data structures that are optimized for representing time series data. The fundamental time series object is “ts”. However, the “ts” class has a number of limitations, so it is usually best to work with the extensible time series (“xts”) object.

D1. xts

The xts package offers a number of great tools for data manipulation and aggregation. At its core is the xts object, which is essentially a matrix that can represent time series data at different time increments. xts is a subclass of the zoo object, which provides it with a lot of functionality. Here are some functions in xts that are worth investigating:

```r
library(xts)

# create an xts object
mydat2 = as.xts(mydat)
mydat2
plot.xts(mydat2)

# filter by date
mydat2["2015"]               # 2015
mydat2["201501"]             # Jan 2015
mydat2["20150101/20150105"]  # Jan 01 to Jan 05, 2015

# replace all values from Aug 25 onwards with 0
mydat2["20170825/"] <- 0
mydat2["20170821/"]

# get the last one month
last(mydat2, "1 month")

# get stats by time frame
apply.monthly(mydat2, sum)
apply.monthly(mydat2, quantile)
period.apply(mydat2, endpoints(mydat2, on='months'), sum)
period.apply(mydat2, endpoints(mydat2, on='months'), quantile)
```

E. Dates

R has a maddening array of date and time classes.
Be it yearmon, POSIXct, POSIXlt, chron, or something else, each has specific strengths and weaknesses. In general, I find myself using the lubridate package, as it simplifies many of the complexities associated with date-times in R.

E1. lubridate

The lubridate package provides a lot of functionality for parsing and formatting dates, comparing different times, extracting the components of a date-time, and so forth.

```r
library(lubridate)
ymd("2010-01-01")
mdy("01-01-2010")
ymd_h("2010-01-01 10")
ymd_hm("2010-01-01 10:02")
ymd_hms("2010-01-01 10:02:30")
```

F. Time Series Regression

Distributed lag models (error correction models) are a core component of time series analysis. There are many instances where we want to regress an outcome variable at the current time against values of various regressors at current and previous times. dynlm and ardl (a wrapper for dynlm) are solid tools for this type of analysis. Another common task when working with distributed lag models involves using dynamic simulations to understand estimated outcomes in different scenarios. dynsim provides a coherent solution for simulating and visualizing those estimated values of the target variable.

F1. dynlm / ardl

Here is a brief example of how dynlm can be utilized. In what follows, I have created a new variable and lagged it by one day, so the model attempts to regress incidents of reported theft on the weather from the previous day.

```r
library(dynlm)
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
mydat[, weather := sample(c(20:90), nrow(mydat), replace=TRUE)]
mydat[, weather_lag := shift(weather, 1, type = 'lag')]
mydat2 = as.xts(mydat)  # dynlm's L() operator needs a zoo/ts-like object
mod = dynlm(N ~ L(weather), data = mydat2)
summary(mod)
```

F2. dynsim

Here is a brief example of how dynsim can be utilized. Building on the lagged weather variable created above, I've used dynsim to produce two dynamic simulations and plotted them.
```r
library(dynsim)
mydat3 = mydat[1:10000]
mod = lm(N ~ weather_lag, data = mydat3)
Scen1 <- data.frame(weather_lag = min(mydat3$weather_lag, na.rm=TRUE))
Scen2 <- data.frame(weather_lag = max(mydat3$weather_lag, na.rm=TRUE))
ScenComb <- list(Scen1, Scen2)
Sim1 <- dynsim(obj = mod, ldv = 'weather_lag', scen = ScenComb, n = 20)
dynsimGG(Sim1)
```

G. Forecasting

G1. forecast

The forecast package is the most widely used package in R for time series forecasting. It contains functions for performing decomposition and forecasting with exponential smoothing, ARIMA, moving average models, and so forth. For aggregated data that is fairly high dimensional, one of the techniques in this package should provide an adequate forecasting model, given that the assumptions hold. Here is a quick example of how to use the auto.arima function in R. In general, automatic forecasting tools should be used with caution, but they are a good place to start exploring time series data.

```r
library(forecast)
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
fit = auto.arima(mydat[,.(N)])
pred = forecast(fit, 200)
plot(pred)
```

G2. smooth

The smooth package provides functions to perform even more variations of exponential smoothing, moving average models, and various seasonal ARIMA techniques. The smooth and forecast packages are usually more than adequate for most forecasting problems that pertain to high dimensional data. Here is a basic example that uses the automatic complex exponential smoothing function:

```r
library(smooth)
mydat = dat[primary_type=="THEFT", .N, by=date2][order(date2)]
fit = auto.ces(mydat[,N])
pred = forecast(fit, 200)
plot(pred)
```

So for those of you getting introduced to the R programming language, this is a list of extremely useful packages for time series analysis that you will want to get some exposure to. If you have questions, comments, interesting consulting projects, or work that needs doing, feel free to contact me at mathewanalytics@gmail.com.

To leave a comment for the author, please follow the link and comment on their blog: R – Mathew Analytics.
### Magister Dixit

“Indeed, in neural networks, we almost always choose our model as the output of running stochastic gradient descent. Appealing to linear models, we analyze how SGD acts as an implicit regularizer. For linear models, SGD always converges to a solution with small norm. Hence, the algorithm itself is implicitly regularizing the solution.” Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals (10 Nov 2016)

### Document worth reading: “How deep learning works – The geometry of deep learning”

Why and how deep learning works well on different tasks remains a mystery from a theoretical perspective. In this paper we draw a geometric picture of the deep learning system by finding its analogies with two existing geometric structures, the geometry of quantum computations and the geometry of diffeomorphic template matching. In this framework, we give the geometric structures of different deep learning systems, including convolutional neural networks, residual networks, recursive neural networks, recurrent neural networks and the equilibrium propagation framework. We can also analyze the relationship between the geometric structures and the performance of different networks at an algorithmic level, so that the geometric framework may guide the design of the structures and algorithms of deep learning systems. How deep learning works – The geometry of deep learning

### Posters: SysML 2018 Conference

The SysML 2018 Conference is currently underway at Stanford; while the live stream is over, the poster session is taking place.
Here are the presentations of each poster:

Session I: 4:30pm - 6:00pm

1-1 A SIMD-MIMD Acceleration with Access-Execute Decoupling for Generative Adversarial Networks (Amir Yazdanbakhsh, Kambiz Samadi, Hadi Esmaeilzadeh, Nam Sung Kim)
1-2 Slice Finder: Automated Data Slicing for Model Interpretability (Yeounoh Chung, Tim Kraska, Steven Euijong Whang, Neoklis Polyzotis)
1-3 Data Infrastructure for Machine Learning (Eric Breck, Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, Martin Zinkevich)
1-4 Speeding up ImageNet Training on Supercomputers (Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, Kurt Keutzer)
1-5 Aloha: A Machine Learning Framework for Engineers (Ryan M Deak, Jonathan H Morra)
1-6 Parameter Hub: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training (Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy)
1-7 Stitch-X: An Accelerator Architecture for Exploiting Unstructured Sparsity in Deep Neural Networks (Ching-En Lee, Yakun Sophia Shao, Jie-Fang Zhang, Angshuman Parashar, Joel Emer, Stephen W. Keckler, Zhengya Zhang)
1-8 DeepVizdom: Deep Interactive Data Exploration (Carsten Binnig, Kristian Kersting, Alejandro Molina, Emanuel Zgraggen)
1-9 Massively Parallel Video Networks (João Carreira, Viorica Pătrăucean, Andrew Zisserman, Simon Osindero)
1-10 EVA: An Efficient System for Exploratory Video Analysis (Ziqiang Feng, Junjue Wang, Jan Harkes, Padmanabhan Pillai, Mahadev Satyanarayanan)
1-11 Declarative Metadata Management: A Missing Piece in End-To-End Machine Learning (Sebastian Schelter, Joos-Hendrik Böse, Johannes Kirschnick, Thoralf Klein, Stephan Seufert)
1-12 Runway: machine learning model experiment management tool (Jason Tsay, Todd Mummert, Norman Bobroff, Alan Braz, Peter Westerink, Martin Hirzel)
1-13 STRADS-AP: Simplifying Distributed Machine Learning Programming (Jin Kyu Kim, Garth A. Gibson, Eric P. Xing)
1-14 A Deeper Look at FFT and Winograd Convolutions (Aleksandar Zlateski, Zhen Jia, Kai Li, Fredo Durand)
1-15 Efficient Deep Learning Inference on Edge Devices (Ziheng Jiang, Tianqi Chen, Mu Li)
1-16 On Human Intellect and Machine Failures: Troubleshooting Integrative Machine Learning Systems (Besmira Nushi, Ece Kamar, Eric Horvitz, Donald Kossmann)
1-17 DeepThin: A Self-Compressing Library for Deep Neural Networks (Matthew Sotoudeh, Sara S. Baghsorkhi)
1-18 MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Programmable Interconnects (Hyoukjun Kwon, Ananda Samajdar, Tushar Krishna)
1-19 On Machine Learning and Programming Languages (Mike Innes, Stefan Karpinski, Viral Shah, David Barber, Pontus Stenetorp, Tim Besard, James Bradbury, Valentin Churavy, Simon Danisch, Alan Edelman, Jon Malmaud, Jarrett Revels, Deniz Yuret)
1-20 "I Like the Way You Think!" - Inspecting the Internal Logic of Recurrent Neural Networks (Thibault Sellam, Kevin Lin, Ian Yiran Huang, Carl Vondrick, Eugene Wu)
1-21 Automatic Differentiation in Myia (Olivier Breuleux, Bart van Merriënboer)
1-22 TFX Frontend: A Graphical User Interface for a Production-Scale Machine Learning Platform (Peter Brandt, Josh Cai, Tommie Gannert, Pushkar Joshi, Rohan Khot, Chiu Yuen Koo, Chenkai Kuang, Sammy Leong, Clemens Mewald, Neoklis Polyzotis, Herve Quiroz, Sudip Roy, Po-Feng Yang, James Wexler, Steven Euijong Whang)
1-23 Learned Index Structures (Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis)
1-24 Towards Optimal Winograd Convolution on Manycores (Zhen Jia, Aleksandar Zlateski, Fredo Durand, Kai Li)
1-25 Mobile Machine Learning Hardware at ARM: A Systems-on-Chip (SoC) Perspective (Yuhao Zhu, Matthew Mattina, Paul Whatmough)
1-26 Deep Learning with Apache SystemML (Niketan Pansare, Michael Dusenberry, Nakul Jindal, Matthias Boehm, Berthold Reinwald, Prithviraj Sen)
1-27 Scalable Language Modeling: WikiText-103 on a Single GPU in 12 hours (Stephen Merity, Nitish Shirish Keskar, James Bradbury, Richard Socher)
1-28 PipeDream: Pipeline Parallelism for DNN Training (Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Gregory R. Ganger, Phillip B. Gibbons)
1-29 Efficient Mergeable Quantile Sketches using Moments (Edward Gan, Jialin Ding, Peter Bailis)
1-30 Systems Optimizations for Learning Certifiably Optimal Rule Lists (Nicholas Larus-Stone, Elaine Angelino, Daniel Alabi, Margo Seltzer, Vassilios Kaxiras, Aditya Saligrama, Cynthia Rudin)
1-31 Accelerating Model Search with Model Batching (Deepak Narayanan, Keshav Santhanam, Matei Zaharia)
1-32 Programming Language Support for Natural Language Interaction (Alex Renda, Harrison Goldstein, Sarah Bird, Chris Quirk, Adrian Sampson)
1-33 Factorized Deep Retrieval and Distributed TensorFlow Serving (Xinyang Yi, Yi-Fan Chen, Sukriti Ramesh, Vinu Rajashekhar, Lichan Hong, Noah Fiedel, Nandini Seshadri, Lukasz Heldt, Xiang Wu, Ed H. Chi)
1-34 Relaxed Pruning: Memory-Efficient LSTM Inference Engine by Limiting the Synaptic Connection Patterns (Jaeha Kung, Junki Park, Jae-Joon Kim)
1-35 Deploying Deep Ranking Models for Search Verticals (Rohan Ramanath, Gungor Polatkan, Liqin Xu, Harold Lee, Bo Hu, Shan Zhou)
1-36 Understanding the Error Structure as a Key to Regularize Convolutional Neural Networks (Bilal Alsallakh, Amin Jourabloo, Mao Ye, Xiaoming Liu, Liu Ren)
1-37 On Scale-out Deep Learning Training for Cloud and HPC (Srinivas Sridharan, Karthikeyan Vaidyanathan, Dhiraj Kalamkar, Dipankar Das, Mikhail E. Smorkalov, Mikhail Shiryaev, Dheevatsa Mudigere, Naveen Mellempudi, Sasikanth Avancha, Bharat Kaul, Pradeep Dubey)
1-38 In-network Neural Networks (Giuseppe Siracusano, Roberto Bifulco)
1-39 Compressing Deep Neural Networks with Probabilistic Data Structures (Brandon Reagen, Udit Gupta, Robert Adolf, Michael M. Mitzenmacher, Alexander M. Rush, Gu-Yeon Wei, David Brooks)
1-40 Greenhouse: A Zero-Positive Machine Learning System for Time-Series Anomaly Detection (Tae Jun Lee, Justin Gottschlich, Nesime Tatbul, Eric Metcalf, Stan Zdonik)
1-41 Precision and Recall for Range-Based Anomaly Detection (Tae Jun Lee, Justin Gottschlich, Nesime Tatbul, Eric Metcalf, Stan Zdonik)
1-42 Whetstone: An accessible, platform-independent method for training spiking deep neural networks for neuromorphic processors (William M. Severa, Craig M. Vineyard, Ryan Dellana, James B. Aimone)
1-43 SparseCore: An Accelerator for Structurally Sparse CNNs (Sharad Chole, Ramteja Tadishetti, Sree Reddy)
1-44 SGD on Random Mixtures: Private Machine Learning under Data Breach Threats (Kangwook Lee, Kyungmin Lee, Hoon Kim, Changho Suh, Kannan Ramchandran)
1-45 Towards High-Performance Prediction Serving Systems (Yunseong Lee, Alberto Scolari, Matteo Interlandi, Markus Weimer, Byung-Gon Chun)
1-46 Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization (Fabian Pedregosa, Rémi Leblond, Simon Lacoste-Julien)
1-47 Corpus Conversion Service: A machine learning platform to ingest documents at scale (Peter W J Staar, Michele Dolfi, Christoph Auer, Costas Bekas)
1-48 Representation Learning for Resource Usage Prediction (Florian Schmidt, Mathias Niepert, Felipe Huici)
1-49 TVM: End-to-End Compilation Stack for Deep Learning (Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy)
1-50 vectorflow: a minimalist neural-network library (Benoît Rostykus, Yves Raimond)
1-51 Learning Heterogeneous Cloud Storage Configuration for Data Analytics (Ana Klimovic, Heiner Litz, Christos Kozyrakis)
1-52 Salus: Fine-Grained GPU Sharing Among CNN Applications (Peifeng Yu, Mosharaf Chowdhury)
1-53 OpenCL Acceleration for TensorFlow (Mehdi Goli, Luke Iwanski, John Lawson, Uwe Dolinsky, Andrew Richards)
1-54 Picking Interesting Frames in Streaming Video (Christopher Canel, Thomas Kim, Giulio Zhou, Conglong Li, Hyeontaek Lim, David G. Andersen, Michael Kaminsky, Subramanya R. Dulloor)
1-55 SLAQ: Quality-Driven Scheduling for Distributed Machine Learning (Haoyu Zhang, Logan Stafman, Andrew Or, Michael J. Freedman)
1-56 A Comparison of Bottom-Up Approaches to Grounding for Templated Markov Random Fields (Eriq Augustine, Lise Getoor)
1-57 Growing Cache Friendly Decision Trees (Niloy Gupta, Adam Johnston)
1-58 Parallelizing Hyperband for Large-Scale Tuning (Lisha Li, Kevin Jamieson, Afshin Rostamizadeh, Ameet Talwalkar)
1-59 Towards Interactive Curation and Automatic Tuning of ML Pipelines (Carsten Binnig, Benedetto Buratti, Yeounoh Chung, Cyrus Cousins, Dylan Ebert, Tim Kraska, Zeyuan Shang, Isabella Tromba, Eli Upfal, Linnan Wang, Robert Zeleznik, Emanuel Zgraggen)

Session II: 6:00pm - 7:30pm

2-1 Ternary Residual Networks (Abhisek Kundu, Kunal Banerjee, Naveen Mellempudi, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, Pradeep Dubey)
2-2 Neural Architect: A Multi-objective Neural Architecture Search with Performance Prediction (Yanqi Zhou, Gregory Diamos)
2-3 Federated Kernelized Multi-Task Learning (Sebastian Caldas, Virginia Smith, Ameet Talwalkar)
2-4 Materialization Trade-offs for Feature Transfer from Deep CNNs for Multimodal Data Analytics (Supun Nakandala, Arun Kumar)
2-5 Scaling HDBSCAN Clustering with kNN Graph Approximation (Jacob Jackson, Aurick Qiao, Eric P. Xing)
2-6 BlazeIt: An Optimizing Query Engine for Video at Scale (Daniel Kang, Peter Bailis, Matei Zaharia)
2-7 Time Travel based Feature Generation (Kedar Sadekar, Hua Jiang)
2-8 Controlling AI Engines in Dynamic Environments (Nikita Mishra, Connor Imes, Henry Hoffmann, John D. Lafferty)
2-9 Intermittent Deep Neural Network Inference (Graham Gobieski, Nathan Beckmann, Brandon Lucia)
2-10 CascadeCNN: Pushing the performance limits of quantisation (Alexandros Kouris, Stylianos I. Venieris, Christos-Savvas Bouganis)
2-11 Making Machine Learning Easy with Embeddings (Dan Shiebler, Abhishek Tayal)
2-12 CrossBow: Scaling Deep Learning on Multi-GPU Servers (Alexandros Koliousis, Pijika Watcharapichat, Matthias Weidlich, Paolo Costa, Peter Pietzuch)
2-13 Better Caching with Machine Learned Advice (Thodoris Lykouris, Sergei Vassilvitskii)
2-14 Large Model Support for Deep Learning in Caffe and Chainer (Minsik Cho, Tung D. Le, Ulrich A. Finkler, Haruiki Imai, Yasushi Negishi, Taro Sekiyama, Saritha Vinod, Vladimir Zolotov, Kiyokuni Kawachiya, David S. Kung, Hillery C. Hunter)
2-15 Learning Graph-based Cluster Scheduling Algorithms (Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Mohammad Alizadeh)
2-16 Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning (Scott Cyphers, Arjun K. Bansal, Anahita Bhiwandiwalla, Jayaram Bobba, Matthew Brookhart, Avijit Chakraborty, Will Constable, Christian Convey, Leona Cook, Omar Kanawi, Robert Kimball, Jason Knight, Nikolay Korovaiko, Varun Kumar, Yixing Lao, Christopher R. Lishka, Jaikrishnan Menon, Jennifer Myers, Sandeep Aswath Narayana, Adam Procter, Tristan J. Webb)
2-17 Efficient Multi-Tenant Inference on Video using Microclassifiers (Giulio Zhou, Thomas Kim, Christopher Canel, Conglong Li, Hyeontaek Lim, David G. Andersen, Michael Kaminsky, Subramanya R. Dulloor)
2-18 Abstractions for Containerized Machine Learning Workloads in the Cloud (Balaji Subramaniam, Niklas Nielsen, Connor Doyle, Ajay Deshpande, Jason Knight, Scott Leishman)
2-19 Not All Ops Are Created Equal! (Liangzhen Lai, Naveen Suda, Vikas Chandra)
2-20 Robust Gradient Descent via Moment Encoding with LDPC Codes (Raj Kumar Maity, Ankit Singh Rawat, Arya Mazumdar)
2-21 Buzzsaw: A System for High Speed Feature Engineering (Andrew Stanton, Liangjie Hong, Manju Rajashekhar)
2-22 Predicate Optimization for a Visual Analytics Database (Michael R. Anderson, Michael Cafarella, Thomas F. Wenisch, German Ros)
2-23 Understanding the Limitations of Current Energy-Efficient Design Approaches for Deep Neural Networks (Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, Vivienne Sze)
2-24 Compiling machine learning programs via high-level tracing (Roy Frostig, Matthew James Johnson, Chris Leary)
2-25 Dynamic Stem-Sharing for Multi-Tenant Video Processing (Angela Jiang, Christopher Canel, Daniel Wong, Michael Kaminsky, Michael A. Kozuch, Padmanabhan Pillai, David G. Andersen, Gregory R. Ganger)
2-26 A Hierarchical Model for Device Placement (Azalia Mirhoseini, Anna Goldie, Hieu Pham, Benoit Steiner, Quoc V. Le, Jeff Dean)
2-27 Blink: A fast NVLink-based collective communication library (Guanhua Wang, Amar Phanishayee, Shivaram Venkataraman, Ion Stoica)
2-28 TOP: A Compiler-Based Framework for Optimizing Machine Learning Algorithms through Generalized Triangle Inequality (Yufei Ding, Lin Ning, Hui Guang, Xipeng Shen, Madanlal Musuvathi, Todd Mytkowicz)
2-29 UberShuffle: Communication-efficient Data Shuffling for SGD via Coding Theory (Jichan Chung, Kangwook Lee, Ramtin Pedarsani, Dimitris Papailiopoulos, Kannan Ramchandran)
2-30 Toward Scalable Verification for Safety-Critical Deep Networks (Lindsey Kuper, Guy Katz, Justin Gottschlich, Kyle Julian, Clark Barrett, Mykel J. Kochenderfer)
2-31 DAWNBench: An End-to-End Deep Learning Benchmark and Competition (Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, Matei Zaharia)
2-32 Learning Network Size While Training with ShrinkNets (Guillaume Leclerc, Raul Castro Fernandez, Samuel Madden)
2-33 Have a Larger Cake and Eat It Faster Too: A Guideline to Train Larger Models Faster (Newsha Ardalani, Joel Hestness, Gregory Diamos)
2-34 Retrieval as a defense mechanism against adversarial examples in convolutional neural networks (Junbo Zhao, Jinyang Li, Kyunghyun Cho)
2-35 DNN-Train: Benchmarking and Analyzing Deep Neural Network Training (Hongyu Zhu, Bojian Zheng, Bianca Schroeder, Gennady Pekhimenko, Amar Phanishayee)
2-36 High Accuracy SGD Using Low-Precision Arithmetic and Variance Reduction (for Linear Models) (Alana Marzoev, Christopher De Sa)
2-37 SkipNet: Learning Dynamic Routing in Convolutional Networks (Xin Wang, Fisher Yu, Zi-Yi Dou, Joseph E. Gonzalez)
2-38 Memory-Efficient Data Structures for Learning and Prediction (Damian Eads, Paul Baines, Joshua S. Bloom)
2-39 Efficient and Programmable Machine Learning on Distributed Shared Memory via Static Analysis (Jinliang Wei, Garth A. Gibson, Eric P. Xing)
2-40 Parle: parallelizing stochastic gradient descent (Pratik Chaudhari, Carlo Baldassi, Riccardo Zecchina, Stefano Soatto, Ameet Talwalkar, Adam Oberman)
2-41 Optimal Message Scheduling for Aggregation (Leyuan Wang, Mu Li, Edo Liberty, Alex J. Smola)
2-42 Analog electronic deep networks for fast and efficient inference (Jonathan Binas, Daniel Neil, Giacomo Indiveri, Shih-Chii Liu, Michael Pfeiffer)
2-43 Network Evolution for DNNs (Michael Alan Chang, Aurojit Panda, Domenic Bottini, Lisa Jian, Pranay Kumar, Scott Shenker)
2-44 BinaryCmd: Keyword Spotting with deterministic binary basis (Javier Fernández-Marqués, Vincent W.-S. Tseng, Sourav Bhattachara, Nicholas D. Lane)
2-45 YellowFin: Adaptive Optimization for (A)synchronous Systems (Jian Zhang, Ioannis Mitliagkas)
2-46 GPU-acceleration for Large-scale Tree Boosting (Huan Zhang, Si Si, Cho-Jui Hsieh)
2-47 Treelite: toolbox for decision tree deployment (Hyunsu Cho, Mu Li)
2-48 On Importance of Execution Ordering in Graph-Based Distributed Machine Learning Systems (Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, Roy Campbell)
2-49 Draco: Robust Distributed Training against Adversaries (Lingjiao Chen, Hongyi Wang, Dimitris Papailiopoulos)
2-50 Clustering System Data using Aggregate Measures (Johnnie C-N. Chang, Robert H-J. Chen, Jay Pujara, Lise Getoor)
2-51 A Framework for Searching a Predictive Model (Yoshiki Takahashi, Masato Asahara, Kazuyuki Shudo)
2-52 Distributed Placement of Machine Learning Operators for IoT applications spanning Edge and Cloud Resources (Tarek Elgamal, Atul Sandur, Klara Nahrstedt, Gul Agha)
2-53 Finding Heavily-Weighted Features with the Weight-Median Sketch (Kai Sheng Tai, Vatsal Sharan, Peter Bailis, Gregory Valiant)
2-54 Flexible Primitives for Distributed Deep Learning in Ray (Yaroslav Bulatov, Robert Nishihara, Philipp Moritz, Melih Elibol, Ion Stoica, Michael I. Jordan)
2-55 BLAS-on-flash: an alternative for training large ML models? (Suhas Jayaram Subramanya, Srajan Garg, Harsha Vardhan Simhadri)
2-56 Treating Machine Learning Algorithms As Declaratively Specified Circuits (Jason Eisner, Nathaniel Wesley Filardo)
2-57 Tasvir: Distributed Shared Memory for Machine Learning (Amin Tootoonchian, Aurojit Panda, Aida Nematzadeh, Scott Shenker)

Rest of the program:

• 9:00 am - 9:15 am Opening Remarks: Ameet Talwalkar
• Session I (moderator: Virginia Smith)
• 9:15 am - 9:55 am Invited talk: Michael I. Jordan
• 9:55 am - 10:05 am Contributed talk: TVM: End-to-End Compilation Stack for Deep Learning, Tianqi Chen
• 10:05 am - 10:15 am Contributed talk: Robust Gradient Descent via Moment Encoding with LDPC Codes, Arya Mazumdar
• 10:15 am - 10:25 am Contributed talk: Analog electronic deep networks for fast and efficient inference, Jonathan Binas
• 10:25 am - 10:50 am Coffee Break
• Session II (moderator: Virginia Smith)
• 10:50 am - 11:30 am Invited talk: Hardware for Deep Learning, Bill Dally
• 11:30 am - 11:40 am Contributed talk: YellowFin: Adaptive Optimization for (A)synchronous Systems, Ioannis Mitliagkas
• 11:40 am - 12:20 pm Invited talk: Security, Privacy, and Democratization: Challenges & Future Directions for ML Systems beyond Scalability, Dawn Song
• 12:20 pm - 1:30 pm Lunch
• Session III (moderator: Sarah Bird)
• 1:30 pm - 2:10 pm Invited talk: Structured ML: Opportunities and Challenges for the SysML Community, Lise Getoor
• 2:10 pm - 2:20 pm Contributed talk: Understanding the Limitations of Current Energy-Efficient Design Approaches for Deep Neural Networks, Vivienne Sze
• 2:20 pm - 2:30 pm Contributed talk: Towards High-Performance Prediction Serving Systems, Matteo Interlandi
• 2:30 pm - 2:55 pm Coffee Break
• Session IV (moderator: Sarah Bird)
• 2:55 pm - 3:05 pm Contributed talk: "I Like the Way You Think!" - Inspecting the Internal Logic of Recurrent Neural Networks, Thibault Sellam
• 3:05 pm - 3:45 pm Invited talk: Systems and Machine Learning Symbiosis, Jeff Dean
• 3:45 pm - 4:00 pm Closing Remarks: Matei Zaharia

Liked this entry? Subscribe to Nuit Blanche's feed, there's more where that came from.
Continue Reading…

### R Packages worth a look

Regularized Greedy Forest (RGF)
Regularized Greedy Forest wrapper of the ‘Regularized Greedy Forest’ <https://…/rgf_python> ‘python’ package, which also includes a multi-core implementation (FastRGF) <https://…/fast_rgf>.

Optimal Classification Roll Call Analysis Software (oc)
Estimates optimal classification (Poole 2000) <doi:10.1093/oxfordjournals.pan.a029814> scores from roll call votes supplied through a ‘rollcall’ object from package ‘pscl’.

Comprehensive Library for Working with Missing (NA) Values in Vectors (na.tools)
This comprehensive toolkit provides a consistent and extensible framework for working with missing values in vectors. The companion package ‘tidyimpute’ provides similar functionality for list-like and table-like structures. Functions exist for detection, removal, replacement, imputation, recollection, etc. of ‘NA’s.

Continue Reading…

## February 17, 2018

### Book Memo: “Agile Data Science 2.0”

Building Full-Stack Data Analytics Applications with Spark
Data science teams looking to turn research into useful analytics applications require not only the right tools, but also the right approach if they’re to succeed. With the revised second edition of this hands-on guide, up-and-coming data scientists will learn how to use the Agile Data Science development methodology to build data applications with Python, Apache Spark, Kafka, and other tools. Author Russell Jurney demonstrates how to compose a data platform for building, deploying, and refining analytics applications with Apache Kafka, MongoDB, ElasticSearch, d3.js, scikit-learn, and Apache Airflow. You’ll learn an iterative approach that lets you quickly change the kind of analysis you’re doing, depending on what the data is telling you. Publish data science work as a web application, and affect meaningful change in your organization.

Continue Reading…

### R Tip: Use qc() For Fast Legible Quoting

Here is an R tip.
Need to quote a lot of names at once? Use qc(). This is particularly useful for selecting columns from data.frames:

```r
library("wrapr")  # get qc() definition

head(mtcars[, qc(mpg, cyl, wt)])
#                    mpg cyl    wt
# Mazda RX4         21.0   6 2.620
# Mazda RX4 Wag     21.0   6 2.875
# Datsun 710        22.8   4 2.320
# Hornet 4 Drive    21.4   6 3.215
# Hornet Sportabout 18.7   8 3.440
# Valiant           18.1   6 3.460
```

Or even to install many packages at once:

```r
install.packages(qc(vtreat, cdata, WVPlots))
# shorter than the alternative:
# install.packages(c("vtreat", "cdata", "WVPlots"))
```

Continue Reading…

### Saturday Morning Videos: IPAM Workshop on New Deep Learning Techniques

Yann mentioned it on his twitter feed: the videos and slides of the IPAM workshop on New Deep Learning Techniques are out. Enjoy!

- Samuel Bowman (New York University): Toward natural language semantics in learned representations
- Emily Fox (University of Washington): Interpretable and Sparse Neural Network Time Series Models for Granger Causality Discovery
- Ellie Pavlick (University of Pennsylvania): Should we care about linguistics?
- Leonidas Guibas (Stanford University): Knowledge Transport Over Visual Data
- Yann LeCun (New York University): Public Lecture: Deep Learning and the Future of Artificial Intelligence
- Alán Aspuru-Guzik (Harvard University): Generative models for the inverse design of molecules and materials
- Daniel Rueckert (Imperial College): Deep learning in medical imaging: Techniques for image reconstruction, super-resolution and segmentation
- Kyle Cranmer (New York University): Deep Learning in the Physical Sciences
- Stéphane Mallat (École Normale Supérieure): Deep Generative Networks as Inverse Problems
- Michael Elad (Technion - Israel Institute of Technology): Sparse Modeling in Image Processing and Deep Learning
- Yann LeCun (New York University): Public Lecture: AI Breakthroughs & Obstacles to Progress, Mathematical and Otherwise
- Xavier Bresson (Nanyang Technological University, Singapore): Convolutional Neural Networks on Graphs
- Federico Monti (Universita della Svizzera Italiana): Deep Geometric Matrix Completion: a Geometric Deep Learning approach to Recommender Systems
- Joan Bruna (New York University): On Computational Hardness with Graph Neural Networks
- Jure Leskovec (Stanford University): Large-scale Graph Representation Learning
- Arthur Szlam (Facebook): Composable planning with attributes
- Yann LeCun (New York University): A Few (More) Approaches to Unsupervised Learning
- Sanja Fidler (University of Toronto): Teaching Machines with Humans in the Loop
- Raquel Urtasun (University of Toronto): Deep Learning for Self-Driving Cars
- Pratik Chaudhari (University of California, Los Angeles (UCLA)): Unraveling the mysteries of stochastic gradient descent on deep networks
- Stefano Soatto (University of California, Los Angeles (UCLA)): Emergence Theory of Deep Learning
- Tom Goldstein (University of Maryland): What do neural net loss functions look like?
- Stanley Osher (University of California, Los Angeles (UCLA)): New Techniques in Optimization and Their Applications to Deep Learning and Related Inverse Problems
- Michael Bronstein (USI Lugano, Switzerland): Deep functional maps: intrinsic structured prediction for dense shape correspondence
- Sainbayar Sukhbaatar (New York University): Deep Architecture for Sets and Its Application to Multi-agent Communication
- Zuowei Shen (National University of Singapore): Deep Learning: Approximation of functions by composition
- Wei Zhu (Duke University): LDMnet: low dimensional manifold regularized neural networks

Continue Reading…

### If you did not already know

Graph Neural Network (GNN)
Many underlying relationships among data in several areas of science and engineering, e.g., computer vision, molecular chemistry, molecular biology, pattern recognition, and data mining, can be represented in terms of graphs. In this paper, we propose a new neural network model, called the graph neural network (GNN) model, that extends existing neural network methods for processing the data represented in graph domains. This GNN model, which can directly process most of the practically useful types of graphs, e.g., acyclic, cyclic, directed, and undirected, implements a function τ(G, n) ∈ ℝ^m that maps a graph G and one of its nodes n into an m-dimensional Euclidean space. A supervised learning algorithm is derived to estimate the parameters of the proposed GNN model. The computational cost of the proposed algorithm is also considered.
Some experimental results are shown to validate the proposed learning algorithm, and to demonstrate its generalization capabilities. …

Leave-p-Out Cross Validation (LpOCV)
As the name suggests, leave-p-out cross-validation (LpO CV) involves using p observations as the validation set and the remaining observations as the training set. This is repeated for every way to split the original sample into a validation set of p observations and a training set. LpO cross-validation therefore requires learning and validating the model C(n, p) = n!/(p!(n-p)!) times (where n is the number of observations in the original sample), so as soon as n is at all big it becomes impossible to compute. …

Apache Tika
The Apache Tika toolkit detects and extracts metadata and text content from various documents – from PPT to CSV to PDF – using existing parser libraries. Tika unifies these parsers under a single interface to allow you to easily parse over a thousand different file types. Tika is useful for search engine indexing, content analysis, translation, and much more. …

Continue Reading…

### Workers Should Have Their Fingers Crossed for a Market Downturn

Who cares if the stock market tanks? No, really. I’m wondering who actually has a stake in the levels of the stock market. The average person doesn’t have much savings, including retirement savings, which is the standard way to have a direct stake in the market. In fact a majority of Americans, and more than that if you consider minorities, have less than $1,000 put away for retirement. They might care about the few hundred dollars they have, but there’s really not much directly at stake, and it’s a long-term, abstract investment if it even exists.

For that matter, truly rich people have investment advisors who diversify their positions with bonds, hedge funds, and so on to make their bets more market-neutral. Plus, they have plenty of assets, so a modest market drop won’t overly concern them.

That leaves the well-off-but-not-rich people who are sufficiently long the market to care what it does, and even their stake is mostly via retirement savings. I’m not sure what share of the population they represent, but it’s fair to say the average member of the population doesn’t really care about a market fall.

It’s been a long time since the market was a good proxy for the economy as a whole. The thinking used to be that if corporations made more money, at least if it came from higher productivity, then some portion of that would be distributed to workers. But productivity decoupled from the median wage long ago.

In fact, it’s become just the opposite: good news for workers means bad news for the market. That became clear recently when a substantial rise in wages led to a drop in the market. The argument went something like this: higher wages will cause inflation, and then interest rates, yadda yadda, but the bottom line is that shareholders have gotten used to keeping all the corporate profits.

In fact, this anti-correlation between the market and worker interests has been true for quite a while. The tax bill, which heavily privileges stockholders over wage earners, was slowly baked into the stock market as it became increasingly clear it would pass. In other words, good news for the market has meant bad news for workers for the past year and a half. It’s also why Davos loved Trump: he gave out goodies to rich people with the abstract promise that they would end up in the pockets of workers.

Of course, there are pieces of news that would be bad for both the workers and for the market, like a recession, and there are potential turns of events that would be good for everyone, like exciting new industries that hire lots of people. But for the foreseeable future, I’m thinking that workers should be cheering a tanking market.

### Bob likes the big audience

In response to a colleague who was a bit scared of posting some work up on the internet for all to see, Bob Carpenter writes:

I like the big audience for two reasons related to computer science principles.

The first benefit is the same reason it’s scary. The big audience is likely to find flaws. And point them out. In public! The trick is to get over feeling bad about it and realize that it’s a super powerful debugging tool for ideas. Owning up to being wrong in public is also very liberating. Turns out people don’t hold it against you at all (well, maybe they would if you were persistently and unapologetically wrong). It also provides a great teaching opportunity—if a postdoc is confused about something in their speciality, chances are that a lot of others are confused, too.

In programming, the principle is that you want routines to fail early. You want to inspect user input and if there’s a fatal problem with it that can be detected, fail right away and let the user know what the error is. Don’t fire up the algorithm and report some deeply nested error in a Fortran matrix algorithm. Something not being shot down on the blog is like passing that validation. It gives you confidence going on.

The second benefit is the same as in any writing, only the stakes are higher with the big audience. When you write for someone else, you’re much more self critical. The very act of writing can uncover problems or holes in your reasoning. I’ve started several blog posts and papers and realized at some point as I fleshed out an argument that I was missing a fundamental point.

There’s a principle in computer science called the rubber ducky.

One of the best ways to debug is to have a colleague sit down and let you explain your bug to them. Very often halfway through the explanation you find your own bug and the colleague never even understands the problem. The blog audience is your rubber ducky.

The term itself is a misnomer in that it only really works if the rubber ducky could understand what you’re saying. They don’t need to actually understand it, just be capable of understanding it. Like the no-free-lunch principle, there are no free pair programmers.

The third huge benefit is that other people have complementary skills and knowledge. They point out connections and provide hints that can prove invaluable. We found out about automatic differentiation through a blog comment on a post where I was speculating about how we could calculate gradients of log densities in C++.

I guess there’s a computer science principle there, too—modularity. You can bring in whole modules of knowledge, like we did with autodiff.

I agree. It’s all about costs and benefits. The cost of an error is low if discovered early. You want to stress test, not to hide your errors and wait for them to be discovered later.

The post Bob likes the big audience appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Of rabbits and cannons

When does it make sense to shoot a rabbit with a cannon?

I was reminded of this question recently when I happened to come across this exchange in the comments section from a couple years ago, in the context of finding patterns in the frequencies of births on different days:

Rahul: Yes, inverting a million element matrix for this sort of problem does have the feel of killing mice with cannon.

Andrew: In many areas of research, you start with the cannon. Once the mouse is dead and you can look at it carefully from all angles, you can design an effective mousetrap. Red State Blue State went the same way: we found the big pattern only after fitting a multilevel model, but once we knew what we were looking for, it was possible to see it in the raw data.

The post Of rabbits and cannons appeared first on Statistical Modeling, Causal Inference, and Social Science.

### The curse of dimensionality and finite vs. asymptotic convergence results

Related to our (Aki, Andrew, Jonah) Pareto smoothed importance sampling paper, I (Aki) have received a few times the comment: why bother with Pareto smoothing when you can always choose the proposal distribution so that the importance ratios are bounded, and then the central limit theorem holds? The curse of dimensionality here is that the papers they refer to used low-dimensional experiments, and the results do not carry over to high dimensions. Readers of this blog should not be surprised that things look different in high dimensions. In high dimensions the probability mass is far from the mode; it is spread thinly over the surface of a high-dimensional sphere. See, e.g., Mike’s paper, Bob’s case study, and blog post.

In importance sampling, one working solution in low dimensions is to use a mixture of two proposals: one component tries to match the mode, and the other makes sure the tails go down more slowly than the tails of the target, ensuring bounded ratios. In the following I look only at the behavior with one component which has a thicker tail, so that the importance ratios are bounded (but I have run a similar experiment with the mixture, too).
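The bounded-ratio property of a defensive mixture is easy to verify in one dimension. This is a generic numerical sketch, not code from the post; the 50/50 mixture weight and the Student-t(3) heavy-tail component are my own choices. Since q(x) ≥ 0.5·p(x) pointwise, the ratio p/q can never exceed 2:

```python
import numpy as np

def norm_pdf(x):
    # standard normal density
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def t3_pdf(x):
    # Student-t density with 3 degrees of freedom
    return 2.0 / (np.sqrt(3.0) * np.pi) * (1.0 + x**2 / 3.0) ** -2

x = np.linspace(-20, 20, 4001)
p = norm_pdf(x)                            # target
q = 0.5 * norm_pdf(x) + 0.5 * t3_pdf(x)   # defensive mixture proposal
ratio = p / q

print(ratio.max())  # bounded above by 2, since q >= 0.5 * p everywhere
```

The bound holds by construction in any dimension; the point of the post is that boundedness alone does not save you once the dimension grows.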

The target distribution is a multivariate normal with zero mean and unit covariance matrix. In the first case the proposal distribution is also normal, but with scale 1.1 in each dimension. The scale is just slightly larger than that of the target, and we are often lucky if we can guess the scale of the proposal to within 10% accuracy. I take 100,000 draws from the proposal distribution.

The following figure shows what happens as the number of dimensions goes from 1 to 1024.

The upper subplot shows the estimated effective sample size. By D=512, the importance-weighted 100,000 draws have only a few practically non-zero weights. The middle subplot shows the convergence rate compared to independent sampling, i.e., how fast the variance goes down. By D=1024 the convergence rate has dropped dramatically, and getting any improvement in accuracy requires more and more draws. The bottom subplot shows the Pareto khat diagnostic (see the paper for details). The dashed line is k=0.5, which is the limit for the variance being finite, and the dotted line is our suggested limit for practically useful performance when using PSIS. But how can khat be larger than 0.5 when we have bounded weights? The central limit theorem has not failed here; we simply have not yet reached the asymptotic regime where it kicks in!
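The collapse of the effective sample size can be reproduced in a few lines. This is my own minimal numpy reconstruction of the same setup (target N(0, I_D), proposal N(0, 1.1²·I_D)), not the original experiment code; the function name and seed are arbitrary:

```python
import numpy as np

def is_ess(D, n=100_000, scale=1.1, seed=0):
    """Effective sample size of importance weights for target N(0, I_D)
    and proposal N(0, scale^2 I_D), computed stably in log space."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, scale, size=(n, D))
    # log target density minus log proposal density (shared constants cancel)
    logw = (-0.5 * (x**2).sum(axis=1)
            + 0.5 * ((x / scale) ** 2).sum(axis=1)
            + D * np.log(scale))
    logw -= logw.max()              # stabilize before exponentiating
    w = np.exp(logw)
    w /= w.sum()
    return 1.0 / np.sum(w**2)       # ESS = 1 / sum of squared normalized weights

for D in (1, 16, 128, 512):
    print(D, is_ess(D))
```

With D=1 nearly all 100,000 draws contribute; by D=512 only a handful of weights are practically non-zero, exactly as in the figure.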

The next plot shows in more detail what happens with D=1024.

Since humans are lousy at looking at 1024-dimensional plots, the top subplot shows the one-dimensional marginal densities of the target (blue) and the proposal (red) as a function of the distance from the origin, r = sqrt(sum_{d=1}^D x_d^2). The proposal density has only a 1.1 times larger scale than the target, but most of the draws from the proposal are far from the typical set of the target! The vertical dashed line shows the 1e-6 quantile of the proposal; that is, when we take 100,000 draws, 90% of the time we don’t get any draws below it. The middle subplot shows the importance ratio function, and we can see that the highest value is at 0, but that value is larger than 2*10^42! That’s a big number. The bottom subplot zooms the y-axis so that we can see the importance ratios near that 1e-6 quantile. Check the y-axis: it still runs from 0 to 1e6. So if we are lucky we may get a draw below the dashed line, but then it is likely to get all the weight. The importance ratio function is practically this steep everywhere we could possibly get draws within the age of the universe: the 1e-80 quantile is at 21.5 (1e80 is the estimated number of atoms in the visible universe), and that is still far from the region where the boundedness of the importance ratio function starts to matter.
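The 2*10^42 figure can be checked with a back-of-the-envelope computation in log space (my own sanity check, not code from the post): at the origin the exponential terms of both densities vanish, so the importance ratio is just the ratio of normalizing constants, 1.1^D.

```python
import math

# Target N(0, I_D) vs proposal N(0, 1.1^2 I_D): at x = 0 the exponents are
# zero, so p(0)/q(0) = scale^D, the ratio of the normalizing constants.
D, scale = 1024, 1.1
log10_ratio = D * math.log10(scale)
print(f"ratio at origin = 10^{log10_ratio:.1f}")
```

The result is about 10^42.4, i.e. roughly 2.4e42, consistent with the "larger than 2*10^42" reading of the middle subplot.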

I have more similar plots with thick-tailed Student’s t, mixtures of proposals, etc., but I’ll spare you further plots. As long as there is some difference between target and proposal, taking the number of dimensions high enough breaks both IS and PSIS (with PSIS giving a slight improvement in performance and, more importantly, diagnosing the problem and improving the Monte Carlo estimate).

Besides taking into account that many methods which work in low dimensions can break in high dimensions, we need to focus more on finite-case performance. As seen here, it does not help us that the CLT holds if we can never reach the asymptotic regime (just as the Metropolis algorithm in high dimensions may require close to infinite time to produce useful results). The Pareto diagnostic has been empirically shown to provide very good finite-case convergence rate estimates, which also match some theoretical bounds.

### Distilled News

One of the first lessons you’ll receive in machine learning is that there are two broad categories: supervised and unsupervised learning. Supervised learning is usually explained as the one to which you provide the correct answers, training data, and the machine learns the patterns to apply to new data. Unsupervised learning is (apparently) where the machine figures out the correct answer on its own. Supposedly, unsupervised learning can discover something new that has not been found in the data before. Supervised learning cannot do that.
So you love reading but can’t afford to splurge too much money on books? Quite a lot of the data science and machine learning books out there fall in the expensive category. It’s only fair, given how much thought and effort goes into writing and publishing them. But there are a few kind souls who have made their work available to everyone… for free! If you want to become a data scientist or AI engineer, you couldn’t have asked for more. Here is a collection of 10 such free ebooks on machine learning. We begin the list with the basics of statistics, then machine learning foundations, and finally advanced machine learning.
Learn how to build your own recommendation engine with the help of Python, from basic models to content-based and collaborative filtering recommender systems.

### If you did not already know

Graph-Sparse Logistic Regression
We introduce Graph-Sparse Logistic Regression, a new algorithm for classification for the case in which the support should be sparse but connected on a graph. We validate this algorithm against synthetic data and benchmark it against L1-regularized Logistic Regression. We then explore our technique in the bioinformatics context of proteomics data on the interactome graph. We make all our experimental code public and provide GSLR as an open source package. …

Partial Transfer Learning
Adversarial learning has been successfully embedded into deep networks to learn transferable features, which reduce distribution discrepancy between the source and target domains. Existing domain adversarial networks assume a fully shared label space across domains. In the presence of big data, there is strong motivation for transferring both classification and representation models from existing big domains to unknown small domains. This paper introduces partial transfer learning, which relaxes the shared-label-space assumption so that the target label space need only be a subspace of the source label space. Previous methods typically match the whole source domain to the target domain, which is prone to negative transfer for the partial transfer problem. We present Selective Adversarial Network (SAN), which simultaneously circumvents negative transfer by selecting out the outlier source classes and promotes positive transfer by maximally matching the data distributions in the shared label space. Experiments demonstrate that our models exceed state-of-the-art results for partial transfer learning tasks on several benchmark datasets. …

Information Retrieval (IR)
Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing. Automated information retrieval systems are used to reduce what has been called “information overload”. Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications. An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy. …
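The process described above can be illustrated with a toy inverted index. The mini-corpus and the term-overlap scoring below are hypothetical, chosen only to show how a query can match several documents with different degrees of relevancy:

```python
from collections import defaultdict

docs = {
    1: "compressive sensing and sparse recovery",
    2: "deep learning for image recognition",
    3: "sparse coding and deep learning",
}

# Build an inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return doc ids ranked by how many query terms they contain."""
    scores = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

print(search("sparse deep learning"))  # doc 3 matches all three terms
```

Real systems replace the overlap count with weighted scores (e.g. TF-IDF or BM25), but the index structure is the same.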

### Whats new on arXiv

Hyperparameters are critical in machine learning, as different hyperparameters often result in models with significantly different performance. Hyperparameters may be deemed confidential because of their commercial value and the confidentiality of the proprietary algorithms that the learner uses to learn them. In this work, we propose attacks for stealing the hyperparameters that are learned by a learner. We call these hyperparameter stealing attacks. Our attacks are applicable to a variety of popular machine learning algorithms such as ridge regression, logistic regression, support vector machine, and neural network. We evaluate the effectiveness of our attacks both theoretically and empirically. For instance, we evaluate our attacks on Amazon Machine Learning. Our results demonstrate that our attacks can accurately steal hyperparameters. We also study countermeasures. Our results highlight the need for new defenses against our hyperparameter stealing attacks for certain machine learning algorithms.
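For intuition, here is a sketch of why a learned ridge-regression model leaks its regularization hyperparameter. This is my own reconstruction of the general idea, not the paper’s code: at the optimum the stationarity condition Xᵀ(Xw − y) + λw = 0 pins down λ given the observed model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

# "Victim" learner: ridge regression with a secret regularization weight lam.
lam = 0.5
w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# An attacker who observes (X, y, w) solves the stationarity condition
#   X^T (X w - y) + lam * w = 0
# for lam in the least-squares sense.
g = X.T @ (X @ w - y)
lam_hat = -(w @ g) / (w @ w)
print(lam_hat)  # recovers the secret lam = 0.5 up to floating-point error
```

For ridge the recovery is exact because g = -λw identically; for other learners the paper's attacks solve the analogous optimality conditions approximately.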
Scaling Bayesian optimization to high dimensions is a challenging task, as the global optimization of a high-dimensional acquisition function can be expensive and often infeasible. Existing methods depend either on a limited set of active variables or on the additive form of the objective function. We propose a new method for high-dimensional Bayesian optimization that uses a dropout strategy to optimize only a subset of variables at each iteration. We derive theoretical bounds for the regret and show how they can inform the derivation of our algorithm. We demonstrate the efficacy of our algorithms for optimization on two benchmark functions and two real-world applications: training cascade classifiers and optimizing alloy composition.
Recent works investigated the generalization properties of deep neural networks (DNNs) by studying the Information Bottleneck in DNNs. However, the measurement of mutual information (MI) is often inaccurate due to density estimation. To address this issue, we propose to measure the dependency instead of the MI between layers in DNNs. Specifically, we propose to use the Hilbert-Schmidt Independence Criterion (HSIC) as the dependency measure, which can measure the dependence of two random variables without estimating probability densities. Moreover, HSIC is a special case of Squared-loss Mutual Information (SMI). In the experiments, we empirically evaluate the generalization property using HSIC in both the reconstruction and prediction auto-encoding (AE) architectures.
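The empirical HSIC estimator itself is only a few lines. This is a generic sketch with RBF kernels, not the paper's implementation; the bandwidth and sample size are arbitrary choices:

```python
import numpy as np

def rbf_gram(v, sigma=1.0):
    """RBF kernel Gram matrix of a 1-D sample."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(u, v, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / n^2 with H = I - 11^T / n."""
    n = len(u)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf_gram(u, sigma), rbf_gram(v, sigma)
    return float(np.trace(K @ H @ L @ H)) / n**2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
z = rng.normal(size=200)   # independent of x

dep = hsic(x, x**2)        # strongly dependent pair: clearly positive
indep = hsic(x, z)         # independent pair: near zero
print(dep, indep)
```

Note that dependence between x and x² is invisible to linear correlation but is picked up by HSIC, which is exactly why it is attractive as a layer-wise dependency measure.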
Random walk based distance measures for graphs such as commute-time distance are useful in a variety of graph algorithms, such as clustering, anomaly detection, and creating low dimensional embeddings. Since such measures hinge on the spectral decomposition of the graph, the computation becomes a bottleneck for large graphs and does not scale easily to graphs that cannot be loaded in memory. Most existing graph mining libraries for large graphs either resort to sampling or exploit the sparsity structure of such graphs for spectral analysis. However, such methods do not work for dense graphs constructed for studying pairwise relationships among entities in a data set. Examples of such studies include analyzing pairwise locations in gridded climate data for discovering long distance climate phenomena. These graph representations are fully connected by construction and cannot be sparsified without loss of meaningful information. In this paper we describe CADDeLaG, a framework for scalable computation of commute-time distance based anomaly detection in large dense graphs without the need to load the entire graph in memory. The framework relies on Apache Spark’s memory-centric cluster-computing infrastructure and consists of two building blocks: a decomposable algorithm for commute time distance computation and a distributed linear system solver. We illustrate the scalability of CADDeLaG and its dependency on various factors using both synthetic and real world data sets. We demonstrate the usefulness of CADDeLaG in identifying anomalies in a climate graph sequence that have been historically missed due to ad hoc graph sparsification, and on an election donation data set.
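For a small graph, the commute-time distance being scaled up here can be computed directly from the pseudoinverse of the graph Laplacian, C(i, j) = vol(G)·(L⁺ᵢᵢ + L⁺ⱼⱼ − 2L⁺ᵢⱼ). A dense-matrix sketch (the path graph is a hypothetical example; the pseudoinverse is exactly the step that becomes infeasible at scale):

```python
import numpy as np

# Path graph 0 - 1 - 2: two edges, volume vol(G) = sum of degrees = 4.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A   # graph Laplacian
Lp = np.linalg.pinv(L)           # Moore-Penrose pseudoinverse (the expensive step)

def commute_time(i, j):
    vol = A.sum()                # = 2 * number of edges
    return vol * (Lp[i, i] + Lp[j, j] - 2.0 * Lp[i, j])

print(commute_time(0, 1))  # adjacent nodes: effective resistance 1, so 4.0
print(commute_time(0, 2))  # end-to-end: effective resistance 2, so 8.0
```

The O(n³) pseudoinverse and O(n²) memory of this direct computation are what motivate the distributed linear-solver formulation in the abstract.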
Existing multi-agent reinforcement learning methods are typically limited to a small number of agents. When the number of agents increases greatly, learning becomes intractable due to the curse of dimensionality and the exponential growth of agent interactions. In this paper, we present Mean Field Reinforcement Learning, where the interactions within the population of agents are approximated by those between a single agent and the average effect of the overall population or neighboring agents; the interplay between the two entities is mutually reinforcing: the learning of the individual agent’s optimal policy depends on the dynamics of the population, while the dynamics of the population change according to the collective patterns of the individual policies. We develop practical mean field Q-learning and mean field Actor-Critic algorithms and analyze the convergence of the solution. Experiments on resource allocation, Ising model estimation, and battle game tasks verify the learning effectiveness of our mean field approaches in handling many-agent interactions in a population.
In this paper we propose a new algorithm for streaming principal component analysis. With limited memory, small devices cannot store all the samples in the high-dimensional regime. Streaming principal component analysis aims to find the $k$-dimensional subspace which can explain the most variation of the $d$-dimensional data points that come into memory sequentially. In order to deal with large $d$ and large $N$ (number of samples), most streaming PCA algorithms update the current model using only the incoming sample and then dump the information right away to save memory. However, the information contained in previously streamed data could be useful. Motivated by this idea, we develop a new streaming PCA algorithm called History PCA that achieves this goal. By using $O(Bd)$ memory with $B\approx 10$ being the block size, our algorithm converges much faster than existing streaming PCA algorithms. By changing the number of inner iterations, the memory usage can be further reduced to $O(d)$ while maintaining a comparable convergence speed. We provide theoretical guarantees for the convergence of our algorithm along with the rate of convergence. We also demonstrate on synthetic and real-world data sets that our algorithm compares favorably with other state-of-the-art streaming PCA methods in terms of convergence speed and performance.
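
A minimal sketch of the block-streaming idea, using a generic block Oja/power update rather than the authors’ History PCA: only a d × k basis is kept in memory and refreshed from each incoming block.

```python
import numpy as np

def streaming_pca(blocks, d, k, lr=0.1):
    # Maintain an orthonormal d x k basis Q; each block of B samples
    # contributes a covariance-times-Q ascent step, so memory stays O(Bd).
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for X in blocks:                      # X has shape (B, d)
        G = X.T @ (X @ Q) / len(X)        # sample covariance applied to Q
        Q, _ = np.linalg.qr(Q + lr * G)   # step, then re-orthonormalize
    return Q
```
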
In network science, there is often a need to sort the graph nodes. While the sorting strategy may differ, in general sorting is performed by exploiting the network structure. In particular, the PageRank metric has been used in the past decade in different ways to produce a ranking based on how many neighbors point to a specific node. PageRank is simple, easy to compute, and effective in many applications; however, it comes at a price: as PageRank is an application of the random walk, the arc weights need to be normalized. This normalization, while necessary, introduces a series of unwanted side effects. In this paper, we propose a generalization of PageRank named the Black Hole Metric, which mitigates the problem. We devise a scenario in which the side effects are particularly impactful on the ranking, test the new metric on both real and synthetic networks, and show the results.
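
The normalization the abstract refers to is easy to spot in a plain weighted PageRank power iteration (standard PageRank, not the proposed Black Hole Metric; dangling nodes are omitted for brevity):

```python
def pagerank(adj, damping=0.85, iters=100):
    # adj maps node -> list of (neighbour, arc_weight); no dangling nodes.
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in adj}
        for v, edges in adj.items():
            total = sum(w for _, w in edges)   # the forced weight normalization
            for u, w in edges:
                nxt[u] += damping * rank[v] * (w / total)
        rank = nxt
    return rank
```
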
The discovery of time series motifs has emerged as one of the most useful primitives in time series data mining. Researchers have shown its utility for exploratory data mining, summarization, visualization, segmentation, classification, clustering, and rule discovery. Although there has been more than a decade of extensive research, there is still no technique to allow the discovery of time series motifs in the presence of missing data, despite the well-documented ubiquity of missing data in scientific, industrial, and medical datasets. In this work, we introduce a technique for motif discovery in the presence of missing data. We formally prove that our method is admissible, producing no false negatives. We also show that our method can piggy-back off the fastest known motif discovery method with a small constant factor time/space overhead. We demonstrate our approach on diverse datasets with varying amounts of missing data.
Convolutional operator learning is increasingly gaining attention in many signal processing and computer vision applications. Learning kernels has mostly relied on so-called local approaches that extract and store many overlapping patches across training signals. Due to memory demands, local approaches have limitations when learning kernels from large datasets — particularly with multi-layered structures, e.g., convolutional neural network (CNN) — and/or applying the learned kernels to high-dimensional signal recovery problems. The so-called global approach has been studied within the ‘synthesis’ signal model, e.g., convolutional dictionary learning, overcoming the memory problems by careful algorithmic designs. This paper proposes a new convolutional analysis operator learning (CAOL) framework in the global approach, and develops a new convergent Block Proximal Gradient method using a Majorizer (BPG-M) to solve the corresponding block multi-nonconvex problems. To learn diverse filters within the CAOL framework, this paper introduces an orthogonality constraint that enforces a tight-frame (TF) filter condition, and a regularizer that promotes diversity between filters. Numerical experiments show that, for tight majorizers, BPG-M significantly accelerates the CAOL convergence rate compared to the state-of-the-art method, BPG. Numerical experiments for sparse-view computed tomography show that CAOL using TF filters significantly improves reconstruction quality compared to a conventional edge-preserving regularizer. Finally, this paper shows that CAOL can be useful to mathematically model a CNN, and the corresponding updates obtained via BPG-M coincide with core modules of the CNN.
We consider the problem of estimating the differences between two causal directed acyclic graph (DAG) models given i.i.d. samples from each model. This is of interest for example in genomics, where large-scale gene expression data is becoming available under different cellular contexts, for different cell types, or disease states. Changes in the structure or edge weights of the underlying causal graphs reflect alterations in the gene regulatory networks and provide important insights into the emergence of a particular phenotype. While the individual networks are usually very large, containing high-degree hub nodes and thus difficult to learn, the overall change between two related networks can be sparse. We here provide the first provably consistent method for directly estimating the differences in a pair of causal DAGs without separately learning two possibly large and dense DAG models and computing their difference. Our two-step algorithm first uses invariance tests between regression coefficients of the two data sets to estimate the skeleton of the difference graph and then orients some of the edges using invariance tests between regression residual variances. We demonstrate the properties of our method through a simulation study and apply it to the analysis of gene expression data from ovarian cancer and during T-cell activation.
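
A toy version of the first step: flag predictors of a given variable whose regression coefficients fail to be invariant across the two data sets (a fixed threshold stands in for a proper statistical test):

```python
import numpy as np

def changed_coefficients(X1, X2, j, tol=0.1):
    # Regress variable j on all the others in each data set and flag
    # predictors whose coefficients differ between the two regimes.
    def coefs(X):
        y, Z = X[:, j], np.delete(X, j, axis=1)
        return np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.abs(coefs(X1) - coefs(X2)) > tol
```
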
We study optimal covariate balance for causal inferences from observational data when rich covariates and complex relationships necessitate flexible modeling with neural networks. Standard approaches such as propensity weighting and matching/balancing fail in such settings due to miscalibrated propensity nets and inappropriate covariate representations, respectively. We propose a new method based on adversarial training of a weighting and a discriminator network that effectively addresses this methodological gap. This is demonstrated through new theoretical characterizations of the method as well as empirical results using both fully connected architectures to learn complex relationships and convolutional architectures to handle image confounders, showing how this new method can enable strong causal analyses in these challenging settings.
Traditional event detection methods rely heavily on manually engineered rich features. Recent deep learning approaches alleviate this problem through automatic feature engineering. But such efforts, like traditional methods, have so far only focused on single-token event mentions, whereas in practice an event can also be a phrase. We instead use forward-backward recurrent neural networks (FBRNNs) to detect events that can be either words or phrases. To the best of our knowledge, this is one of the first efforts to handle multi-word events and also the first attempt to use RNNs for event detection. Experimental results demonstrate that FBRNN is competitive with state-of-the-art methods on the ACE 2005 and the Rich ERE 2015 event detection tasks.
Predicting how a proposed cancer treatment will affect a given tumor can be cast as a machine learning problem, but the complexity of biological systems, the number of potentially relevant genomic and clinical features, and the lack of very large scale patient data repositories make this a unique challenge. ‘Pure data’ approaches to this problem are underpowered to detect combinatorially complex interactions and are bound to uncover false correlations despite statistical precautions taken (1). To investigate this setting, we propose a method to integrate simulations, a strong form of prior knowledge, into machine learning, a combination which to date has been largely unexplored. The results of multiple simulations (under various uncertainty scenarios) are used to compute similarity measures between every pair of samples: sample pairs are given a high similarity score if they behave similarly under a wide range of simulation parameters. These similarity values, rather than the original high dimensional feature data, are used to train kernelized machine learning algorithms such as support vector machines, thus handling the curse-of-dimensionality that typically affects genomic machine learning. Using four synthetic datasets of complex systems–three biological models and one network flow optimization model–we demonstrate that when the number of training samples is small compared to the number of features, the simulation kernel approach dominates over no-prior-knowledge methods. In addition to biology and medicine, this approach should be applicable to other disciplines, such as weather forecasting, financial markets, and agricultural management, where predictive models are sought and informative yet approximate simulations are available. The Python SimKern software, the models (in MATLAB, Octave, and R), and the datasets are made freely available at https://…/SimKern.
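
A sketch of the kernel construction under a simplifying assumption: similarity is taken as the fraction of simulation settings on which two samples produce the same discretized outcome (the paper’s actual similarity measures may differ). The resulting matrix K can then be handed to any kernelized learner, e.g. an SVM with a precomputed kernel.

```python
import numpy as np

def simulation_kernel(sim_outcomes):
    # sim_outcomes[i, s] is the discretized outcome of simulation setting s
    # run on sample i; K[i, j] is the agreement rate between samples i and j.
    return (sim_outcomes[:, None, :] == sim_outcomes[None, :, :]).mean(axis=2)
```
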

### Science and Technology links (February 16th, 2018)

1. In all countries, in all years–without exception–girls did better than boys in academic performance (PISA) tests.
2. Vinod Khosla said:

### Magister Dixit

“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.” H.G. Wells/Samuel S. Wilks (1895/1951)

### Logistic Regression: A Concise Technical Overview

Interested in learning the concepts behind Logistic Regression (LogR)? Looking for a concise introduction to LogR? This article is for you. Includes a Python implementation and links to an R script as well.
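
As a taste of the concepts, a minimal sketch of binary logistic regression fitted by gradient descent (plain NumPy; this is not the article’s implementation):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=2000):
    # Gradient descent on the negative log-likelihood.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(y = 1)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)
```
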

### Resurgence of AI During 1983-2010

We discuss supervised learning, unsupervised learning and reinforcement learning, neural networks, and six reasons that helped AI research and development move ahead.

### Big Data: Promises, Challenges and Threats

Marketing researchers are wondering what lies ahead for big data. Marketing Scientist Kevin Gray asks Professor Koen Pauwels for his thoughts.

### “Write No Matter What” . . . about what?

Scott Jaschik interviews Joli Jensen (link from Tyler Cowen), a professor of communication who wrote a new book called “Write No Matter What: Advice for Academics.”

Her advice might well be reasonable—it’s hard for me to judge; as someone who blogs a few hundred times a year, I’m not really part of Jensen’s target audience. She offers “a variety of techniques to help . . . reduce writing anxiety; secure writing time, space and energy; recognize and overcome writing myths; and maintain writing momentum.” She recommends “spending at least 15 minutes a day in contact with your writing project . . . writing groups, focusing on accountability (not content critiques), are great ways to maintain weekly writing time commitments.”

Writing is non-algorithmic, and I’ve pushed hard against advice-givers who don’t seem to get that. So, based on this quick interview, my impression is that Jensen’s on the right track.

I’d just like to add one thing: If you want to write, it helps to have something to write about. Even when I have something I really want to say, writing can be hard. I can only imagine how hard it would be if I was just trying to write, to produce, without something I felt it was important to share with the world.

So, when writing, imagine your audience, and ask yourself why they should care. Tell ’em what they don’t know.

Also, when you’re writing, be aware of your audience’s expectations. You can satisfy their expectations or confound their expectations, but it’s good to have a sense of what you’re doing.

And here’s some specific advice about academic writing, from a few years ago.

P.S. In that same post, Cowen also links to a bizarre book review by Edward Luttwak who, among other things, refers to “George Pataki of New York, whose own executive experience as the State governor ranged from the supervision of the New York City subways to the discretionary command of considerable army, air force and naval national guard forces.” The New York Air National Guard, huh? I hate to see the Times Literary Supplement fall for this sort of pontificating. I guess that there will always be a market for authoritative-sounding pundits. But Tyler Cowen should know better. Maybe it was the New York thing that faked him out. If Luttwak had been singing the strategic praises of the New Jersey Air National Guard, that might’ve set off Cowen’s B.S. meter.

The post “Write No Matter What” . . . about what? appeared first on Statistical Modeling, Causal Inference, and Social Science.

### 2018 IEEE Big Data Cup

The IEEE Big Data conference series, started in 2013, has established itself as the top-tier research conference in Big Data. We invite industrial, government, and academic organizations to submit proposals to organize a Data Challenge for the 2018 IEEE International Conference on Big Data.

### AI: Beyond the Hype and Into Reality

Buzzwords are part of what makes the internet go ’round, and you’d be hard-pressed to find a more popular and controversial term today than Artificial Intelligence (AI). Once just an ethereal concept that interested the nerdiest among us, AI has become a very real obsession in all corners of the tech

The post AI: Beyond the Hype and Into Reality appeared first on Dataconomy.

### Pym.js Library Vulnerability in widgetframe Package

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

## What’s Up?

The NPR Visuals Team created and maintains a JavaScript library that makes it super easy to embed iframes on web pages and have said documents remain responsive.

The widgetframe R htmlwidget uses pym.js to bring this (much needed) functionality into widgets and (eventually) shiny apps.

NPR reported a critical vulnerability in this library on February 15th, 2018 with no details (said details will be coming next week).

Per NPR’s guidance, any production code using pym.js needs to be pulled or updated to use this new library.

I created an issue & pushed up a PR that incorporates the new version. NOTE that the YAML config file in the existing CRAN package and GitHub dev version incorrectly has 1.3.2 as the version (it’s really the 1.3.1 dev version).

A look at the diff suggests that the library was not performing URL sanitization (and now is).

## Watch Out For Standalone Docs

Any R markdown docs compiled in “standalone” mode will need to be recompiled and re-published as the vulnerable pym.js library comes along for the ride in those documents.

Regardless of “standalone mode”, if you used widgetframe in any context, anything created is vulnerable, whether it was compiled standalone or not.

## FIN

Once the final details are released I’ll update this post and may do a new post. Until then:

• check if you’ve used widgetframe (directly or indirectly)
• REMOVE ALL VULNERABLE DOCS from RPubs, GitHub pages, your web site (etc) today
• regenerate all standalone documents ASAP
• regenerate your blogs, books, dashboards, etc ASAP with the patched code; DO THIS FOR INTERNAL as well as internet-facing content.
• monitor this space

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Books I liked in 2017 by @ellis2013nz

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

I last blogged about books in September 2016 and it’s still in my 10 most popular posts. I thought I’d update it with some of the data-relevant books I read in 2017. Mostly these are books that were published before 2017 – it’s just that it was last year that I got around to reading them. Well, this is a “web log” (ie “blog”) after all, so what could be more pertinent?

I read a lot of books, and most of them aren’t directly relevant for here (although I always look for interesting topics eg in history that might lead to a post). So I’m only going to talk about those which are somehow related to statistics or data science, and which I rated 3 stars or more. I try to calibrate my ratings to the Goodreads star system, so 3 stars means “I liked it” and I’d be happy to recommend it. Whereas 5 stars means something that really made a particular impact on me, and was enjoyable to read to boot. In the data context, it means I’m definitely doing things differently as a result of that book – new techniques, better practice, new ideas, or all three.

So here’s some books I read and liked in 2017.

## 5 stars

### Kimball and Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling

In 2017 I found myself much deeper and more hands on in the world of data warehousing than I had been, and I did some reading to get my formal understanding up to speed with the very ambitious project I was driving. I knew a bit about data modelling for relational databases, and was familiar with thinking in terms of “facts” and “dimensions”, but I had some definite gaps in the warehouse-specific domain of knowledge. This book was a complete eye-opener, and page after page I found myself saying “yes that makes sense!” “I should have thought of that” and “I wish I’d known that before”.

I can’t recommend this book enough; for a statistician or data scientist who wants to understand how databases can be used to support analytics, this is the one book I’d say is a “must read”. There’s key terminology and process to help you talk with IT; established good practice; and lots of good thoughts on structuring data, handling change, and basically on organising data. It’s based on years of learning the hard way in building datamarts and combining them into warehouses. You’ll find it useful even if you’re not designing and building a database. In fact, I’ll probably write a blog post just on this some time in the next few months.

This is also one book I wish everyone in IT departments would read, or re-read if necessary. Data warehousing is different to other database design. It has particular problems, and it has standard, tried-and-true solutions. Those standard solutions are neglected at an organisation’s peril. In fact, many of the concepts set out here are ones that I think even non-technical senior managers, involved in data governance for instance, need to understand.

### McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan

2017 was the year I determined to update my understanding of practical modern Bayesianism, and Statistical Rethinking was one of a half dozen books I read or at least started on the subject. (A few of them I haven’t finished yet, particularly two great Andrew Gelman numbers which won’t be covered today.)

Statistical Rethinking is very well written, engaging, humorous and very very eye-opening. I now recommend it to all sorts of people. I think it’s particularly perfect for people who learnt stats the bad way (ie cookbooks) and need a jolt to point them in the much more sustainable and scalable direction of thinking in terms of principles, how we know what we know, and philosophy of science. But it’s also good for anyone in the position I was – wanting a friendly introduction into the modern Bayesian world, from whatever level of familiarity with other aspects of data.

While I read this book end to end without doing the exercises because that’s my learning style, they also look really useful for people who like that sort of thing. I also don’t use the R package that came with it as an intermediary between R and Stan but went straight to writing my own Stan programs; that too will be a matter of preference. The book was great as an introduction to Stan when it came to that part, but for me it was its conceptual clarity and communicative teaching style that was really valuable.

### Stan Development Team, Stan Modeling Language User’s Guide and Reference Manual

Yes, 2017 was indeed the year I began to take Bayesianism seriously. Discovering Stan was a big part of that. I incorporated Stan models into my election forecasts and other analysis on this blog like this post on traffic accidents, as part of my self-learning voyage.

The self-learning was possible because Stan is very well documented, and the heart of that is this User’s Guide and Reference Manual. Sure, you’d want to supplement it with more theoretical Bayesian texts and with examples and blog posts, but the guts of the actual language is the manual, as it should be. So even though it’s not a “book” one purchases from a book-seller, I’m including this invaluable document in my five star list for 2017.

## 4 stars

### Duncan, Elliot and Juan Jose Salazar, Statistical Confidentiality: Principles and Practice

This was another work-related one for me. A readable and authoritative text on the latest techniques and challenges of statistical disclosure control. Very much a book for professional statisticians who have to deal with the intriguing and highly problematic area of outthinking the snoopers. But anyone who has ever thought “don’t you just need to delete the names and addresses and then you can release the microdata?” should probably at least skim through this book to get an idea of why national statistical offices always say “I think you’ll find it’s more complicated than that.”

### Robinson, Introduction to Empirical Bayes: Examples from Baseball Statistics

One of the few books on this list that was actually published for the first time in 2017, this is a great introduction to the specific techniques of treating a broad empirical distribution as a prior expectation. You don’t need to be a baseball fan. For those who don’t know, baseball is a ball game, a bit like the rounders game we used to play as kids when we couldn’t get cricket; I understand it’s popular in America where perhaps the turf for cricket pitches can’t grow for climatic reasons.

Both entertaining and edifying, the book introduces “the empirical Bayesian approach to estimation, credible intervals, A/B testing, mixture models, and other methods, all through the example of baseball batting averages.” With simple-to-follow R code to make it all real.
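
The core trick can be sketched in a few lines. A minimal beta-binomial version, fitting the prior by the method of moments (one simple choice; the book works through more careful fits):

```python
import numpy as np

def eb_shrink(hits, at_bats):
    # Fit a Beta(a, b) prior to the raw batting averages, then return each
    # player's posterior mean: small samples are shrunk hardest to the prior.
    rates = hits / at_bats
    m, v = rates.mean(), rates.var()
    common = m * (1.0 - m) / v - 1.0   # method of moments; assumes v < m(1-m)
    a, b = m * common, (1.0 - m) * common
    return (a + hits) / (a + b + at_bats)
```
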

### Kirsanov, The Book of Inkscape: The Definitive Guide to the Free Graphics Editor

Inkscape is vector graphics creation and editing software, the open source competitor to Adobe Illustrator. Apart from its lack of support for printer-ready CMYK colours, it’s nearly there as a viable professional alternative. It’s definitely good enough for what I need it for, which is touching up the odd SVG graphic, and building the occasional icon or other vector image. It can even be run in batch mode from the command line, which is always useful.

Not being a specialist in the field, I don’t know if this book really is the best one, but it certainly is a good book – very comprehensive introduction to the tool.

It’s worth taking this opportunity to list here three critical bits of graphics software that I think repay familiarity for data science types:

• Inkscape (vector graphics editor equivalent of Illustrator)
• Scribus (publishing / layout tool, equivalent of InDesign)
• Gimp (raster graphics editor, equivalent of PhotoShop)

They’re all very powerful, which comes with a learning curve; but it’s worth being aware of what they can do and considering if an investment in more familiarity may be worthwhile.

### McGrayne, The Theory That Would Not Die

Subtitled “How Bayes’ Rule Cracked The Enigma Code, Hunted Down Russian Submarines, And Emerged Triumphant From Two Centuries Of Controversy”, which pretty much sums it up. A fun and edifying bit of history of the ups and downs in the credibility of Bayes’ rule as a cornerstone of inference about an uncertain world.

### Kasparov, Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins

One of the famous Machine-defeats-Human images is of course Kasparov in deep thought during his losing match with Deep Blue in 1997. This book gives a brilliant and enthralling human angle on the lead up to that point and Kasparov’s own abiding interest in chess-playing artificial intelligence. He argues convincingly that the particular outcome on that particular day was in fact somewhat unfair (not IBM’s finest hour), while still making clear that this is really not the point, with the incredible developments in statistics since then.

Kasparov is an engaging writer and genuine big picture thinker, so this becomes much more than a personal account of a chess match and gives a nuanced view of what it means to be both or either “creative” and “intelligent”. A good entry point into thinking about machine intelligence. It also has some revealing insights into how high level competitive chess works, but those are bonuses.

### Hendy, Silencing Science

I’ve long had an interest in the politics of public debate by government-funded actors (such as this piece in the overseas aid context). In fact, I’m aware of the boundaries of public comment all the time while working on this blog. In this book based in the New Zealand experience, Hendy argues that there are too many instances of scientists being silenced – either through explicit suppression or institutionally facilitated self-censorship. “Few scientific institutions … feel secure enough to criticise the government of the day.”

It’s a shame (to say the least) that the people who know most about any particular topic are often precluded from contributing to the public debate on it. This issue extends beyond scientists of course – with public servants in particular being subject to ethical and professional constraints of some complexity. A good read.

### Jackman, Bayesian Analysis for the Social Sciences

Another read as part of my self-imposed Bayesian re-education program, and a particularly interesting one for me. This was a good introduction to a range of Bayesian methods across the social sciences, but was particularly useful in its clear explanation of state space modelling of irregularly and imperfectly observed time series processes, using an Australian election as an example. I reworked Jackman’s example, with his data but rewritten in Stan, in a couple of blog posts. I subsequently successfully adapted this method for the multi-party New Zealand election.

### Gelman, Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do

Good introduction to analysis of (pre-2009) US politics by the master statistician.

## 3 stars

Well, this has gotten long hasn’t it… these last few books, all of them worth reading, I’ll leave just as titles and links:



### Four short links: 16 February 2018

Machine Design, Metrics, Layered Learning, and Automatically Mergeable Data Structure

1. Towards Designing Machines -- survey of theory and approaches to building machines that can design things.
2. Review of the Tyranny of Metrics (Tim Harford) -- Rather than rely on the informed judgment of people familiar with the situation, we gather meaningless numbers at great cost. We then use them to guide our actions, predictably causing unintended damage.
3. Physics Travel Guide -- a tool that makes learning physics easier. Each page here contains three layers which contain explanations with increasing level of sophistication. We call these layers: layman, student and researcher. These layers make sure that readers can always find an explanation they understand. One of these for security or coding would be interesting.
4. Automerge -- A JSON-like data structure that can be modified concurrently by different users, and merged again automatically.


### Mikaela Shiffrin pulling away for gold

Mikaela Shiffrin won her first gold medal in PyeongChang with a fraction of a second lead. In events where athletes race side-by-side, it’s easier to see how close such a lead is. But with alpine skiing, it feels more like a race against a clock. So to capture some of the dramatics of the former, Derek Watkins and Denise Lu for The New York Times imagined the results had all skiers raced down at the same time.

It reminds me of The Times’ coverage of Usain Bolt in the 2012 Summer Olympics.


### What’s new on arXiv

Training machine learning (ML) models with large datasets can incur significant resource contention on shared clusters. This training typically involves many iterations that continually improve the quality of the model. Yet in exploratory settings, better models can be obtained faster by directing resources to jobs with the most potential for improvement. We describe SLAQ, a cluster scheduling system for approximate ML training jobs that aims to maximize the overall job quality. When allocating cluster resources, SLAQ explores the quality-runtime trade-offs across multiple jobs to maximize system-wide quality improvement. To do so, SLAQ leverages the iterative nature of ML training algorithms, by collecting quality and resource usage information from concurrent jobs, and then generating highly-tailored quality-improvement predictions for future iterations. Experiments show that SLAQ achieves an average quality improvement of up to 73% and an average delay reduction of up to 44% on a large set of ML training jobs, compared to resource fairness schedulers.
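
A caricature of the scheduling idea, assuming each job exposes a predicted marginal quality gain for its next unit of resource (both the greedy rule and the gain functions here are illustrative, not SLAQ’s actual policy):

```python
def allocate(jobs, total_units):
    # jobs maps name -> function: units already held -> predicted quality
    # gain from one more unit; give each unit to the most promising job.
    alloc = {name: 0 for name in jobs}
    for _ in range(total_units):
        best = max(jobs, key=lambda name: jobs[name](alloc[name]))
        alloc[best] += 1
    return alloc
```
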
Modern neural networks are very powerful predictive models, but they are often incapable of recognizing when their predictions may be wrong. Closely related to this is the task of out-of-distribution detection, where a network must determine whether or not an input is outside of the set on which it is expected to safely perform. To jointly address these issues, we propose a method of learning confidence estimates for neural networks that is simple to implement and produces intuitively interpretable outputs. We demonstrate that on the task of out-of-distribution detection, our technique surpasses recently proposed techniques which construct confidence based on the network’s output distribution, without requiring any additional labels or access to out-of-distribution examples. Additionally, we address the problem of calibrating out-of-distribution detectors, where we demonstrate that misclassified in-distribution examples can be used as a proxy for out-of-distribution examples.
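
For contrast, the output-distribution confidence the abstract compares against is commonly just the maximum softmax probability:

```python
import numpy as np

def max_softmax_confidence(logits):
    # Baseline out-of-distribution score: max softmax probability per example.
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return p.max(axis=1)
```
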
The aim of knowledge graphs is to gather knowledge about the world and provide a structured representation of this knowledge. Current knowledge graphs are far from complete. To address the incompleteness of the knowledge graphs, link prediction approaches have been developed which make probabilistic predictions about new links in a knowledge graph given the existing links. Tensor factorization approaches have proven promising for such link prediction problems. In this paper, we develop a simple tensor factorization model called SimplE, through a slight modification of the Polyadic Decomposition model from 1927. The complexity of SimplE grows linearly with the size of embeddings. The embeddings learned through SimplE are interpretable, and certain types of expert knowledge in terms of logical rules can be incorporated into these embeddings through weight tying. We prove SimplE is fully-expressive and derive a bound on the size of its embeddings for full expressivity. We show empirically that, despite its simplicity, SimplE outperforms several state-of-the-art tensor factorization techniques.
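
The scoring function itself is compact: each entity carries a head and a tail embedding, each relation a forward and an inverse vector, and the two directions are averaged (a sketch of the scoring rule as described, without the training loop):

```python
import numpy as np

def simple_score(h, t, v, v_inv, head, rel, tail):
    # <a, b, c> = sum of the elementwise triple product
    fwd = np.sum(h[head] * v[rel] * t[tail])
    bwd = np.sum(h[tail] * v_inv[rel] * t[head])
    return 0.5 * (fwd + bwd)
```
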
We propose a simple, tractable lower bound on the mutual information contained in the joint generative density of any latent variable generative model: the GILBO (Generative Information Lower BOund). It offers a data independent measure of the complexity of the learned latent variable description, giving the log of the effective description length. It is well-defined for both VAEs and GANs. We compute the GILBO for 800 GANs and VAEs trained on MNIST and discuss the results.
In this work, we investigate the Batch Normalization technique and propose its probabilistic interpretation. We propose a probabilistic model and show that Batch Normalization maximizes a lower bound on its marginalized log-likelihood. Then, according to the new probabilistic model, we design an algorithm which acts consistently during training and testing. However, inference becomes computationally inefficient. To reduce the memory and computational cost, we propose Stochastic Batch Normalization — an efficient approximation of the proper inference procedure. This method provides us with a scalable uncertainty estimation technique. We demonstrate the performance of Stochastic Batch Normalization on popular architectures (including deep convolutional architectures: VGG-like and ResNets) on the MNIST and CIFAR-10 datasets.
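The train/test inconsistency the abstract refers to can be made concrete with a minimal sketch (my own toy code, not the paper's): standard Batch Normalization standardizes with per-batch statistics at training time but with running averages at test time, so the same input is transformed differently in the two modes.

```python
import numpy as np

# Hedged sketch of standard Batch Normalization (names and momentum are
# illustrative choices, not taken from the paper).
def batchnorm_train(x, running_mean, running_var, momentum=0.9, eps=1e-5):
    mu, var = x.mean(axis=0), x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)           # per-batch statistics
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return x_hat, running_mean, running_var

def batchnorm_test(x, running_mean, running_var, eps=1e-5):
    return (x - running_mean) / np.sqrt(running_var + eps)  # running statistics

rng = np.random.default_rng(0)
batch = rng.normal(loc=2.0, scale=3.0, size=(64, 4))
x_hat, rm, rv = batchnorm_train(batch, np.zeros(4), np.ones(4))

# Train-mode outputs are exactly standardized per batch...
print(np.allclose(x_hat.mean(axis=0), 0))
# ...but applying test-mode normalization to the same batch generally is not,
# which is the inconsistency Stochastic Batch Normalization targets.
print(np.allclose(batchnorm_test(batch, rm, rv).mean(axis=0), 0, atol=1e-3))
```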
An accelerator is a specialized integrated circuit designed to perform specific computations faster than if those were performed by a CPU or GPU. A Field-Programmable DNN learning and inference accelerator (FProg-DNN) using hybrid systolic and non-systolic techniques, distributed information-control and a deep pipelined structure is proposed, and its microarchitecture and operation are presented here. Reconfigurability accommodates diverse DNN designs and allows different numbers of workers to be assigned to different layers as a function of the relative difference in computational load among layers. The computational delay per layer is made roughly the same along the pipelined accelerator structure. VGG-16 and the recently proposed Inception Modules are used to show the flexibility of the FProg-DNN reconfigurability. Special structures were also added for a combination of convolution layer, map coincidence and feedback for state-of-the-art learning with a small set of examples, which is the focus of a companion paper by the author (Franca-Neto, 2018). The accelerator described is able to reconfigure from (1) allocating all of a DNN's computations to a single worker, at one extreme of sub-optimal performance, to (2) optimally allocating workers per layer according to the computational load in each DNN layer to be realized. Due to the pipelined architecture, more than 50x speedup is achieved relative to GPUs or TPUs. This speed-up is a consequence of hiding the delay in transporting activation outputs from one layer to the next in a DNN behind the computations in the receiving layer. This FProg-DNN concept has been simulated and validated at the behavioral-functional level.
Many generative models attempt to replicate the density of their input data. However, this approach is often undesirable, since data density is highly affected by sampling biases, noise, and artifacts. We propose a method called SUGAR (Synthesis Using Geometrically Aligned Random-walks) that uses a diffusion process to learn a manifold geometry from the data. Then, it generates new points evenly along the manifold by pulling randomly generated points into its intrinsic structure using a diffusion kernel. SUGAR equalizes the density along the manifold by selectively generating points in sparse areas of the manifold. We demonstrate how the approach corrects sampling biases and artifacts, while also revealing intrinsic patterns (e.g. progression) and relations in the data. The method is applicable for correcting missing data, finding hypothetical data points, and learning relationships between data features.
With an increasing use of data-driven models to control robotic systems, it has become important to develop a methodology for validating such models before they can be deployed to design a controller for the actual system. Specifically, it must be ensured that the controller designed for an abstract or learned model would perform as expected on the actual physical system. We propose a context-specific validation framework to quantify the quality of a learned model based on a distance metric between the closed-loop actual system and the learned model. We then propose an active sampling scheme to compute a probabilistic upper bound on this distance in a sample-efficient manner. The proposed framework validates the learned model against only those behaviors of the system that are relevant for the purpose for which we intend to use this model, and does not require any a priori knowledge of the system dynamics. Several simulations illustrate the practicality of the proposed framework for validating the models of real-world systems.
Neural networks have been shown to be an effective tool for learning algorithms over graph-structured data. However, graph representation techniques–that convert graphs to real-valued vectors for use with neural networks–are still in their infancy. Recent works have proposed several approaches (e.g., graph convolutional networks), but these methods have difficulty scaling and generalizing to graphs with different sizes and shapes. We present Graph2Seq, a new technique that represents graphs as an infinite time-series. By not limiting the representation to a fixed dimension, Graph2Seq scales naturally to graphs of arbitrary sizes and shapes. Graph2Seq is also reversible, allowing full recovery of the graph structure from the sequence. By analyzing a formal computational model for graph representation, we show that an unbounded sequence is necessary for scalability. Our experimental results with Graph2Seq show strong generalization and new state-of-the-art performance on a variety of graph combinatorial optimization problems.
For many machine learning problem settings, particularly with structured inputs such as sequences or sets of objects, a distance measure between inputs can be specified more naturally than a feature representation. However, most standard machine learning models are designed for inputs with a vector feature representation. In this work, we consider the estimation of a function $f:\mathcal{X} \rightarrow \mathbb{R}$ based solely on a dissimilarity measure $d:\mathcal{X}\times\mathcal{X} \rightarrow \mathbb{R}$ between inputs. In particular, we propose a general framework to derive a family of \emph{positive definite kernels} from a given dissimilarity measure, which subsumes the widely-used \emph{representative-set method} as a special case, and relates to the well-known \emph{distance substitution kernel} in a limiting case. We show that functions in the corresponding Reproducing Kernel Hilbert Space (RKHS) are Lipschitz-continuous w.r.t. the given distance metric. We provide a tractable algorithm to estimate a function from this RKHS, and show that it enjoys better generalizability than Nearest-Neighbor estimates. Our approach draws from the literature of Random Features, but instead of deriving feature maps from an existing kernel, we construct novel kernels from a random feature map that we specify given the distance measure. We conduct classification experiments with such disparate domains as strings, time series, and sets of vectors, where our proposed framework compares favorably to existing distance-based learning methods such as $k$-nearest-neighbors, distance-substitution kernels, pseudo-Euclidean embedding, and the representative-set method.
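To make the "distance substitution" idea concrete, here is a minimal toy sketch of my own (not the paper's framework): plug a dissimilarity $d$ into an exponential to obtain a kernel-like similarity, here on strings with a simple Hamming-style distance standing in for a more realistic string metric.

```python
import numpy as np

def hamming(s1, s2):
    # Positionwise mismatches plus a length penalty (a crude string distance)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2)) + abs(len(s1) - len(s2))

def dist_sub_kernel(s1, s2, gamma=0.5):
    return np.exp(-gamma * hamming(s1, s2))

strings = ["kernel", "kennel", "colonel"]
K = np.array([[dist_sub_kernel(a, b) for b in strings] for a in strings])

# K is symmetric with ones on the diagonal; for an arbitrary metric it need
# not be positive definite, which is what motivates the paper's construction.
print(np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))
```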
In this paper, we propose a robust change detection method for intelligent visual surveillance. This method, named M4CD, includes three major steps. Firstly, a sample-based background model that integrates color and texture cues is built and updated over time. Secondly, multiple heterogeneous features (including brightness variation, chromaticity variation, and texture variation) are extracted by comparing the input frame with the background model, and a multi-source learning strategy is designed to online estimate the probability distributions for both foreground and background. The three features are approximately conditionally independent, making multi-source learning feasible. Pixel-wise foreground posteriors are then estimated with Bayes rule. Finally, the Markov random field (MRF) optimization and heuristic post-processing techniques are used sequentially to improve accuracy. In particular, a two-layer MRF model is constructed to represent pixel-based and superpixel-based contextual constraints compactly. Experimental results on the CDnet dataset indicate that M4CD is robust under complex environments and ranks among the top methods.
Defects in software components are unavoidable and lead not only to a waste of time and money but also to many serious consequences. To build predictive models, previous studies focus on manually extracting features or using tree representations of programs, and on exploiting different machine learning algorithms. However, the performance of these models is not high, since the existing features and tree structures often fail to capture the semantics of programs. To capture programs' semantics more deeply, this paper proposes to leverage precise graphs representing program execution flows, and deep neural networks for automatically learning defect features. Firstly, control flow graphs are constructed from the assembly instructions obtained by compiling source code; we thereafter apply multi-view multi-layer directed graph-based convolutional neural networks (DGCNNs) to learn semantic features. The experiments on four real-world datasets show that our method significantly outperforms the baselines, including several other deep learning approaches.
Clustering consists of grouping together samples based on their similar properties. The problem of modeling groups of samples and features simultaneously is known as Co-Clustering. This paper introduces ROCCO – a Robust Continuous Co-Clustering algorithm. ROCCO is a scalable, hyperparameter-free, easy and ready-to-use algorithm for addressing Co-Clustering problems in practice over massive cross-domain datasets. It operates by learning a graph-based two-sided representation of the input matrix. The underlying proposed optimization problem is non-convex, which assures a flexible pool of solutions. Moreover, we prove that it can be solved with near-linear time complexity in the input size. An exhaustive large-scale experimental testbed conducted with both synthetic and real-world datasets demonstrates ROCCO's properties in practice: (i) state-of-the-art performance in cross-domain real-world problems including Biomedicine and Text Mining; (ii) very low sensitivity to hyperparameter settings; (iii) robustness to noise; and (iv) linear empirical scalability in practice. These results highlight ROCCO as a powerful general-purpose co-clustering algorithm for cross-domain practitioners, regardless of their technical background.
Causal inference analysis is the estimation of the effects of actions on outcomes. In the context of healthcare data this means estimating the outcome of counter-factual treatments (i.e. including treatments that were not observed) on a patient’s outcome. Compared to classic machine learning methods, evaluation and validation of causal inference analysis is more challenging because ground truth data of counter-factual outcome can never be obtained in any real-world scenario. Here, we present a comprehensive framework for benchmarking algorithms that estimate causal effect. The framework includes unlabeled data for prediction, labeled data for validation, and code for automatic evaluation of algorithm predictions using both established and novel metrics. The data is based on real-world covariates, and the treatment assignments and outcomes are based on simulations, which provides the basis for validation. In this framework we address two questions: one of scaling, and the other of data-censoring. The framework is available as open source code at https://…-Causal-Inference-Benchmarking-Framework.
Data sizes that cannot be processed by conventional data storage and analysis systems are named Big Data. The term also refers to new technologies developed to store, process and analyze large amounts of data. Automatic information retrieval about the contents of a large number of documents produced by different sources, identifying research fields and topics, extracting document abstracts, and discovering patterns are some of the topics that have been studied in the field of big data. In this study, the Naive Bayes classification algorithm, run on a data set consisting of scientific articles, is used to automatically determine the classes to which these documents belong. We have developed an efficient system that can analyze Turkish scientific documents with a distributed document classification algorithm run on a Cloud Computing infrastructure. The Apache Mahout library is used in the study. The servers required for classifying and clustering distributed documents are
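The classification step can be illustrated with a tiny multinomial Naive Bayes sketch (my own toy example with made-up English documents — the study itself used Apache Mahout on Turkish documents at scale):

```python
import math
from collections import Counter, defaultdict

train = [("physics experiment energy", "science"),
         ("poem novel author", "literature"),
         ("energy particle theory", "science"),
         ("novel story chapter", "literature")]

class_docs = defaultdict(list)
for text, label in train:
    class_docs[label].extend(text.split())

vocab = {w for text, _ in train for w in text.split()}

def predict(text):
    best, best_lp = None, -math.inf
    for label, words in class_docs.items():
        counts = Counter(words)
        lp = math.log(sum(1 for _, l in train if l == label) / len(train))
        for w in text.split():
            # Laplace smoothing keeps unseen words from zeroing the score
            lp += math.log((counts[w] + 1) / (len(words) + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("particle energy"))  # "science"
```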
Crowdsourcing is an important avenue for collecting machine learning data, but crowdsourcing can go beyond simple data collection by employing the creativity and wisdom of crowd workers. Yet crowd participants are unlikely to be experts in statistics or predictive modeling, and it is not clear how well non-experts can contribute creatively to the process of machine learning. Here we study an end-to-end crowdsourcing algorithm where groups of non-expert workers propose supervised learning problems, rank and categorize those problems, and then provide data to train predictive models on those problems. Problem proposal includes and extends feature engineering because workers propose the entire problem, not only the input features but also the target variable. We show that workers without machine learning experience can collectively construct useful datasets and that predictive models can be learned on these datasets. In our experiments, the problems proposed by workers covered a broad range of topics, from politics and current events to problems capturing health behavior, demographics, and more. Workers also favored questions showing positively correlated relationships, which has interesting implications given many supervised learning methods perform as well with strong negative correlations. Proper instructions are crucial for non-experts, so we also conducted a randomized trial to understand how different instructions may influence the types of problems proposed by workers. In general, shifting the focus of machine learning tasks from designing and training individual predictive models to problem proposal allows crowdsourcers to design requirements for problems of interest and then guide workers towards contributing to the most suitable problems.
Deep convolutional networks have been the state-of-the-art approach for a wide variety of tasks over the last few years. Their successes have, in many cases, turned them into the default model in quite a few domains. In this work we will demonstrate that convolutional networks have limitations that may, in some cases, hinder them from learning properties of the data which are easily recognizable by traditional, less demanding, models. To this end, we present a series of competitive analysis studies on image recognition and text analysis tasks, for which convolutional networks are known to provide state-of-the-art results. In our studies, we inject a truth-revealing signal, indiscernible to the network, thus hitting the network's blind spots time and again. The signal does not impair the network's existing performance, but it does provide an opportunity for a significant performance boost by models that can capture it. The various forms of the carefully designed signals shed light on the strengths and weaknesses of convolutional networks, which may provide insights both for theoreticians who study the power of deep architectures, and for practitioners who consider applying convolutional networks to the task at hand.
In real-world applications of education and human teaching, an effective teacher chooses the next example intelligently based on the learner’s current state. However, most of the existing works in algorithmic machine teaching focus on the batch setting, where adaptivity plays no role. In this paper, we study the case of teaching consistent, version space learners in an interactive setting—at any time step, the teacher provides an example, the learner performs an update, and the teacher observes the learner’s new state. We highlight that adaptivity does not speed up the teaching process when considering existing models of version space learners, such as the ‘worst-case’ model (the learner picks the next hypothesis randomly from the version space) and ‘preference-based’ model (the learner picks hypothesis according to some global preference). Inspired by human teaching, we propose a new model where the learner picks hypothesis according to some local preference defined by the current hypothesis. We show that our model exhibits several desirable properties, e.g., adaptivity plays a key role, and the learner’s transitions over hypotheses are smooth/interpretable. We develop efficient teaching algorithms for our model, and demonstrate our results via simulations as well as user studies.
Collaboration requires coordination, and we coordinate by anticipating our teammates' future actions and adapting to their plan. In some cases, our teammates' actions early on can give us a clear idea of what the remainder of their plan is, i.e. what action sequence we should expect. In others, they might leave us less confident, or even lead us to the wrong conclusion. Our goal is for robot actions to fall in the first category: we want to enable robots to select their actions in such a way that human collaborators can easily use them to correctly anticipate what will follow. While previous work has focused on finding initial plans that convey a set goal, here we focus on finding two portions of a plan such that the initial portion conveys the final one. We introduce $t$-predictability: a measure that quantifies the accuracy and confidence with which human observers can predict the remaining robot plan from the overall task goal and the observed initial $t$ actions in the plan. We contribute a method for generating $t$-predictable plans: we search for a full plan that accomplishes the task, but in which the first $t$ actions make it as easy as possible to infer the remaining ones. The result is often different from the most efficient plan, in which the initial actions might leave a lot of ambiguity as to how the task will be completed. Through an online experiment and an in-person user study with physical robots, we find that our approach outperforms a traditional efficiency-based planner in objective and subjective collaboration metrics.

### Mix ggplot2 graphs with your favorite memes. memery 0.4.2 released.

(This article was first published on rbloggers – SNAP tech blog, and kindly contributed to R-bloggers)

Make memorable plots with memery. memery is an R package that generates internet memes including superimposed inset graphs and other atypical features, combining the visual impact of an attention-grabbing meme with graphic results of data analysis. Version 0.4.2 of memery is now on CRAN. The latest development version and a package vignette are available on GitHub.

## Changes in v0.4.2

This latest version of memery includes a demo Shiny app.

library(memery)
memeApp()


Animated gif support is now also included (example below). This relies on the magick package and ImageMagick software, but these are optional: the libraries are not required to use memery if you have no interest in animated gifs. For example, if you do not have them installed on your system, the demo Shiny app launches in your browser in a simplified form: it will only accept png and jpg files as inputs, and a default static image is shown at startup. If you do have them, the app launches in full mode, also accepts gif inputs, and the default image shown is an animated gif. The only function in memery that pertains to gifs is meme_gif, which is distinct from the main package function, meme. If you call meme_gif without the supporting libraries, it simply prints a notification about this to the console.

## Example usage

Below is an example interleaving a semi-transparent ggplot2 graph between a meme image backdrop and overlying meme text labels. The meme function will produce basic memes without requiring you to specify a number of additional arguments, though that is not the main purpose of the package. Adding a plot is as simple as passing it to the inset argument.

memery offers sensible defaults as well as a variety of basic templates for controlling how the meme and graph are spliced together. The example here shows how additional arguments can be specified to further control the content and layout. See the package vignette for a more complete set of examples and description of available features and graph templates.

Please do share your data analyst meme creations. Enjoy!

library(memery)

# Make a graph of some data
library(ggplot2)
x <- seq(0, 2*pi , length.out = 50)
panels <- rep(c("Plot A", "Plot B"), each = 50)
d <- data.frame(x = x, y = sin(x), grp = panels)
txt <- c("Philosoraptor's plots", "I like to make plots",
"Figure 1. (A) shows a plot and (B) shows another plot.")
p <- ggplot(d, aes(x, y)) + geom_line(colour = "cornflowerblue", size = 2) +
geom_point(colour = "orange", size = 4) + facet_wrap(~grp) +
labs(title = txt[1], subtitle = txt[2], caption = txt[3])

# Meme settings
img <- system.file("philosoraptor.jpg", package = "memery") # image
lab <- c("What to call my R package?", "Hmm... What? raptr is taken!?", "Noooooo!!!!") # labels
size <- c(1.8, 1.5, 2.2) # label sizes, positions, font families and colors
pos <- list(w = rep(0.9, 3), h = rep(0.3, 3), x = c(0.45, 0.6, 0.5), y = c(0.95, 0.85, 0.3))
fam <- c("Impact", "serif", "Impact")
col <- list(c("black", "orange", "white"), c("white", "black", "black"))
gbg <- list(fill = "#FF00FF50", col = "#FFFFFF75") # graph background

# Save meme
meme(img, lab, "meme.jpg", size = size, family = fam, col = col[[1]],
shadow = col[[2]], label_pos = pos, inset = p, inset_bg = gbg, mult = 2)


## Animated gif example

d$grp <- gsub("Plot", "Cat's Plot", d$grp)
p <- ggplot(d, aes(x, y)) + geom_line(colour = "white", size = 2) +
geom_point(colour = "orange", size = 1) + facet_wrap(~grp) +
labs(title = "The wiggles", subtitle = "Plots for cats",
caption = "Figure 1. Gimme sine waves.")
lab <- c("R plots for cats", "Sine wave sine wave...")
pos <- list(w = rep(0.9, 2), h = rep(0.3, 2), x = rep(0.5, 2), y = c(0.9, 0.75))
img <- "http://forgifs.com/gallery/d/228621-4/Cat-wiggles.gif"
meme_gif(img, lab, "sine.gif", size = c(1.5, 0.75), label_pos = pos,
inset = p, inset_bg = list(fill = "#00BFFF80"), fps = 20)


To leave a comment for the author, please follow the link and comment on their blog: rbloggers – SNAP tech blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Magister Dixit

“Vendors are there to sell you a tool for a problem you may or may not have yet, and they’re very good at convincing you that you need it whether you actually need it or not.” John Foreman

### Distilled News

Principal Component Analysis (PCA) is an unsupervised learning technique used to reduce the dimension of the data with minimal loss of information. PCA is used in applications such as face recognition and image compression.
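The core of PCA fits in a few lines. Here is a minimal hand-rolled sketch (illustrative only, with synthetic data): center the data, take the eigenvectors of the covariance matrix, and keep the leading components.

```python
import numpy as np

rng = np.random.default_rng(42)
# 200 points in 3-D that mostly vary along one direction
base = rng.normal(size=(200, 1))
X = np.hstack([base, 0.5 * base, 0.1 * rng.normal(size=(200, 1))])

Xc = X - X.mean(axis=0)                     # center
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigendecomposition (symmetric)
order = np.argsort(eigvals)[::-1]           # sort by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
Z = Xc @ eigvecs[:, :1]                     # project onto first component
print(explained[0])                         # close to 1: one component suffices
```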
This Monday, February the 12th, we launched a public beta of Datalore – an intelligent web application for data analysis and visualization in Python, brought to you by JetBrains. This tool turns the data science workflow into a delightful experience with the help of smart coding assistance, incremental computations, and built-in tools for machine learning.
On a regular basis, people tell me about their impressive achievements using AI. 99% of these things are completely stupid. This post may come off as a rant, but that's not so much its intent as it is to point out why we went from having very few AI experts to having so many in so little time, and to convey that most of these experts only seem experty because so few people know how to call them on their bullshit.
Imagine your Aunt Ida is in an autonomous vehicle (AV) – a self-driving car – on a city street closed to human-driven vehicles. Imagine a swarm of puppies drops from an overpass, a sinkhole opens up beneath a bus full of mathematical geniuses, or Beethoven (or Tupac) jumps into the street from the left as Mozart (or Biggie) jumps in from the right. Whatever the dilemma, imagine that the least worst option for the network of AVs is to drive the car containing your Aunt Ida into a concrete abutment. Even if the system made the right choice – all other options would have resulted in more deaths – you’d probably want an explanation.
A reinforcement learning algorithm engaging in policy improvement from a continuous stream of experience needs to solve an opportunity-cost problem. (The RL lingo for opportunity-cost is “advantage”.) Thinking about this in the context of a 2-person game, at a given state, with your existing rollout policy, is taking the first action leading to a win 1/2 the time good or bad? It could be good since the player is well behind and every other action is worse. Or it could be bad since the player is well ahead and every other action is better. Understanding one action’s long term value relative to another’s is the essence of the opportunity cost trade-off at the core of many reinforcement learning algorithms.
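The opportunity-cost framing can be shown with a toy illustration of my own (not from the article): the same action value can represent an advantage or a disadvantage depending on the state's baseline value.

```python
def advantage(q_values, action):
    # Baseline V(s) under a uniform policy; real RL algorithms estimate V(s)
    # with a learned value function instead.
    v = sum(q_values.values()) / len(q_values)
    return q_values[action] - v

# Winning 1/2 the time (Q = 0.5) is an advantage when every alternative is worse...
behind = {"a": 0.5, "b": 0.1, "c": 0.2}
# ...and a disadvantage when every alternative is better.
ahead = {"a": 0.5, "b": 0.9, "c": 0.8}
print(advantage(behind, "a") > 0, advantage(ahead, "a") < 0)  # True True
```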
In this episode of the Data Show, I spoke with Leo Meyerovich, co-founder and CEO of Graphistry. Graphs have always been part of the big data revolution (think of the large graphs generated by the early social media startups). In recent months, I’ve come across companies releasing and using new tools for creating, storing, and (most importantly) analyzing large graphs. There are many problems and use cases that lend themselves naturally to graphs, and recent advances in hardware and software building blocks have made large-scale analytics possible.
Deep learning has proven its effectiveness in many fields, such as computer vision, natural language processing (NLP), text translation, or speech to text. It takes its name from the high number of layers used to build the neural network performing machine learning tasks. There are several types of layers as well as overall network architectures, but the general rule holds that the deeper the network is, the more complexity it can grasp. This article will explain fundamental concepts of neural network layers and walk through the process of creating several types using TensorFlow. TensorFlow is the platform that contributed to making artificial intelligence (AI) available to the broader public. It’s an open source library with a vast community and great support. TensorFlow provides a set of tools for building neural network architectures, and then training and serving the models. It offers different levels of abstraction, so you can use it for cut-and-dried machine learning processes at a high level or go more in-depth and write the low-level calculations yourself.
Yesterday I was part of an introductory session on machine learning and, unsurprisingly, the issue of supervised vs. unsupervised learning came up. In social sciences, there is a definite tendency towards the former; there is more or less always a target outcome or measure that we want to optimise our models' performance for. This reminded me of a draft I had written the code for a couple of months ago but for some reason never converted into a blog post until now. This will also allow me to take a break from conflict forecasting for a bit and go back to my usual topic of the UK. My frequent usage of all things UK is at such a level that my Google Search Console insights lists r/CasualUK as the top referrer. Cheers, mates!
This blog post is about my newly released RGF package (the blog post consists mainly of the package Vignette). The RGF package is a wrapper of the Regularized Greedy Forest python package, which also includes a Multi-core implementation (FastRGF). Portability from Python to R was made possible using the reticulate package and the installation requires basic knowledge of Python. Except for the Linux Operating System, the installation on Macintosh and Windows might be somehow cumbersome (on windows the package currently can be used only from within the command prompt). Detailed installation instructions for all three Operating Systems can be found in the README.md file and in the rgf_python Github repository.

### A Recent History of the Second Amendment

Data from MassShootingTracker.org. There's not much to the code to generate it, but here it is:

library(tidyverse)
theme_set(theme_bw(14))

dd1 <- read_csv("/tmp/MST Data 2013 - 2015.csv")
dd2 <- read_csv("/tmp/MST Data 2014 - 2015.csv")
dd3 <- read_csv("/tmp/MST Data 2015 - 2015.csv")
dd4 <- read_csv("/tmp/Mass Shooting Data 2016 - 2016.csv")
dd5 <- read_csv("/tmp/Mass Shooting Data 2017 - 2017.csv")
dd6 <- read_csv("/tmp/Mass Shooting Data 2018 - 2018.csv")

dd <- rbind(dd1, dd2, dd3, dd4, dd5, dd6)

dd$year <- sapply(dd$date, function(x) as.numeric(strsplit(x, "/", fixed = TRUE)[[1]][3]))
dd$year <- ifelse(dd$year < 2000, dd$year + 2000, dd$year)
dd$ov <- 0

ggdata <- dd %>%
  group_by(year) %>%
  summarise(nkilled = sum(killed), nwounded = sum(wounded), znoverthrown = sum(ov)) %>%
  gather(key, value, -year)

plt1 <- ggplot(ggdata) +
  geom_line(aes(x = year, y = value, group = key, color = key)) +
  geom_point(aes(x = year, y = value, color = key)) +
  geom_text(data = ggdata %>% filter(key == "znoverthrown"),
    aes(x = year, y = value, group = key, label = value),
    vjust = -0.7, nudge_x = 0) +
  xlab("Year") + ylab("Count") +
  ggtitle("A Recent History of the Second Amendment",
    subtitle = "Mass shootings in the US, 2013 - 2018. Data from MassShootingTracker.org. Partial data for 2018.") +
  scale_color_brewer(palette = "Set1", name = NULL,
    labels = c("Dead", "Wounded", "Tyrannical governments overthrown by a well regulated militia")) +
  theme(legend.position = "top", legend.direction = "horizontal")

plot(plt1)

Continue Reading…

### R Packages worth a look

R to Symbolic Data Analysis (RSDA)
Symbolic Data Analysis (SDA) was proposed by professor Edwin Diday in 1987. The main purpose of SDA is to substitute the set of rows (cases) in the data table with a concept (second order statistical unit). This package implements, for the symbolic case, certain techniques of automatic classification, as well as some linear models.

Lava Estimation for the Sum of Sparse and Dense Signals (Lavash)
The lava estimation is a new technique to recover signals that are the sum of a sparse signal and a dense signal. The post-lava method corrects the shrinkage bias of lava. For more information on the lava estimation, see Chernozhukov, Hansen, and Liao (2017) <doi:10.1214/16-AOS1434>.

Bayesian Calculation of Region-Specific Fixation Index to Detect Local Adaptation (BlockFeST)
An R implementation of an extension of the ‘BayeScan’ software (Foll, 2008) <DOI:10.1534/genetics.108.092221> for codominant markers, adding the option to group individual SNPs into pre-defined blocks.
A typical application of this new approach is the identification of genomic regions, genes, or gene sets containing one or more SNPs that evolved under directional selection.

Continue Reading…

### RSiteCatalyst Version 1.4.14 Release Notes

(This article was first published on randyzwitch.com, and kindly contributed to R-bloggers)

Like the last several updates, this blog post will be fairly short, given that only a single bug fix was added. Thanks again to GitHub user leocwlau, who reported that the GetReportSuiteGroups function added an additional field AND provided the solution. No other bug fixes were made, nor was any additional functionality added. Version 1.4.14 of RSiteCatalyst was submitted to CRAN today and should be available for download in the coming days.

## Community Contributions

As I've mentioned in many a blog post before this one, I encourage all users of the software to continue reporting bugs via GitHub issues, especially if you can provide a working code example. Even better, a fix via pull request will ensure that your bug is addressed in a timely manner and benefits others in the community.

Note: Please don't email directly via the email in the RSiteCatalyst package; it will not be returned. Having a valid email contact in the package is a requirement for a package to be listed on CRAN so they can contact the package author; it is not meant to imply I can/will provide endless, personalized support for free.

To leave a comment for the author, please follow the link and comment on their blog: randyzwitch.com.
Continue Reading…

### Importing 30GB of data in R with sparklyr

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

Disclaimer: the first part of this blog post draws heavily from Working with CSVs on the Command Line, a beautiful resource that lists very nice tips and tricks for working with CSV files before having to load them into R, or any other statistical software. I highly recommend it! If you find this interesting, also read Data Science at the Command Line, another great resource!

In this blog post I am going to show you how to analyze 30GB of data. 30GB does not qualify as big data, but it’s large enough that you cannot simply import it into R and start working on it, unless you have a machine with a lot of RAM.

Let’s start by downloading some data. I am going to import and analyze (very briefly) the airline dataset that you can download from Microsoft here. I downloaded the file AirOnTimeCSV.zip from AirOnTime87to12. Once you decompress it, you’ll end up with 303 csv files, each around 80MB. Before importing them into R, I will use command line tools to bind the rows together. But first, let’s make sure that the datasets all have the same columns. I am using Linux, and if you are too, or if you are using macOS, you can follow along. Windows users who installed the Linux Subsystem can also use the commands I am going to show!

First, I’ll use the head command in bash.
If you’re familiar with head() from R, the head command in bash works exactly the same:

[18-02-15 21:12] brodriguesco in /Documents/AirOnTimeCSV ➤ head -5 airOT198710.csv
"YEAR","MONTH","DAY_OF_MONTH","DAY_OF_WEEK","FL_DATE","UNIQUE_CARRIER","TAIL_NUM","FL_NUM",
1987,10,1,4,1987-10-01,"AA","","1",12478,"JFK","NY",12892,"LAX","CA","0900","0901",1.00,
1987,10,2,5,1987-10-02,"AA","","1",12478,"JFK","NY",12892,"LAX","CA","0900","0901",1.00
1987,10,3,6,1987-10-03,"AA","","1",12478,"JFK","NY",12892,"LAX","CA","0900","0859",-1.00
1987,10,4,7,1987-10-04,"AA","","1",12478,"JFK","NY",12892,"LAX","CA","0900","0900",0.00,

Let’s also check the first 5 lines of the last file:

[18-02-15 21:13] cbrunos in brodriguesco in /Documents/AirOnTimeCSV ➤ head -5 airOT201212.csv
"YEAR","MONTH","DAY_OF_MONTH","DAY_OF_WEEK","FL_DATE","UNIQUE_CARRIER","TAIL_NUM","FL_NUM",
2012,12,1,6,2012-12-01,"AA","N322AA","1",12478,"JFK","NY",12892,"LAX","CA","0900","0852",
2012,12,2,7,2012-12-02,"AA","N327AA","1",12478,"JFK","NY",12892,"LAX","CA","0900","0853",
2012,12,3,1,2012-12-03,"AA","N319AA","1",12478,"JFK","NY",12892,"LAX","CA","0900","0856"
2012,12,4,2,2012-12-04,"AA","N329AA","1",12478,"JFK","NY",12892,"LAX","CA","0900","1006"

Why do that in bash instead of R? This way, I don’t need to import the data into R before checking its contents! It does look like the structure did not change.

Before importing the data into R, I am going to bind the rows of the datasets using other command line tools. Again, the reason I don’t import all the files into R is that I would need around 30GB of RAM to do so. So it’s easier to do it with bash:

head -1 airOT198710.csv > combined.csv
for file in $(ls airOT*); do cat $file | sed "1 d" >> combined.csv; done

On the first line I use head again to copy only the column names (the first line of the first file) into a new file called combined.csv. The > operator may look like the now well-known pipe operator in R, %>%, but bash’s equivalent of %>% is actually |, not >.
> redirects the output of the command on its left-hand side to the file on its right-hand side, not to another command. On the second line, I loop over the files. I list the files with ls, and because I only want to loop over those named airOTxxxxx, I use the glob pattern airOT* to list just those. The second part is do cat $file. do is self-explanatory, and cat stands for concatenate; think of it as head, but printing all rows instead of just 5. $file is one element of the list of files I am looping over. Because I don’t want to see the contents of $file on my terminal, I redirect the output with the pipe, |, to another command, sed. sed is given the option "1 d", which deletes the first line (the header) from $file before it gets appended with >> to combined.csv. If you found this interesting, read more about it here.
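The whole pattern can be sketched end-to-end on a couple of tiny synthetic CSVs. The file names and contents below are made up for illustration (stand-ins for the real airOT*.csv files), and the header check with sort -u is an extra sanity step not in the post:

```shell
# Work in a scratch directory with two tiny fake CSV parts
# (hypothetical stand-ins for the real airOT*.csv files).
mkdir -p /tmp/combine_demo && cd /tmp/combine_demo
printf '"YEAR","MONTH"\n1987,10\n1987,11\n' > part1.csv
printf '"YEAR","MONTH"\n2012,12\n' > part2.csv

# Sanity check: print the first line of every file and deduplicate;
# a single surviving line means all headers are identical.
head -q -n 1 part*.csv | sort -u | wc -l    # prints 1

# Keep the header once, then append only the data rows of each file,
# stripping each file's header with sed "1 d".
head -1 part1.csv > combined.csv
for file in part*.csv; do sed "1 d" "$file" >> combined.csv; done

wc -l < combined.csv    # prints 4: one header line plus three data rows
```

Note that sed "1 d" "$file" is equivalent to the cat $file | sed "1 d" pipe from the post: sed can read the file directly, saving one process per file.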

This creates a 30GB CSV file that you can then import. But how? There seem to be several ways to import and work with larger-than-memory data in R on your personal computer. I chose to use {sparklyr}, an R package that allows you to work with Apache Spark from R. Apache Spark is a fast and general engine for large-scale data processing, and {sparklyr} not only offers bindings to it, but also provides a complete {dplyr} backend. Let’s start:

library(sparklyr)
library(tidyverse)

spark_dir = "/my_2_to_disk/spark/"

I first load {sparklyr} and the {tidyverse}, and also define a spark_dir. This is because Spark creates a lot of temporary files that I want to save there instead of on my root partition, which is on my SSD. My root partition only has around 20GB of space left, so whenever I tried to import the data I would get the following error:

java.io.IOException: No space left on device

In order to avoid this error, I point Spark at this directory on my 2TB hard disk. I set the temporary directory in the configuration using the lines below:

config = spark_config()

config$`sparklyr.shell.driver-java-options` <- paste0("-Djava.io.tmpdir=", spark_dir)

(The backticks are needed because the option name contains hyphens, which R would otherwise parse as subtraction.) This is not sufficient, however; when I tried to read in the data, I got another error:

java.lang.OutOfMemoryError: Java heap space

The solution for this one is to add the following lines to your config:

config$`sparklyr.shell.driver-memory` <- "4G"
config$`sparklyr.shell.executor-memory` <- "4G"
config$spark.yarn.executor.memoryOverhead <- "512"

Finally, I can load the data. Because I am working on my machine, I connect to a "local" Spark instance. Then, using spark_read_csv(), I specify the Spark connection sc, a name for the data inside the database, and the path to it:

sc = spark_connect(master = "local", config = config)

air = spark_read_csv(sc, name = "air", path = "combined.csv")

On my machine, this took around 25 minutes, and RAM usage was around 6GB.

It is possible to use standard {dplyr} verbs with {sparklyr} objects, so if I want the mean delay at departure per day, I can simply write:

tic = Sys.time()
mean_dep_delay = air %>%
group_by(YEAR, MONTH, DAY_OF_MONTH) %>%
summarise(mean_delay = mean(DEP_DELAY))
(toc = Sys.time() - tic)
Time difference of 0.05634999 secs

That’s amazing, only 0.06 seconds to compute these means! Wait a minute, that’s weird… I mean my computer is brand new and quite powerful but still… Let’s take a look at mean_dep_delay:

head(mean_dep_delay)
# Source:   lazy query [?? x 4]
# Database: spark_connection
# Groups:   YEAR, MONTH
   YEAR MONTH DAY_OF_MONTH mean_delay
  <int> <int>        <int>      <dbl>
1  1987    10            9       6.71
2  1987    10           10       3.72
3  1987    10           12       4.95
4  1987    10           14       4.53
5  1987    10           23       6.48
6  1987    10           29       5.77
Warning messages:
1: Missing values are always removed in SQL.
Use AVG(x, na.rm = TRUE) to silence this warning
2: Missing values are always removed in SQL.
Use AVG(x, na.rm = TRUE) to silence this warning

Surprisingly, printing this takes around 5 minutes. Why? Look at the class of mean_dep_delay: it’s a lazy query that only gets evaluated once I need it. Look at the first line: lazy query [?? x 4]. This means that I don’t even know how many rows are in mean_dep_delay! Its contents only get computed once I explicitly ask for them. I do so with the collect() function, which transfers the Spark object into R’s memory:

tic = Sys.time()
r_mean_dep_delay = collect(mean_dep_delay)
(toc = Sys.time() - tic)
Time difference of 5.2399 mins

Also, because it took such a long time to compute, I save the result to disk:

saveRDS(r_mean_dep_delay, "mean_dep_delay.rds")

So now that I transferred this sparklyr table to a standard tibble in R, I can create a nice plot of departure delays:

library(lubridate)

dep_delay =  r_mean_dep_delay %>%
arrange(YEAR, MONTH, DAY_OF_MONTH) %>%
mutate(date = ymd(paste(YEAR, MONTH, DAY_OF_MONTH, sep = "-")))

ggplot(dep_delay, aes(date, mean_delay)) + geom_smooth()
## geom_smooth() using method = 'gam'

That’s it for now, but in a future blog post I will continue to explore this data!

If you found this blog post useful, you might want to follow me on twitter for blog post updates.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.


### Affordable automatic deployment of Spark and HDFS with Kubernetes and Gitlab CI/CD

(This article was first published on Angel Sevilla Camins' Blog, and kindly contributed to R-bloggers)

## Summary

Running an application on Spark with external dependencies, such as R and Python packages, requires the installation of these dependencies on all the workers. To automate this tedious process, a continuous deployment workflow has been developed using Gitlab CI/CD. This workflow consists of: (i) building the HDFS and Spark Docker images with the required dependencies (Python and R) for the workers and the master, and (ii) deploying the images on a Kubernetes cluster. For this, we will be using an affordable cluster made of mini PCs. More importantly, we will demonstrate that this cluster is fully operational. The Spark cluster is accessible via the Spark UI, Zeppelin, and RStudio. In addition, HDFS is fully integrated with Kubernetes. Source code for the custom Docker images and the Kubernetes object definitions can be found here and here, respectively.
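As a rough illustration of the two-stage workflow described above, a minimal .gitlab-ci.yml could look like the sketch below. Every name in it (image tags, registry URL, manifest paths) is hypothetical and not taken from the article; the real configuration lives in the linked repositories.

```yaml
# Hypothetical two-stage pipeline: build a Spark worker image that bundles
# the R/Python dependencies, then apply the Kubernetes manifests.
stages:
  - build
  - deploy

build_spark_image:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  script:
    - docker build -t registry.example.com/spark-worker:latest spark-worker/
    - docker push registry.example.com/spark-worker:latest

deploy_to_kubernetes:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - kubectl apply -f k8s/
```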

Go here to read the entire blog.

To leave a comment for the author, please follow the link and comment on their blog: Angel Sevilla Camins' Blog.
