My Data Science Blogs

March 26, 2019

Pedestrian Detection in Aerial Images Using RetinaNet

Object detection in aerial images is a challenging and interesting problem. By training a RetinaNet model in Keras to detect objects in aerial images, we can extract valuable information from them.


Most Americans like big businesses.

Tyler Cowen asks:

Why is there so much suspicion of big business?

Perhaps in part because we cannot do without business, so many people hate or resent business, and they love to criticize it, mock it, and lower its status. Business just bugs them. . . .

The short answer is, No, I don’t think there is so much suspicion of big business in this country. No, I don’t think people love to criticize, mock and lower the status of big business.

This came up a few years ago, and at the time I pulled out data from a 2007 survey showing that just about every big business you could think of was popular, with the only exception being oil companies. Microsoft, Walmart, Citibank, GM, Pfizer: you name it, the survey respondents were overwhelmingly positive.

Nearly two-thirds of respondents say corporate profits are too high, but, “more than seven in ten agree that ‘the strength of this country today is mostly based on the success of American business’ – an opinion that has changed very little over the past 20 years.”

Corporations are more popular with Republicans than with Democrats, but most of the corporations in the survey were popular with a clear majority in either party.

Big business does lots of things for us, and the United States is a proudly capitalist country, so it’s no shocker that most businesses in the survey were very popular.

So maybe the question is, Why did an economist such as Cowen think that people view big business so negatively?

My quick guess is that we notice negative statements more than positive statements. Cowen himself roots for big business, he’s generally on the side of big business, so when he sees any criticism of it, he bristles. He notices the criticism and is bothered by it. When he sees positive statements about big business, that all seems so sensible that perhaps he hardly notices. The negative attitudes are jarring to him, and so more noticeable. Perhaps in the same way that I notice bad presentations of data: an ugly table or graph is to me like fingernails on a blackboard.

Anyway, it’s perfectly reasonable for Cowen to be interested in those people who “hate or resent business, and they love to criticize it, mock it, and lower its status.” We should just remember that, at least from these survey data, it seems that this is a small minority of people.

Why did I write this post?

The bigger point here is that this is an example of something I see a lot, which is a social scientist or pundit coming up with theories to explain some empirical pattern in the world, but it turns out the pattern isn’t actually real. This came up years ago with Red State Blue State, when I noticed journalists coming up with explanations for voting patterns that were not happening (see for example here) and of course it comes up a lot with noise-mining research, whether it be a psychologist coming up with theories to explain ESP, or a sociologist coming up with theories to explain spurious patterns in sex ratios.

It’s fine to explain data; it’s just important to be aware of what’s being explained. In the context of the above-linked Cowen post, it’s fine to answer the question, “If business is so good, why is it so disliked?”—as long as this sentence is completed as follows: “If business is so good, why is it so disliked by a minority of Americans?” Explaining minority positions is important; we should just be clear it’s a minority.

Or of course it’s possible that Cowen has access to other data I haven’t looked at, perhaps more recent surveys that would modify my empirical understanding. That would be fine too.

P.S. The title of this post was originally “Most Americans like big business.” I changed the last word to “businesses” in response to commenters who pointed out that most Americans express negative views about “big business” in general, but they like most individual big businesses that they’re asked about.


Data Science for Decision Makers: A Discussion with Dr Stelios Kampakis

This article contains an interview with veteran data scientist Dr Stylianos (Stelios) Kampakis, in which he discusses his career and how he helps decision makers across a range of businesses understand how data science can benefit them.


The Stages of Relationships, Distributed

Everyone's relationship timeline is a little different. This animation plays out real-life paths to marriage.


Four short links: 26 March 2019

Software Stack, Gig Economy, Simple Over Flexible, and Packet Radio

  1. Thoughts on Conway's Law and the Software Stack (Jessie Frazelle) -- All these problems are not small by any means. They are miscommunications at various layers of the stack. They are people thinking an interface or feature is secure when it is merely a window dressing that can be bypassed with just a bit more knowledge about the stack. I really like the advice Lea Kissner gave: “take the long view, not just the broad view.” We should do this more often when building systems.
  2. Troubles with the Open Source Gig Economy and Sustainability Tip Jar (Chris Aniszczyk) -- thoughtful long essay with a lot of links for background reading, on the challenges of sustainability via Patreon, etc., through to some signs of possibly-working models.
  3. Choose Simple Solutions Over Flexible Ones -- flexibility does not come for free.
  4. New Packet Radio (Hackaday) -- a custom radio protocol, designed to transport bidirectional IP traffic over 430MHz radio links (ham radio). This protocol is optimized for "point to multipoint" topology, with the help of managed-TDMA. Note that Hacker News commenters indicate some possible FCC violations; though, as the project comes from France, that's probably not a problem for the creators of the software.


What’s new on arXiv

Knowledge Graph Development for App Store Data Modeling

These days the usage of mobile applications has become a normal part of our lives, since every day we use our smartphones for communication, entertainment, business and even education. High demand for various apps has led to significant growth in supply. The large number of apps offered, in turn, has made it harder for users to find the one application that suits them best. In this paper the authors make an attempt to solve the problem of facilitating search in app stores. With the help of website-crawling software, a sample of data was retrieved from one of the well-known mobile app stores and divided into 11 groups by type. Afterwards these groups of data were used to construct a Knowledge Schema – a graphic model of the interconnections of the data that characterize any mobile app in the selected store. This Schema creation is the first step in the process of developing a Knowledge Graph that will perform application grouping to facilitate users’ search in app stores.

AttoNets: Compact and Efficient Deep Neural Networks for the Edge via Human-Machine Collaborative Design

While deep neural networks have achieved state-of-the-art performance across a large number of complex tasks, it remains a big challenge to deploy such networks for practical, on-device edge scenarios such as on mobile devices, consumer devices, drones, and vehicles. In this study, we take a deeper look at a human-machine collaborative design approach for creating highly efficient deep neural networks through a synergy between principled network design prototyping and machine-driven design exploration. The efficacy of human-machine collaborative design is demonstrated through the creation of AttoNets, a family of highly efficient deep neural networks for on-device edge deep learning. Each AttoNet possesses a human-specified network-level macro-architecture comprising custom modules with unique machine-designed module-level macro-architecture and micro-architecture designs, all driven by human-specified design requirements. Experimental results for the task of object recognition showed that the AttoNets created via human-machine collaborative design have significantly fewer parameters and computational costs than state-of-the-art networks designed for efficiency while achieving noticeably higher accuracy (with the smallest AttoNet achieving ~1.8% higher accuracy while requiring ~10x fewer multiply-add operations and parameters than MobileNet-V1). Furthermore, the efficacy of the AttoNets is demonstrated for the task of instance-level object segmentation and object detection, where an AttoNet-based Mask R-CNN network was constructed with significantly fewer parameters and computational costs (~5x fewer multiply-add operations and ~2x fewer parameters) than a ResNet-50 based Mask R-CNN network.

Adaptive Strategies For Efficient Model Reduction In High-Dimensional Inverse Problems

This work explores a novel approach for adaptive, differentiable parametrization of large-scale non-stationary random fields. Coupled with any gradient-based algorithm, the method can be applied to a variety of optimization problems, including history matching. The developed technique is based on principal component analysis (PCA) but, in contrast to other PCA-based methods, allows the parametrization process to be adjusted according to the behaviour of the objective function.

QATM: Quality-Aware Template Matching For Deep Learning

Finding a template in a search image is one of the core problems in many computer vision tasks, such as image-to-GPS verification. We propose a novel quality-aware template matching method, QATM, which can be used not only as a standalone template matching algorithm, but also as a trainable layer that can be easily embedded into any deep neural network. Specifically, we assess the quality of a matching pair using soft-ranking among all matching pairs, so that different matching scenarios such as 1-to-1, 1-to-many, and many-to-many are reflected in different values. Our extensive evaluation on classic template matching benchmarks and deep learning tasks demonstrates the effectiveness of QATM. It not only outperforms state-of-the-art template matching methods when used alone, but also largely improves existing deep network solutions.

Graph Convolutional Label Noise Cleaner: Train a Plug-and-play Action Classifier for Anomaly Detection

Video anomaly detection under weak labels is formulated as a typical multiple-instance learning problem in previous works. In this paper, we provide a new perspective, i.e., a supervised learning task under noisy labels. In such a viewpoint, once the label noise is cleaned away, we can directly apply fully supervised action classifiers to weakly supervised anomaly detection, and take maximum advantage of these well-developed classifiers. For this purpose, we devise a graph convolutional network to correct noisy labels. Based upon feature similarity and temporal consistency, our network propagates supervisory signals from high-confidence snippets to low-confidence ones. In this manner, the network is capable of providing cleaned supervision for action classifiers. During the test phase, we only need to obtain snippet-wise predictions from the action classifier without any extra post-processing. Extensive experiments on 3 datasets at different scales with 2 types of action classifiers demonstrate the efficacy of our method. Remarkably, we obtain a frame-level AUC score of 82.12% on UCF-Crime.

A novel quantum grid search algorithm and its application

In this paper we present a novel quantum algorithm, namely the quantum grid search algorithm, to solve a special search problem. Suppose k non-empty buckets are given, such that each bucket contains some marked and some unmarked items. In one trial an item is selected from each of the k buckets. If every selected item is a marked item, then the search is considered successful. This search problem can also be formulated as the problem of finding a ‘marked path’ associated with specified bounds on a discrete grid. Our algorithm essentially uses several Grover search operators in parallel to efficiently solve such problems. We also present an extension of our algorithm combined with a binary search algorithm in order to efficiently solve global trajectory optimization problems. Estimates of the expected run times of the algorithms are also presented, and it is proved that our proposed algorithms offer exponential improvement over pure classical search algorithms, while a traditional Grover’s search algorithm offers only a quadratic speedup. We note that this gain comes at the cost of increased complexity of the quantum circuitry. The implication of such exponential gains in performance is that many high dimensional optimization problems, which are intractable for classical computers, can be efficiently solved by our proposed quantum grid search algorithm.

Advanced Capsule Networks via Context Awareness

Capsule Networks (CN) offer a new architecture for the Deep Learning (DL) community. Though they have demonstrated effectiveness on the MNIST and smallNORB datasets, the networks still face many challenges on other datasets with different levels of background. In this research, we improve the design of CN (the vector version) and perform experiments to compare the accuracy and speed of CN versus DL models. In CN, we resort to more pooling layers to filter input images and extend the reconstruction layers for better image restoration. In the DL models, we utilize Inception V3 and DenseNet V201 for demanding computers, besides NASNet, MobileNet V1 and MobileNet V2 for small and embedded devices. We evaluate our models on a fingerspelling alphabet dataset from American Sign Language (ASL). The results show that CNs perform comparably to DL models while dramatically reducing training time. We also make a demonstration for the purpose of illustration.

An Effective Label Noise Model for DNN Text Classification

Because large, human-annotated datasets suffer from labeling errors, it is crucial to be able to train deep neural networks in the presence of label noise. While training image classification models with label noise has received much attention, training text classification models has not. In this paper, we propose an approach to training deep networks that is robust to label noise. This approach introduces a non-linear processing layer (noise model) that models the statistics of the label noise into a convolutional neural network (CNN) architecture. The noise model and the CNN weights are learned jointly from noisy training data, which prevents the model from overfitting to erroneous labels. Through extensive experiments on several text classification datasets, we show that this approach enables the CNN to learn better sentence representations and is robust even to extreme label noise. We find that proper initialization and regularization of this noise model is critical. Further, in contrast to results focusing on large batch sizes for mitigating label noise in image classification, we find that altering the batch size does not have much effect on classification performance.

A Comparison of Prediction Algorithms and Nexting for Short Term Weather Forecasts

This report first provides a brief overview of a number of supervised learning algorithms for regression tasks. Among those are neural networks, regression trees, and the recently introduced Nexting. Nexting has been presented in the context of reinforcement learning, where it was used to predict a large number of signals at different timescales. In the second half of this report, we apply the algorithms to historical weather data in order to evaluate their suitability for forecasting a local weather trend. Our experiments did not identify one clearly preferable method, but rather show that choosing an appropriate algorithm depends on the available side information. For slowly varying signals and a sufficient number of training samples, Nexting achieved good results in the studied cases.

Extrapolating paths with graph neural networks

We consider the problem of path inference: given a path prefix, i.e., a partially observed sequence of nodes in a graph, we want to predict which nodes are in the missing suffix. In particular, we focus on natural paths occurring as a by-product of the interaction of an agent with a network—a driver on the transportation network, an information seeker in Wikipedia, or a client in an online shop. Our interest is sparked by the realization that, in contrast to shortest-path problems, natural paths are usually not optimal in any graph-theoretic sense, but might still follow predictable patterns. Our main contribution is a graph neural network called Gretel. Conditioned on a path prefix, this network can efficiently extrapolate path suffixes, evaluate path likelihood, and sample from the future path distribution. Our experiments with GPS traces on a road network and user-navigation paths in Wikipedia confirm that Gretel is able to adapt to graphs with very different properties, while also comparing favorably to previous solutions.

LYRICS: a General Interface Layer to Integrate AI and Deep Learning

In spite of the amazing results obtained by deep learning in many applications, a real intelligent behavior of an agent acting in a complex environment is likely to require some kind of higher-level symbolic inference. Therefore, there is a clear need for the definition of a general and tight integration between low-level tasks, processing sensorial data that can be effectively elaborated using deep learning techniques, and the logic reasoning that allows humans to take decisions in complex environments. This paper presents LYRICS, a generic interface layer for AI, which is implemented in TensorFlow (TF). LYRICS provides an input language that allows arbitrary First Order Logic (FOL) background knowledge to be defined. The predicates and functions of the FOL knowledge can be bound to any TF computational graph, and the formulas are converted into a set of real-valued constraints, which participate in the overall optimization problem. This makes it possible to learn the weights of the learners under the constraints imposed by the prior knowledge. The framework is extremely general as it imposes no restrictions in terms of which models or knowledge can be integrated. In this paper, we show the generality of the approach by presenting some use cases of the language, including generative models, logic reasoning, model checking and supervised learning.

Short Datathon for the Interdisciplinary Development of Data Analysis and Visualization Skills

Understanding the major fraud problems in the world and interpreting the data available for analysis is a current challenge that requires interdisciplinary knowledge to complement the knowledge of computer professionals. Collaborative events (called Hackathons, Datathons, Codefests, Hack Days, etc.) have become relevant in several fields. Examples of fields which are explored in these events include startup development, open civic innovation, corporate innovation, and social issues. These events have features that favor knowledge exchange to solve challenges. In this paper, we present an event format called Short Datathon, a Hackathon for the development of exploratory data analysis and visualization skills. Our goal is to evaluate if participating in a Short Datathon can help participants learn basic data analysis and visualization concepts. We evaluated the Short Datathon in two case studies, with a total of 20 participants, carried out at the Federal University of Technology – Paraná. In both case studies we addressed the issue of tax evasion using real world data. We describe, as a result of this work, the qualitative aspects of the case studies and the perception of the participants obtained through questionnaires. Participants stated that the event helped them understand more about data analysis and visualization and that the experience with people from other areas during the event made data analysis more efficient. Further studies are necessary to evolve the format of the event and to evaluate its effectiveness.

MediaRank: Computational Ranking of Online News Sources

In the recent political climate, the topic of news quality has drawn attention both from the public and the academic communities. The growing distrust of traditional news media makes it harder to find a common base of accepted truth. In this work, we design and build MediaRank, a fully automated system to rank over 50,000 online news sources around the world. MediaRank collects and analyzes one million news webpages and two million related tweets every day. We base our algorithmic analysis on four properties journalists have established to be associated with reporting quality: peer reputation, reporting bias / breadth, bottomline financial pressure, and popularity. The major contributions of this paper include: (i) Open, interpretable quality rankings for over 50,000 of the world’s major news sources. Our rankings are validated against 35 published news rankings, including French, German, Russian, and Spanish language sources. MediaRank scores correlate positively with 34 of 35 of these expert rankings. (ii) New computational methods for measuring influence and bottomline pressure. To the best of our knowledge, we are the first to study the large-scale news reporting citation graph in depth. We also propose new ways to measure the aggressiveness of advertisements and identify social bots, establishing a connection between both types of bad behavior. (iii) Analysis of the effect of media source bias and significance. We show that news sources cite others despite different political views, in accord with quality measures. However, in four English-speaking countries (US, UK, Canada, and Australia), the highest ranking sources all disproportionately favor left-wing parties, even when the majority of news sources exhibited conservative slants.

Combining Model and Parameter Uncertainty in Bayesian Neural Networks

Bayesian neural networks (BNNs) have recently regained a significant amount of attention in the deep learning community due to the development of scalable approximate Bayesian inference techniques. There are several advantages of using a Bayesian approach: parameter and prediction uncertainty become easily available, facilitating rigorous statistical analysis. Furthermore, prior knowledge can be incorporated. However, so far there have been no scalable techniques capable of combining both model (structural) and parameter uncertainty. In this paper we introduce the concept of model uncertainty in BNNs and hence make inference in the joint space of models and parameters. Moreover, we suggest an adaptation of a scalable variational inference approach with reparametrization of marginal inclusion probabilities to incorporate the model space constraints. Finally, we show that incorporating model uncertainty via Bayesian model averaging and Bayesian model selection allows us to drastically sparsify the structure of BNNs without significant loss of predictive power.

Multi-Differential Fairness Auditor for Black Box Classifiers

Machine learning algorithms are increasingly involved in sensitive decision-making processes with adversarial implications for individuals. This paper presents mdfa, an approach that identifies the characteristics of the victims of a classifier’s discrimination. We measure discrimination as a violation of multi-differential fairness. Multi-differential fairness is a guarantee that a black box classifier’s outcomes do not leak information on the sensitive attributes of a small group of individuals. We reduce the problem of identifying worst-case violations to matching distributions and predicting where sensitive attributes and classifier’s outcomes coincide. We apply mdfa to a recidivism risk assessment classifier and demonstrate that individuals identified as African-American with little criminal history are three times more likely to be considered at high risk of violent recidivism than similar individuals who are not African-American.


Visualizing the 80/20 rule, with the bar-density plot

Through Twitter, Danny H. submitted the following chart, which shows that a tiny 0.3 percent of Youtube creators generate almost 40 percent of all viewing on the platform. He asks for ideas about how to present lop-sided data that follow the "80/20" rule.


In the classic 80/20 rule, 20 percent of the units account for 80 percent of the data. The percentages vary, so long as the first number is small relative to the second. In the Youtube example, 0.3 percent is compared to 40 percent. The underlying reason for such lop-sidedness is the differential importance of the units. The top units are much more important than the bottom units, as measured by their contribution to the data.
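This kind of lop-sidedness is easy to reproduce with a quick simulation. The sketch below (in Python, purely for illustration) draws synthetic "views per creator" from a heavy-tailed Pareto distribution and computes the share of all views captured by the top 0.3 percent of creators; the distribution and its parameter are my assumptions, not real YouTube data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed "views per creator" sample (Pareto draws);
# the real YouTube figures are not public, so this is purely illustrative.
views = rng.pareto(a=1.1, size=100_000)

# Sort creators from most- to least-viewed and accumulate their share of views.
views_sorted = np.sort(views)[::-1]
cum_share = np.cumsum(views_sorted) / views_sorted.sum()

# Share of all views captured by the top 0.3% of creators.
top = int(0.003 * len(views))
print(f"top 0.3% of creators -> {cum_share[top - 1]:.0%} of views")
```

With a tail index close to 1, a few hundred creators out of 100,000 capture a strikingly large fraction of the total, which is exactly the pattern the chart is trying to convey.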

I sense a bit of "loss aversion" on this chart (explained here). The designer color-coded the views data into blue, brown and gray but didn't have it in him/her to throw out the sub-categories, which slow down cognition and add hardly anything to our understanding.

I like the chart title that explains what it is about.

Turning to the D corner of the Trifecta Checkup for a moment, I suspect that this chart only counts videos that have at least one play. (Zero-play videos do not show up in a play log.) For a site like Youtube, a large proportion of uploaded videos have no views and thus, many creators also have no views.


My initial reaction on Twitter was to use a mirrored bar chart, like this:


I ended up spending quite a bit of time exploring other concepts. In particular, I wanted to find an integrated way to present this information. Most charts, such as the mirrored bar chart, the Bumps chart (slopegraph), and the Lorenz chart, keep the two series of percentages separate.

Also, the biggest bar (the gray bar showing 97% of all creators) highlights the least important Youtubers while the top creators ("super-creators") are cramped inside a sliver of a bar, which is invisible in the original chart.

What I came up with is a bar-density plot, where I use density to encode the importance of creators, and bar lengths to encode the distribution of views.


Each bar is divided into pieces, with the number of pieces proportional to the number of creators in each segment. This has the happy result that the super-creators are represented by large (red) pieces while the least important creators by little (gray) pieces.

The embedded tessellation shows the structure of the data: the bottom third of the views are generated by a huge number of creators, producing a few views each - resulting in a high density. The top 38% of the views correspond to a small number of super-creators - appropriately shown by a bar of low density.

For those interested in technicalities, I embed a Voronoi diagram inside each bar, with randomly placed points. (There will be a companion post later this week with some more details, and R code.)

Here is what the bar-density plot looks like when the distribution is essentially uniform:

The density inside each bar is roughly the same, indicating that the creators are roughly equally important.


P.S. The next post on the bar-density plot, with some experimental R code, will be available here.







Bar-density and pie-density plots for showing relative proportions

In my last post, I described a bar-density chart to show paired data of proportions with an 80/20-type rule. The following example illustrates that a small proportion of Youtubers generate a large proportion of views.


Other examples of this type of data include:

  • the top 10% of families own 75% of U.S. household wealth (link)
  • the top 1% of artists earn 77% of recorded music income (link)
  • Five percent of AT&T customers consume 46% of the bandwidth (link)

In all these examples, the message of the data is the importance of a small number of people (top earners, superstars, bandwidth hogs). A good visual should call out this message.

The bar-density plot consists of two components:

  • the bar chart, which shows the distribution of the data (views, wealth, income, bandwidth) among segments of people;
  • the embedded Voronoi diagram within each bar, which encodes the relative importance of each people segment, as measured by the (inverse) density of the population within these segments - a people segment is more important if each individual accounts for more of the data, or in other words, if the density of people within the group is lower.

The bar chart can adopt a more conventional horizontal layout.


Voronoi tessellation

To understand the Voronoi diagram, think of a fixed number (say, 100) of randomly placed points inside a bar. Any point inside the bar area then has a nearest neighbor among those 100 fixed points. Assign every point in the bar to its nearest neighbor; from this, one can draw a boundary around each of the 100 points to enclose all of its nearest neighbors. The resulting tessellation is the Voronoi diagram. (The following illustration comes from this AMS column.)


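The nearest-neighbor rule behind the tessellation can be sketched in a few lines. The snippet below (Python rather than R, just to show the assignment step; the seed count and coordinates are arbitrary) labels every point of a fine grid with its nearest seed, and those label regions are precisely the Voronoi cells.

```python
import numpy as np

rng = np.random.default_rng(1)

# A handful of fixed "seed" points inside a 1x1 bar.
seeds = rng.random((5, 2))

# A grid of query points covering the bar.
xs, ys = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid = np.column_stack([xs.ravel(), ys.ravel()])

# Assign each grid point to its nearest seed; the resulting label
# regions are the cells of the Voronoi tessellation.
dists = np.linalg.norm(grid[:, None, :] - seeds[None, :, :], axis=2)
labels = dists.argmin(axis=1)

print(np.bincount(labels))  # approximate cell areas, in grid points
```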

The density of points in the respective bars encodes the relative proportions of people within those groups. For my example, I placed 6 points in the red bar, 66 points in the yellow bar, and 1,930 points in the gray bar, which precisely represents the relative proportions of creators in the three segments.
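The arithmetic behind these counts can be made explicit. The sketch below (Python, for illustration only) restates the per-mille splits from the R vectors nc = c(3, 33, 965) and wide = c(378, 276, 346) given in the code later in this post; the scale factors are arbitrary layout choices, not part of the method.

```python
# Per-mille split of creators and views across the red/yellow/gray
# segments, as implied by the post's R vectors nc and wide.
creators_permille = [3, 33, 965]
views_permille = [378, 276, 346]

# Point counts are proportional to creator shares (x2 scaling);
# bar widths are proportional to view shares (/2 scaling).
n_points = [c * 2 for c in creators_permille]   # [6, 66, 1930]
bar_width = [v / 2 for v in views_permille]

# Density ~ points per unit of bar width: very low for the red
# (super-creator) bar, very high for the gray bar.
density = [n / w for n, w in zip(n_points, bar_width)]
print(n_points, [round(d, 2) for d in density])
```

The red bar ends up with by far the lowest density, which is what visually signals the outsized importance of each super-creator.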

Density is represented statistically

Notice that the density is represented statistically, not empirically. According to the annotation on the original chart, the red bar represents 14,000 super-creators. Correspondingly, there are 4.5 million creators in the gray bar. Any attempt to plot those as individual pieces will result in a much less impactful graphic. If the representation is interpreted statistically, as relative densities within each people segment, the message of relative importance of the units within each group is appropriately conveyed.

A more sophisticated way of deciding how many points to place in the red bar remains to be developed. Here, I just used the convenient number of 6.

The color shades are randomly applied to the tessellation pieces, and used to facilitate reading of densities.


In this section, I provide R code for those who want to explore this some more. This is prototyping code, and you're welcome to improve it. The general strategy is as follows:

  • Set the rectangular area (bar) in which the Voronoi diagram is to be embedded. The length of the bar is set to the proportion of views, appropriately scaled. The code uses the dirichlet function from the spatstat package to generate the tessellation; this requires setting up an owin object representing the rectangle.
  • Set the number of points (n) to be embedded in the bar, determined by the relative proportion of creators, appropriately scaled. Generate a data frame containing the x-y coordinates of n randomly placed points within the rectangle defined above.
  • Use the as.ppp function to convert the points into a point pattern, then dirichlet to compute the Voronoi tessellation.
  • Set up a colourmap for plotting the Voronoi diagram.
  • Plot the Voronoi diagram, assigning shades at random to the pieces (in production code, these random numbers should be set as marks in the ppp object, but it's easier to play around with the shades if they are placed here).

The code generates separate charts for each bar segment. A post-processing step is currently required to align the bars to attain equal height. I haven't figured out whether the multiplot option helps here.


# requires the spatstat package for owin, ppp, dirichlet and colourmap
library(spatstat)

# enter the scaled proportions of creators and views
# the Youtube example has three creator segments

# number of randomly generated points should be proportional to proportion of creators. Multiply nc by a scaling factor if desired

nc = c(3, 33, 965)*2

# bar widths should be proportional to proportion of views
# total width should be set based on the width of your page

wide = c(378, 276, 346)/2

# set bar height, to attain a particular aspect ratio
bar_h = 50

# define function to generate points
# defines rectangular window

makepoints = function(n, wide, height) {
    df <- data.frame(x = runif(n, 0, wide), y = runif(n, 0, height))
    W <- owin(c(0, wide), c(0, height)) # rectangular window
    pp1 <- as.ppp(df, W)
    y <- dirichlet(pp1) # Voronoi (Dirichlet) tessellation
    # y$marks <- sample(0:wide, n, replace=T) # marks are for colors
    return(y)
}

y_red = makepoints(nc[1], wide[1], bar_h) # height of each bar fixed
y_yel = makepoints(nc[2], wide[2], bar_h)
y_gry = makepoints(nc[3], wide[3], bar_h)

# setting colors (4 shades per bar, one color per bar)

cr_red = colourmap(c("lightsalmon","lightsalmon2", "lightsalmon4", "brown"), breaks=round(seq(0, wide[1],length.out=5)))

cr_yel = colourmap(c("burlywood1", "burlywood2", "burlywood3", "burlywood4"), breaks=round(seq(0, wide[2],length.out=5)))

cr_gry = colourmap(c("gray80", "gray60", "gray40", "gray20"), breaks=round(seq(0, wide[3],length.out=5)))

# plotting


# add png to save image to png

# remove values= if colors set in ppp

plot(y_red, main="", border="pink3", do.col=T, values = sample(0:wide[1], nc[1], replace=T), col=cr_red, xlim=c(0, wide[1]), ylim=c(0,bar_h), ribbon=F)

plot(y_yel, main="", border="darkgoldenrod4", do.col=T, values=sample(0:wide[2], nc[2], replace=T), col=cr_yel, xlim=c(0, wide[2]), ylim=c(0,bar_h), ribbon=F)

plot(y_gry, main="", border="darkgray", do.col=T, values=sample(0:wide[3], nc[3], replace=T), col=cr_gry, xlim=c(0, wide[3]), ylim=c(0,bar_h), ribbon=F)

# because of random points, the tessellation looks different each time
# post-processing: make each bar the same height when aligned side by side


A cousin of the bar-density plot is the pie-density plot. Since I'm using only three creator segments, each of which accounts for about 30-40% of the total views, it is natural to use a pie chart. In this case, we embed the Voronoi diagrams into the pie sectors.


If the distribution were more even, that is, if the creators were more or less equally important, the pie-density plot would look like this:



Something more like 80/20

The original chart shows the top 0.3 percent generating almost 40 percent of the views. A more typical insight is that the top X percent generates 80 percent of the data. For the YouTube data, X is 11 percent. What does the pie-density chart look like if the top 11 percent <-> 80 percent, the middle 33 percent <-> 11 percent, and the bottom 56 percent <-> 8 percent?


Roughly speaking, the second segment includes 3 times as many people as the largest, and the third about 5 times as many.
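Those ratios follow directly from the segment sizes quoted above (11, 33 and 56 percent of creators); a quick arithmetic check:

```python
# Segment sizes from the text, as percent of creators:
# top 11, middle 33, bottom 56
top, middle, bottom = 11, 33, 56

print(middle / top)          # 3.0 -- the middle segment has 3x the people of the top
print(round(bottom / top))   # 5   -- the bottom has roughly 5x
```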










Continue Reading…


Read More

R Packages worth a look

‘Tabler’ API for ‘Shiny’ (tablerDash)
‘R’ interface to the ‘Tabler’ HTML template. See more here <>. ‘tablerDash’ is …

Superfast Likelihood Inference for Stationary Gaussian Time Series (SuperGauss)
Likelihood evaluations for stationary Gaussian time series are typically obtained via the Durbin-Levinson algorithm, which scales as O(n^2) in the numb …

Design and Analyze Studies using the Sequential Parallel Comparison Design (SPCDAnalyze)
Programs to find the sample size or power of studies using the Sequential Parallel Comparison Design (SPCD) and programs to analyze such studies. This …

Tools for Acquiring and Analyzing Political Data (politicaldata)
Provides useful functions to obtain commonly-used data in political analysis and political science, including from sources such as the Comparative …

Temporal Contributions on Trends using Mixed Models (TempCont)
Method to estimate the effect of the trend in predictor variables on the observed trend of the response variable using mixed models with temporal autoc …

Dynamic Functional Connectivity Analysis (dfConn)
An implementation of multivariate linear process bootstrap (MLPB) method and sliding window technique to assess the dynamic functional connectivity (dF …

Continue Reading…


Read More

Use RStudio Server in a Virtual Environment with Docker in Minutes!

(This article was first published on r-bloggers – Telethon Kids Institute, and kindly contributed to R-bloggers)

A fundamental aspect of the reproducible research framework is that (statistical) analysis can be reproduced; that is, given a set of instructions (or a script file) the exact results can be achieved by another analyst with the same raw data. This idea may seem intuitive, but in practice it can be difficult to achieve in an analytical environment that is always evolving and changing.

For example, Tidyverse users will have recently seen the following warning in R:

data_frame(a = 1:3)

Warning message:
'data_frame()' is deprecated, use 'tibble()'.

This is because dplyr has changed, hopefully for the better, and tibble() is now the preferred route. It is conceivable that in a future release the data_frame() function will no longer be a part of dplyr at all.

So, what does this have to do with reproducible research? Say you want to come back to your analysis in 5 years and re-run your old code. Your future installation of R with the latest Tidyverse may not be backwards compatible with your now-ancient code, and the analysis will crash.

There are a couple of strategies that we could use to deal with updating dependencies. The first, and possibly the easiest, is to use devtools to install older package versions (see here for further details). But what if the version of the package you used is not archived on CRAN? For example, the analysis could have used a package from GitHub that has since changed and the maintainer hasn’t used systematic releases for you to pull from. Thus, this strategy is quite likely to fail.

Another solution is to put your entire project inside a static virtual environment that is isolated from the rest of your machine. The benefit of project isolation is that any dependencies that are installed within the environment can be made persistent, but are also separated from any dependency upgrades that might be applied to your host machine or to other projects.

Docker isn’t a new topic for regular R-Bloggers readers, but for those of you that are unfamiliar: Docker is a program that uses virtual containers, which isolate and bundle applications. The containers are defined by a set of instructions detailed in a docker-compose.yml or Dockerfile and can effectively be stacked to call upon the capabilities of base images. Furthermore, one of the fundamental tenets of Docker is portability – if a container will work on your local machine then it will work anywhere. And because base images are versioned, they will work for all time. This is also great for scalability onto servers and distribution to colleagues and collaborators who will see exactly what you have prepared regardless of their host operating system.

Our dockerised RStudio environment.

Our group at the Telethon Kids Institute has prepared source code to launch a dockerised RStudio Server instance via a NGINX reverse proxy enabled for HTTPS connections. Our GitHub repository provides everything you need to quickly establish a new project (just clone the repository) with the features of RStudio Web Server with the added benefit of SSL encryption (this will need some local configuration); even though RStudio Server is password protected, web encryption is still important for us as we routinely deal with individual level health data.

(Important note, even with data security measures in place, our policy is for patient data to be de-identified prior to analysis as per good clinical practice norms and, except for exceptional circumstances, we would use our intranet to further restrict access).

The defining docker-compose.yml that we use can be found at our GitHub site. We used an RStudio base image with the Tidyverse pre-installed (R 3.5.3), which is maintained by rocker. As updates are made to the base image, the repository will be updated with new releases that provide an opportunity to re-install a specific version of the virtual environment. It has also been set up with persistent volumes for the projects, rstudio, and lib/R directories, to keep any changes made to the virtual environment for back-up and further version control.
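To give a feel for what such a setup looks like, here is a minimal, hypothetical docker-compose.yml along these lines. The service name, volume paths and password are illustrative placeholders; the actual file in the repository additionally wires in the NGINX reverse proxy and SSL configuration described above.

```yaml
version: "3"

services:
  rstudio:
    # Versioned rocker image: R 3.5.3 with the Tidyverse pre-installed
    image: rocker/tidyverse:3.5.3
    ports:
      - "8787:8787"            # RStudio Server's default port
    environment:
      - PASSWORD=change-me     # RStudio login password (placeholder)
    volumes:
      # Persistent volumes so projects and installed packages survive rebuilds
      - ./projects:/home/rstudio/projects
      - r-lib:/usr/local/lib/R/site-library

volumes:
  r-lib:
```

Because the image tag is pinned, rebuilding the container years later reproduces the same R and package environment.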

By combining tools like Docker and Git we believe we can refine and make commonplace, within our institute and among those we collaborate with, a culture of reproducible research as we conduct world-class research to improve the lives of sick children.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – Telethon Kids Institute. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

If you did not already know

Manifold Adversarial Training (MAT) google
The recently proposed adversarial training methods show the robustness to both adversarial and original examples and achieve state-of-the-art results in supervised and semi-supervised learning. All the existing adversarial training methods consider only how the worst perturbed examples (i.e., adversarial examples) could affect the model output. Despite their success, we argue that such setting may be in lack of generalization, since the output space (or label space) is apparently less informative. In this paper, we propose a novel method, called Manifold Adversarial Training (MAT). MAT manages to build an adversarial framework based on how the worst perturbation could affect the distributional manifold rather than the output space. Particularly, a latent data space with the Gaussian Mixture Model (GMM) will be first derived. On one hand, MAT tries to perturb the input samples in the way that would rough the distributional manifold the worst. On the other hand, the deep learning model is trained trying to promote in the latent space the manifold smoothness, measured by the variation of Gaussian mixtures (given the local perturbation around the data point). Importantly, since the latent space is more informative than the output space, the proposed MAT can learn better a robust and compact data representation, leading to further performance improvement. The proposed MAT is important in that it can be considered as a superset of one recently-proposed discriminative feature learning approach called center loss. We conducted a series of experiments in both supervised and semi-supervised learning on three benchmark data sets, showing that the proposed MAT can achieve remarkable performance, much better than those of the state-of-the-art adversarial approaches. …

RMSProp+AF google
Source localization is of pivotal importance in several areas such as wireless sensor networks and Internet of Things (IoT), where the location information can be used for a variety of purposes, e.g. surveillance, monitoring, tracking, etc. Time Difference of Arrival (TDOA) is one of the well-known localization approaches where the source broadcasts a signal and a number of receivers record the arriving time of the transmitted signal. By means of computing the time difference from various receivers, the source location can be estimated. On the other hand, in the recent few years novel optimization algorithms have appeared in the literature for $(i)$ processing big data and for $(ii)$ training deep neural networks. Most of these techniques are enhanced variants of the classical stochastic gradient descent (SGD) but with additional features that promote faster convergence. In this paper, we compare the performance of the classical SGD with the novel techniques mentioned above. In addition, we propose an optimization procedure called RMSProp+AF, which is based on RMSProp algorithm but with the advantage of incorporating adaptation of the decaying factor. We show through simulations that all of these techniques—which are commonly used in the machine learning domain—can also be successfully applied to signal processing problems and are capable of attaining improved convergence and stability. Finally, it is also shown through simulations that the proposed method can outperform other competing approaches as both its convergence and stability are superior. …

Cross-Domain Latent Feature Mapping (CDLFM) google
Collaborative Filtering (CF) is a widely adopted technique in recommender systems. Traditional CF models mainly focus on predicting a user’s preference to the items in a single domain such as the movie domain or the music domain. A major challenge for such models is the data sparsity problem, and especially, CF cannot make accurate predictions for the cold-start users who have no ratings at all. Although Cross-Domain Collaborative Filtering (CDCF) is proposed for effectively transferring users’ rating preference across different domains, it is still difficult for existing CDCF models to tackle the cold-start users in the target domain due to the extreme data sparsity. In this paper, we propose a Cross-Domain Latent Feature Mapping (CDLFM) model for cold-start users in the target domain. Firstly, in order to better characterize users in sparse domains, we take the users’ similarity relationship on rating behaviors into consideration and propose the Matrix Factorization by incorporating User Similarities (MFUS) in which three similarity measures are proposed. Next, to perform knowledge transfer across domains, we propose a neighborhood based gradient boosting trees method to learn the cross-domain user latent feature mapping function. For each cold-start user, we learn his/her feature mapping function based on the latent feature pairs of those linked users who have similar rating behaviors with the cold-start user in the auxiliary domain. And the preference of the cold-start user in the target domain can be predicted based on the mapping function and his/her latent features in the auxiliary domain. Experimental results on two real data sets extracted from Amazon transaction data demonstrate the superiority of our proposed model against other state-of-the-art methods. …

Continue Reading…


Read More

Magister Dixit

“There are two kinds of people who violate the rules of statistical inference: people who don’t know them and people who don’t agree with them.” Allen Downey (December 8, 2015)

Continue Reading…


Read More

March 25, 2019

Book Memo: “Asymptotic Nonparametric Statistical Analysis of Stationary Time Series”

Stationarity is a very general, qualitative assumption that can be assessed on the basis of application specifics. It is thus a rather attractive assumption to base statistical analysis on, especially for problems where less general qualitative assumptions, such as independence or finite memory, clearly fail. However, it has long been considered too general to allow statistical inference. One of the reasons for this is that rates of convergence, even of frequencies to the mean, are not available under this assumption alone. Recently, it has been shown that, while some natural and simple problems, such as homogeneity, are indeed provably impossible to solve if one only assumes that the data is stationary (or stationary ergodic), many others can be solved with rather simple and intuitive algorithms. The latter include clustering and change point estimation, among others. In this volume I summarize these results. The emphasis is on asymptotic consistency, since this is the strongest property one can obtain assuming stationarity alone. While for most of the problems for which a solution is found that solution is algorithmically realizable, the main objective in this area of research, an objective which is only partially attained, is to understand what is possible and what is not possible to do for stationary time series. The considered problems include homogeneity testing (the so-called two-sample problem), clustering with respect to distribution, clustering with respect to independence, change point estimation, identity testing, and the general problem of composite hypothesis testing. For the latter problem, a topological criterion for the existence of a consistent test is presented. In addition, a number of open problems are presented.

Continue Reading…


Read More

Distilled News

The 3 Best Optimization Methods in Neural Networks

Deep learning is an iterative process. With so many parameters to tune and methods to try, it is important to be able to train models fast, in order to quickly complete the iterative cycle. This is key to increasing the speed and efficiency of a machine learning team. Hence the importance of optimization algorithms such as stochastic gradient descent, mini-batch gradient descent, gradient descent with momentum and the Adam optimizer. These methods make it possible for our neural network to learn. However, some methods perform better than others in terms of speed. Here, you will learn about the best alternatives to stochastic gradient descent, and we will implement each method to see how fast a neural network can learn using each one.

Machine learning interpretability techniques

Most machine learning systems require the ability to explain to stakeholders why certain predictions are made. When choosing a suitable machine learning model, we often think in terms of the accuracy vs. interpretability trade-off:
• accurate and ‘black-box’:
Black-box models such as neural networks, gradient boosting models or complicated ensembles often provide great accuracy. The inner workings of these models are harder to understand and they don’t provide an estimate of the importance of each feature on the model predictions, nor is it easy to understand how the different features interact.
• weaker and ‘white-box’:
Simpler models such as linear regression and decision trees on the other hand provide less predictive capacity and are not always capable of modelling the inherent complexity of the dataset (i.e. feature interactions). They are however significantly easier to explain and interpret.

Statistical Tests for Comparing Machine Learning and Baseline Performance

When comparing a machine learning approach with the current solution, I wish to understand whether any observed difference is statistically significant; that is, unlikely to be simply due to chance or noise in the data. The appropriate test of statistical significance varies depending on what your machine learning model is predicting, the distribution of your data, and whether or not you're comparing predictions on the same subjects. This post highlights common tests and where they are suitable.

Loss functions based on feature activation and style loss.

Loss functions using these techniques can be used during the training of U-Net based model architectures, and could be applied to the training of other Convolutional Neural Networks that generate an image as their prediction/output. I've separated this out from my article on Super Resolution (https://…solution-without-using-a-gan-11c9bb5b6cd5) to be more generic, as I am using similar loss functions on other U-Net based models making predictions on image data. Having this separated makes it easier to reference and keeps my other articles easier to understand. This is based on the techniques demonstrated and taught in the Fastai deep learning course. This loss function is partly based upon the research in the paper Losses for Real-Time Style Transfer and Super-Resolution and the improvements shown in the Fastai course (v3). The paper focuses on feature losses (called perceptual loss in the paper). The research did not use a U-Net architecture, as the machine learning community was not aware of them at that time.

Advanced Keras – Accurately Resuming a Training Process

In this post I will present a use case of the Keras API in which resuming a training process from a loaded checkpoint needs to be handled differently than usual.

Automated Machine Learning: Myth Versus Reality

Witnessing the data science field’s meteoric rise in demand across pretty much all industries and areas of scientific research, it’s easy to anticipate efforts to create shortcuts to satisfy the need for more data science practitioners. The current trend of automated machine learning is a great case in point. This article will touch on a number of efforts to circumvent the need for data scientists to select and train machine learning models and determine metrics for measuring their performance.

Regularization techniques for Neural Networks

In our last post, we learned about feedforward neural networks and how to design them. In this post, we will learn how to tackle one of the most central problems in machine learning: how to make our algorithm perform well not only on the training set but also on the testing set. When an algorithm performs well on the training set but poorly on the testing set, it is said to have overfitted the training data. After all, our main goal is to perform well on never-before-seen data, i.e. to reduce overfitting. To tackle this problem we have to make our model generalize beyond the training data, which is done using the various regularization techniques we will learn about in this post.

Harnessing Organizational Knowledge for Machine Learning

One of the biggest bottlenecks in developing machine learning (ML) applications is the need for the large, labeled datasets used to train modern ML models. Creating these datasets involves the investment of significant time and expense, requiring annotators with the right expertise. Moreover, due to the evolution of real-world applications, labeled datasets often need to be thrown out or re-labeled.

Infographic: What’s the Future of the Data Catalog?

The concept of data catalogs is one that's becoming increasingly relevant to businesses. According to the McKinsey Global Institute, data-driven organizations are 19 times more likely to be profitable than businesses that aren't focused on data. They're also 23 times more likely to acquire customers and six times more likely to retain them.

The Evolved Transformer – Enhancing Transformer with Neural Architecture Search

Neural architecture search (NAS) is the process of algorithmically searching for new designs of neural networks. Though researchers have developed sophisticated architectures over the years, the ability to find the most efficient ones is limited, and recently NAS has reached the point where it can outperform human-designed models.

A Beginner’s Guide to Big Data and Blockchain

Over the last few years, blockchain has been one of the hottest areas of technology development across industries. It's easy to see why: there seems to be no end to the ways that forward-thinking businesses are finding to adapt the technology to a variety of use cases and applications. Much of the development, however, has come from one of two places: deep-pocketed corporations and crypto-startups. That means the latest in blockchain technology is out of reach for businesses in the small and midsize enterprise (SME) sector, creating something of a digital divide that seems to be widening every day. But there are a few blockchain projects that promise to democratise the technology for SMEs, and could even do the same for Big Data and analytics. In this blog, we will explore the basics of both big data and blockchain, analyse the advantages of combining the two, look at real-world applications, and wrap up with predictions about the future of blockchain!

Text Preprocessing Techniques

These techniques were used in comparison in our paper ‘A Comparison of Pre-processing Techniques for Twitter Sentiment Analysis’. If you use this material please cite the paper. An extended paper for this work can be found here, with the title ‘A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis’.
0. Remove Unicode Strings and Noise
1. Replace URLs, User Mentions and Hashtags
2. Replace Slang and Abbreviations
3. Replace Contractions
4. Remove Numbers
5. Replace Repetitions of Punctuation
6. Replace Negations with Antonyms
7. Remove Punctuation
8. Handling Capitalized Words
9. Lowercase
10. Remove Stopwords
11. Replace Elongated Words
12. Spelling Correction
13. Part of Speech Tagging
14. Lemmatizing
15. Stemming
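A few of the listed steps are easy to sketch with Python's standard re module. This is a toy illustration, not the code from the paper; the regular expressions and placeholder tokens are simplistic stand-ins (note that the lowercase step, applied in the listed order, also lowercases the placeholders):

```python
import re

def preprocess(tweet: str) -> str:
    """Apply a handful of the listed steps to a single tweet."""
    t = tweet
    t = re.sub(r"https?://\S+", " URL ", t)       # 1. replace URLs
    t = re.sub(r"@\w+", " USER ", t)              # 1. replace user mentions
    t = re.sub(r"#(\w+)", r" \1 ", t)             # 1. replace hashtags
    t = re.sub(r"\d+", " ", t)                    # 4. remove numbers
    t = re.sub(r"([!?.])\1+", r" \1 REPEAT ", t)  # 5. replace repeated punctuation
    t = t.lower()                                 # 9. lowercase
    t = re.sub(r"\s+", " ", t).strip()            # collapse leftover whitespace
    return t

print(preprocess("Wow!!! @bob check https://t.co/xyz #cool 2019"))
# -> "wow ! repeat user check url cool"
```

The remaining steps (negation handling, lemmatizing, stemming, etc.) need dictionaries or an NLP library rather than bare regexes.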

Chatbot – A real game changer in the industry of technologically advanced practices

As of now, chatbots are among the most trending technologies, and industry is excited to integrate them. They are touted as the next iteration of applications, promising an immense change in the communication business. Since Facebook extended access to its Messenger service, firms have been able to reach customers better through various APIs. 'Chatbot' has become the buzzword of the day, and many questions arise about them: What are chatbots? How do they work? How are they developed? Are chatbots a real opportunity for organizations? These questions are discussed here.

Factors Behind Data Storage Security: Is Your Business Vulnerable?

Is your business vulnerable to cybersecurity issues or attacks? Here’s what to know about the driving factors behind data storage security.

Building NLP Classifiers Cheaply With Transfer Learning and Weak Supervision

There is a catch to training state-of-the-art NLP models: their reliance on massive hand-labeled training sets. That’s why data labeling is usually the bottleneck in developing NLP applications and keeping them up-to-date. For example, imagine how much it would cost to pay medical specialists to label thousands of electronic health records. In general, having domain experts label thousands of examples is too expensive.

How to setup the PySpark environment for development, with good software engineering practices

In this article we will discuss how to set up our development environment in order to create good-quality Python code, and how to automate some of the tedious tasks to speed up deployments.
We will go over the following steps:
• setup our dependencies in an isolated virtual environment with pipenv
• how to setup a project structure for multiple jobs
• how to run a pyspark job
• how to use a Makefile to automate development steps
• how to test the quality of our code using flake8
• how to run unit tests for PySpark apps using pytest-spark
• running a test coverage, to see if we have created enough unit tests using pytest-cov

Continue Reading…


Read More

Markov chain Monte Carlo doesn’t “explore the posterior”

First some background, then the bad news, and finally the good news.

Spoiler alert: The bad news is that exploring the posterior is intractable; the good news is that we don’t need to explore all of it.

Sampling to characterize the posterior

There’s a misconception among Markov chain Monte Carlo (MCMC) practitioners that the purpose of sampling is to explore the posterior. For example, I’m writing up some reproducible notes on probability theory and statistics through sampling (in pseudocode with R implementations) and have just come to the point where I’ve introduced and implemented Metropolis and want to use it to exemplify convergence monitoring. So I did what any right-thinking student would do and borrowed one of my mentor’s diagrams (which is why this will look familiar if you’ve read the convergence monitoring section of Bayesian Data Analysis 3).

First M steps of isotropic random-walk Metropolis with proposal scale normal(0, 0.2) targeting a bivariate normal with unit variance and 0.9 correlation. After 50 iterations, we haven’t found the typical set, but after 500 iterations we have. Then after 5000 iterations, everything seems to have mixed nicely through this two-dimensional example.

This two-dimensional traceplot gives the misleading impression that the goal is to make sure each chain has moved through the posterior. This low-dimensional thinking is nothing but a trap in higher dimensions. Don’t fall for it!
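For reference, the sampler behind the figure can be sketched in a few lines (Python standing in for the author's R; the target density is known only up to a constant, which is all Metropolis needs):

```python
import math
import random

random.seed(1)

def log_density(x, y, rho=0.9):
    """Bivariate normal, unit variances, correlation rho (up to an additive constant)."""
    return -(x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho * rho))

def metropolis(steps, scale=0.2):
    """Isotropic random-walk Metropolis; returns the chain of (x, y) states."""
    x, y = 0.0, 0.0
    chain = []
    for _ in range(steps):
        # Symmetric proposal: current state plus normal(0, scale) jitter
        xp = x + random.gauss(0, scale)
        yp = y + random.gauss(0, scale)
        # Accept with probability min(1, p(proposal) / p(current))
        if math.log(random.random()) < log_density(xp, yp) - log_density(x, y):
            x, y = xp, yp
        chain.append((x, y))
    return chain

chain = metropolis(5000)
xs = [x for x, _ in chain]
print(sum(xs) / len(xs))  # sample mean of x; hovers near the true mean 0
```

With the small proposal scale, the chain moves slowly, which is exactly why the 50-iteration panel has not yet found the typical set.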

Bad news from higher dimensions

It’s simply intractable to “cover the posterior” in high dimensions. Consider a 20-dimensional standard normal distribution. There are 20 variables, each of which may be positive or negative, leading to a total of 2^{20}, or more than a million, orthants (generalizations of quadrants). In 30 dimensions, that’s more than a billion. You get the picture—the number of orthants grows exponentially, so we’ll never cover them all explicitly through sampling.
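The counting is easy to verify:

```python
def orthants(d: int) -> int:
    """Number of orthants in d dimensions: each coordinate is positive or negative."""
    return 2 ** d

print(orthants(20))  # 1048576 -- over a million
print(orthants(30))  # 1073741824 -- over a billion
```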

Good news in expectation

Bayesian inference is based on probability, which means integrating over the posterior density. This boils down to computing expectations of functions of parameters conditioned on data. This we can do.

For example, we can construct point estimates that minimize expected square error by using posterior means, which are just expectations conditioned on data, which are in turn integrals, which can be estimated via MCMC,

\begin{array}{rcl} \hat{\theta} & = & \mathbb{E}[\theta \mid y] \\[8pt] & = & \int_{\Theta} \theta \times p(\theta \mid y) \, \mbox{d}\theta \\[8pt] & \approx & \frac{1}{M} \sum_{m=1}^M \theta^{(m)}, \end{array}

where \theta^{(1)}, \ldots, \theta^{(M)} are draws from the posterior p(\theta \mid y).

If we want to calculate predictions, we do so by using sampling to calculate the integral required for the expectation,

p(\tilde{y} \mid y) \ = \ \mathbb{E}[p(\tilde{y} \mid \theta) \mid y] \ \approx \ \frac{1}{M} \sum_{m=1}^M p(\tilde{y} \mid \theta^{(m)}).

If we want to calculate event probabilities, it’s just the expectation of an indicator function, which we can calculate through sampling, e.g.,

\mbox{Pr}[\theta_1 > \theta_2] \ = \ \mathbb{E}\left[\mathrm{I}[\theta_1 > \theta_2] \mid y\right]  \ \approx \ \frac{1}{M} \sum_{m=1}^M \mathrm{I}[\theta_1^{(m)} > \theta_2^{(m)}].

The good news is that we don’t need to visit the entire posterior to compute these expectations to within a few decimal places of accuracy. Even so, MCMC isn’t magic—those two or three decimal places will be zeroes for tail probabilities.
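These estimators are easy to sketch once draws are in hand. In the minimal Python illustration below, plain independent draws from a known posterior stand in for actual MCMC output, so the Monte Carlo estimates can be checked against the truth:

```python
import random

random.seed(42)
M = 100_000

# Stand-in "posterior draws": in practice these come from MCMC; here
# theta1 ~ N(1, 1) and theta2 ~ N(0, 1), independent, so the answers are known.
draws = [(random.gauss(1, 1), random.gauss(0, 1)) for _ in range(M)]

# Posterior mean estimate: (1/M) * sum of the draws of theta1
theta1_hat = sum(t1 for t1, _ in draws) / M

# Event probability Pr[theta1 > theta2]: the average of an indicator function
p_event = sum(t1 > t2 for t1, t2 in draws) / M

print(round(theta1_hat, 2))  # close to the true mean, 1.0
print(round(p_event, 2))     # close to Phi(1/sqrt(2)), about 0.76
```

Note the Monte Carlo standard error shrinks like 1/sqrt(M), which is exactly the "few decimal places" claim above.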

Continue Reading…


Read More

Earn an IBM Data Science Certificate

IBM’s Data Science Professional Certificate program on Coursera brings you everything you need to plunge into an exciting career in data science—no prior experience required! Start learning today.

Continue Reading…


Read More

Scaling Big Data and AI – Spark + AI Summit 2019

Data and AI are all about scale. Databricks is bringing the Spark + AI Summit to San Francisco Apr 23-25. Check out the full list of sessions at Summit to see more exciting talks. Use code KDNuggets200 and get $200 off registration.

Continue Reading…


Read More

Document worth reading: “Taking Human out of Learning Applications: A Survey on Automated Machine Learning”

Machine learning techniques are deeply rooted in our everyday life. However, since it is knowledge- and labor-intensive to pursue good learning performance, human experts are heavily engaged in every aspect of machine learning. In order to make machine learning techniques easier to apply and to reduce the demand for experienced human experts, automated machine learning (AutoML) has emerged as a hot topic in both industry and academia. In this paper, we provide a survey of existing AutoML works. First, we introduce and define the AutoML problem, with inspiration from both the realms of automation and machine learning. Then, we propose a general AutoML framework that not only covers almost all existing approaches but also guides the design of new methods. Afterward, we categorize and review the existing works from two aspects, i.e., the problem setup and the employed techniques. Finally, we provide a detailed analysis of AutoML approaches and explain the reasons underneath their successful applications. We hope this survey can serve not only as an insightful guideline for AutoML beginners but also as an inspiration for future research. Taking Human out of Learning Applications: A Survey on Automated Machine Learning

Continue Reading…


Read More

What is the interpretation of the diagonal of a ROC curve?

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

Last Friday, we discussed the use of ROC curves to describe the goodness of a classifier. I did say that I would post a brief paragraph on the interpretation of the diagonal. If you look around, some say that it describes the “strategy of randomly guessing a class”, that it is obtained with “a diagnostic test that is no better than chance level”, or even by “making a prediction by tossing of an unbiased coin”.

Let us get back to ROC curves to illustrate those points. Consider a very simple dataset with 10 observations (that is not linearly separable)

x1 = c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
x2 = c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
y = c(1,1,1,1,1,0,0,1,0,0)
df = data.frame(x1=x1,x2=x2,y=as.factor(y))

Here we can check that, indeed, it is not separable:


Consider a logistic regression (the course is on linear models)

reg = glm(y~x1+x2,data=df,family=binomial(link = "logit"))

but any model here can be used… We can use our own function
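The body of that function did not survive extraction; as a rough sketch of what such a helper computes, here is a Python analogue, roc_point, that returns one (false positive rate, true positive rate) point of the curve for a given threshold. The scores below are made-up stand-ins for predicted probabilities, not fitted values from the regression.

```python
def roc_point(scores, y, threshold):
    """One ROC point: (false positive rate, true positive rate) at a cutoff."""
    tp = sum(s >= threshold and yi == 1 for s, yi in zip(scores, y))
    fp = sum(s >= threshold and yi == 0 for s, yi in zip(scores, y))
    pos = sum(y)
    neg = len(y) - pos
    return fp / neg, tp / pos

# Same labels as the 10-observation toy dataset above; the scores are
# hypothetical stand-ins for the model's predicted probabilities.
y      = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0]
scores = [.9, .8, .7, .6, .55, .5, .45, .4, .3, .2]

print(roc_point(scores, y, 0.5))
```

Sweeping the threshold from high to low and collecting these points traces out the full ROC curve, which is exactly what the vectorized R call below does.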


or any R package actually



We can plot the two simultaneously here

V = Vectorize(roc.curve)(seq(-5,5,length=251))
points(V[1,],V[2,])
segments(0,0,1,1,col="light blue")

So our code works just fine, here. Let us consider various strategies that should lead us to the diagonal.

The first one is : everyone has the same probability (say 50%)



Indeed, we have the diagonal. But to be honest, we have only two points here: (0,0) and (1,1). Claiming that we have a straight line is not very satisfying… Actually, note that we have this situation whatever probability we choose



We can try another strategy, like “making a prediction by tossing of an unbiased coin”. This is what we obtain



V = Vectorize(roc.curve)(seq(0,1,length=251))
points(V[1,],V[2,])
segments(0,0,1,1,col="light blue")

We can also try some sort of “random classifier”, where we choose the score randomly, say uniformly on the unit interval



V = Vectorize(roc.curve)(seq(0,1,length=251))
points(V[1,],V[2,])
segments(0,0,1,1,col="light blue")

Let us try to go further on that one. For convenience, let us consider another function to plot the ROC curve


roc_curve = Vectorize(function(x) max(V[2,which(V[1,]<=x)]))

We have the same line as previously



But now, consider many scoring strategies, all randomly chosen

MY = matrix(NA,500,length(y))
for(i in 1:500){
  MY[i,] = roc_curve(x)
}
plot(performance(prediction(S,df$y),"tpr","fpr"),col="white")
for(i in 1:500){
  lines(x,MY[i,],col=rgb(0,0,1,.3),type="s")
}
lines(c(0,x),c(0,apply(MY,2,mean)),col="red",type="s",lwd=3)
segments(0,0,1,1,col="light blue")

The red line is the average of all random classifiers. It is not a straight line, but we observe oscillations around the diagonal.

Consider a dataset with more observations


myocarde = read.table("",head=TRUE, sep=";")

myocarde$PRONO = (myocarde$PRONO=="SURVIE")*1

reg = glm(PRONO~.,data=myocarde,family=binomial(link = "logit"))



V = Vectorize(roc.curve)(seq(-5,5,length=251))
points(V[1,],V[2,])
segments(0,0,1,1,col="light blue")

Here is a “random classifier” where we draw scores randomly on the unit interval


V = Vectorize(roc.curve)(seq(-5,5,length=251))
points(V[1,],V[2,])
segments(0,0,1,1,col="light blue")

And if we do that 500 times, we obtain, on average

MY = matrix(NA,500,length(y))
for(i in 1:500){
  MY[i,] = roc_curve(x)
}
plot(performance(prediction(S,Y),"tpr","fpr"),col="white")
for(i in 1:500){
  lines(x,MY[i,],col=rgb(0,0,1,.3),type="s")
}
lines(c(0,x),c(0,apply(MY,2,mean)),col="red",type="s",lwd=3)
segments(0,0,1,1,col="light blue")

So, it looks like we might say that the diagonal is what we get, on average, when drawing scores randomly on the unit interval…

I did mention that an interesting visual tool could be related to the use of the Kolmogorov Smirnov statistic on classifiers. We can plot the two empirical cumulative distribution functions of the scores, given the response Y
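As a sketch of that idea, using nothing beyond the definition, the Kolmogorov-Smirnov statistic is the largest vertical gap between the two conditional empirical CDFs of the scores: 1 when the supports do not overlap (the "perfect classifier" case), 0 when the two conditional distributions coincide (the diagonal case).

```python
def ks_statistic(scores, y):
    """Largest gap between the score ECDFs of the two classes."""
    s0 = [s for s, yi in zip(scores, y) if yi == 0]
    s1 = [s for s, yi in zip(scores, y) if yi == 1]
    ecdf = lambda xs, t: sum(x <= t for x in xs) / len(xs)
    return max(abs(ecdf(s0, t) - ecdf(s1, t)) for t in sorted(set(scores)))

# Non-overlapping supports: the "perfect classifier" case.
print(ks_statistic([.1, .2, .3, .7, .8, .9], [0, 0, 0, 1, 1, 1]))  # 1.0

# Identical conditional score distributions: the "diagonal" case.
print(ks_statistic([.2, .4, .2, .4], [0, 0, 1, 1]))  # 0.0
```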




we can also look at the distribution of the score, with the histogram (or density estimates)



The underlying idea is the following: we have a “perfect classifier” (top left corner)

if the supports of the scores do not overlap;

otherwise, we should have errors. That is the case below,

where in 10% of the cases we might have misclassification,

or even more misclassification, with overlapping supports.

Now, we have the diagonal

when the two conditional distributions of the scores are identical.

Of course, that is only valid when n is very large; otherwise, it is only what we observe on average…

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Pear Therapeutics: Data Scientist (Analytics) [San Francisco, CA, or Boston, MA]

As a Data Scientist, you will be responsible for shaping and delivering data-driven insights. We are looking for data scientists with a deep product sense, who have an innate curiosity, and are eager to dive into large, complex datasets and create actionable insights.

Continue Reading…


Read More

how YOU visualized it

A sampling of the many visuals created this month

The March #SWDchallenge took on a slightly different flavor. Rather than summoning you to try a certain graph type or approach, this month’s goal was effectiveness. We sought to see how people varied in answering specific questions about the same dataset. Data was sourced from AidData in partnership with Enrico Bertini, Associate Professor at NYU, who will be undertaking some data visualization research based on this challenge.

Sixty-nine people submitted a response to answer the primary question posed (“Who donates?”) and related sub-questions on the interesting patterns in the distribution across countries and recipients. With so many ways to visualize the same dataset, you’ll see evidence that there isn’t a single “right” answer when it comes to how we show and communicate with data. Data can be visualized in countless different ways, and by varying views of the same data, we enable our audience to see different things.

To everyone who submitted examples: THANK YOU for taking the time to create and share your work! We aren’t going to call out specific entries this month, so as not to introduce bias. By participating in this month’s challenge, you’ve helped Enrico push forward some important research. We’ll be sure to share more on that front once it’s available. The submissions are posted below in alphabetical order and include the link to the original Tweet or interactive visual.

We encourage you to scroll through the entire post and be inspired by your peers’ approaches to this challenge! Spoiler alert: inspiration will be a central theme in the next challenge, which will be announced on April 1. Until then, check out the #SWDchallenge page for the archives of previous months' challenges and submissions. Happy browsing!

Anthony P

Anthony S

Equal Pay Act

My approach to this challenge was to answer the question based on 2013 data, looking at who the top 10 donors are, who they have donated to, and for what causes. One thing I did differently was to group the purposes into broad categories to give a rough idea of what efforts the top 10 donors are focused on contributing to. (I am looking to write a Medium post about my approach soon.)

For this month’s challenge I decided to use Power BI as an opportunity to get more familiar with DAX. I focused on all donations by the Netherlands to other countries. Because there are a lot of recipients and a lot of purposes in this dataset, I decided to show only the top 5 of each. But because this shows only part of the picture, I also wanted to visualize this top 5 relative to the total amount. I did this with the stacked bar chart and the accompanying text. The interactive visual can be seen here.

I visualized a simple line chart with a cumulative sum of donations for the top 5 donor countries to answer the “Who donates?” question. The plot shows that since 1973, the United States has donated the most. However, it also shows that Japan is closing in on the top spot, and likely already has taken it.

Thanks for the challenge. I developed a little mobile-first visualization that lets people select their country and check the top donors, the cumulative total over time, and how donations are distributed geographically.

Steve B

Steve W

Vizu All

Click ♥ if you've made it to the bottom—this helps us know that the time it takes to pull this together is worthwhile! Check out the #SWDchallenge page for more. Thanks for reading!

Continue Reading…


Read More

Operator Notation for Data Transforms

As of version 1.0.8, cdata implements an operator notation for data transforms.

The idea is simple, yet powerful.

First let’s start with some data.

d <- wrapr::build_frame(
  "id", "measure", "value" |
    1   , "AUC"    , 0.7     |
    1   , "R2"     , 0.4     |
    2   , "AUC"    , 0.8     |
    2   , "R2"     , 0.5     )

id measure value
1 AUC 0.7
1 R2 0.4
2 AUC 0.8
2 R2 0.5

In the above data we have two measurements each for two individuals (individuals identified by the "id" column). Using cdata‘s new_record_spec() method we can capture a description of this record structure.


record_spec <- new_record_spec(
  wrapr::build_frame(
    "measure", "value" |
      "AUC"    , "AUC" |
      "R2"     , "R2"  ),
  recordKeys = "id")

## $controlTable
##   measure value
## 1     AUC   AUC
## 2      R2    R2
## $recordKeys
## [1] "id"
## $controlTableKeys
## [1] "measure"
## attr(,"class")
## [1] "cdata_record_spec"

Once we have this specification we can transform the data using operator notation.

We can collect the record blocks into rows by a "factoring"/"division" (or aggregation/projection) step.

id measure value
1 AUC 0.7
1 R2 0.4
2 AUC 0.8
2 R2 0.5
d2 <- d %//% record_spec

id AUC R2
1 0.7 0.4
2 0.8 0.5

We can expand record rows into blocks by a "multiplication" (or join) step.

id AUC R2
1 0.7 0.4
2 0.8 0.5
d3 <- d2 %**% record_spec

id measure value
1 AUC 0.7
1 R2 0.4
2 AUC 0.8
2 R2 0.5

And that is truly fluid data manipulation.
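For readers who want the semantics without cdata, the two operators can be sketched in plain Python dictionaries. The names divide and multiply are mine (hypothetical), and the record spec from above is hard-coded for brevity.

```python
def divide(rows, record_key="id"):
    """Blocks -> row records, in the spirit of d %//% record_spec."""
    out = {}
    for r in rows:
        rec = out.setdefault(r[record_key], {record_key: r[record_key]})
        rec[r["measure"]] = r["value"]
    return list(out.values())

def multiply(rows, measures=("AUC", "R2"), record_key="id"):
    """Row records -> blocks, in the spirit of d2 %**% record_spec."""
    return [{record_key: r[record_key], "measure": m, "value": r[m]}
            for r in rows for m in measures]

d = [{"id": 1, "measure": "AUC", "value": 0.7},
     {"id": 1, "measure": "R2",  "value": 0.4},
     {"id": 2, "measure": "AUC", "value": 0.8},
     {"id": 2, "measure": "R2",  "value": 0.5}]

d2 = divide(d)      # one row per id, with AUC and R2 as columns
d3 = multiply(d2)   # back to the original block form
```

The round trip d -> d2 -> d3 recovers the original rows, which is the "division then multiplication is the identity" property the operator notation is built on.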

This article can be found in a vignette here.

Continue Reading…


Read More

Machine Learning Boosts Startups and Industry

BigML, the leading Machine Learning platform, and GoHub from Global Omnium join forces with a strategic partnership to boost Machine Learning adoption throughout the startup and industry sectors. This partnership helps the tech and business sectors apply Machine Learning in their companies, provides them with Machine Learning education and helps them remain competitive in the […]

Continue Reading…


Read More

Critical Thinking in Data Science

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Debbie Berebichez, a physicist, TV host and data scientist and is currently the Chief Data Scientist at Metis in NY.

Introducing Debbie Berebichez

Hugo: Hi there, Debbie, and welcome to DataFramed.

Debbie: Hi, Hugo. It’s a pleasure of mine to be here.

Hugo: It is such a pleasure to have you on this show, and I’m really excited to be here today to talk in particular about critical thinking in data science, and what that actually means, and as we know, to get critical about critical thinking and to see what aspects of data science in the space, what ways we are being critical, where we can actually improve aspects of critical thinking, particularly with respect to data thinking in general. But before we get into that, I’d love to know a bit about you. So, could you start off by telling us what you’re known for in the data community?

Debbie: Sure. Thank you. Well, I’m not sure I’m that well known in the data science community, but if I am, I would say it’s because I’m a big promoter of both critical thinking and of getting minorities, such as women, and especially Hispanic women, to enter the fields of STEM, including data science, and I’ve promoted and have started a bunch of initiatives geared towards getting more women to get into science, technology, and engineering. The second reason could be because I cohost a TV show for the Discovery Channel called Outrageous Acts of Science, so I know that a lot of people know me from there.

Hugo: Right, and you also are at Metis, aren’t you, the data science bootcamp?

Debbie: Absolutely. I was gonna say that next. So, I’m the chief data scientist at Metis. Metis is a data science training company that is part of Kaplan, the large education company, and we basically have two modes of teaching data science. One is through bootcamps, which we host in person at four locations, in New York, Chicago, San Francisco, and Seattle, and the second mode of teaching is through corporate training and other products. So, we teach live online intro to data science as a pre-bootcamp course, but we also customize various courses for corporations that need either visualization courses or Python programming or big data techniques and whatnot, and we’ve had quite a bit of success with that.

Hugo: That’s great, and I look forward later on to talking about kind of the relationship to your work at Metis, bootcamps in general, how they can prepare people for a job market where … In the job market, in some respects, coding skills are at the forefront and not critical thinking skills, and how to deal with that trade-off in the education space, which is something we think a lot about.

Debbie: Absolutely.

Hugo: On top of that, though, you mentioned you’re a big promoter of women, in particular, Hispanic women in the space, and correct me if I’m wrong, I may have messed this up completely, but you were the first Mexican woman to get a PhD from Stanford in physics?

Debbie: Wow. You didn’t get it wrong.

Hugo: I got that right?

Debbie: Yes.

Hugo: Fantastic.

Debbie: That’s right, and I think it’s an important statistic not so much to brag about it, but to show that examples like mine, of persevering and working really hard and making your dream come true exist out there, and they’re so important to talk about because they really serve as inspiration for people who sometimes think that their particular minority group or so is not suited for a career in data science or STEM.

Hugo: So, is this how you got interested in data science and computation initially, through a physics PhD?

Debbie: Yeah, yeah. I have kind of a, I guess, not so atypical background for data science. I did my PhD in physics at Stanford, like you said, and I did theoretical physics. I did a lot of computational work the last two years, and so I learned about models and programming and working with data. Then I moved to New York to do two postdocs at Columbia University and at the Courant Institute, part of NYU, after which I decided, like a lot of physicists, to work in Wall Street for a few years as what is sometimes derogatorily called a quant. I was involved in creating risk models, and I did a lot of data analysis, and that’s when I realized that my skills in math and programming had other alternative ways of being applied, not just in physics.

Debbie: So then, after Wall Street, I thought that that was not the field for me because I didn’t really care about just making money, even though making money is nice, but I had bigger aspirations, and I wanted to do data and ethics and help the world and change the world in many ways, and so I’d heard about this new field, sort of new field for me at least, called data science about 10 years ago, and I took a course. It was kind of like a bootcamp. I had the skills, but I didn’t know how to translate them into the different techniques and algorithms that are typical of data science. So, after taking that course, I jumped ship and I started my career in data science.

Hugo: Awesome. That’s a really interesting trajectory, and I just want to step back a bit, and if you don’t want to talk about this, we don’t have to, but I’m just wondering, coming from where you were in Mexico, did you have kind of a social, cultural, and even familial or parental support to go down this path?

Debbie: No, I didn’t, and that is precisely why I care so much about inspiring and helping other young women who, like myself, feel attracted to a career in science or engineering, but who for some reason, whether it be financial or social, feel that they cannot achieve their dreams. From a very young age growing up in Mexico City, I was discouraged from pursuing a career in physics and math because I was a girl, and I was told by friends and parents and teachers in school that I better pick something more feminine, and that to do physics I had to practically be a genius, which I knew I wasn’t, and so they really discouraged me so much that I became insecure about my math skills and about my ability to conquer and study the field.

Debbie: So, years later when it came to go to university, I picked philosophy as an undergrad because I thought that that was something similar to physics. It had a lot of questions, and you could use your imagination to ask yourself why are we here, and all kinds of things that had to do with objects that surround us and their meaning and whatnot, but I realized, Hugo, that the more I tried to hide my love for physics and math, the more that this inner voice telling me to go for it and to study it was screaming at me, until two years into the bachelor’s program in Mexico, I decided behind everyone’s back to apply to schools in the US as a transfer student, and it was difficult because in Mexico we were paying an eighth of what universities cost in the US, and especially as a foreign student, it’s very hard to find scholarships and financial help, but I was extremely lucky that I got full scholarship offered to me by Brandeis University in Massachusetts, and so in the middle of my BA in philosophy, I transferred to Brandeis in the winter. I hadn’t seen the snow before, and I picked up philosophy courses, but right in my first semester I had the courage to take my first intro to physics class. It was a very large classroom with a hundred students, and the class was astronomy 101.

Debbie: In that class, I realized that my passion and my love for physics was not gonna go away, and I befriended the teaching assistant in the classroom who was a graduate student by the name of Rupesh, who came from India. He came from Darjeeling, town in the Himalayas, and Rupesh and I became friends, and we would meet all the time, and he was the first person who truly believed in me, and he told me that I wasn’t the typical student that just wanted to get an A in the homework, that my questions were just so curious, and I was so inquisitive, and that I really, really cared about knowing about the planets and quantum mechanics and statistical mechanics, and all kids of things, and so he really encouraged me to try to do physics, until one day, we were walking in Harvard Square in Cambridge, and we sat under a tree, and I looked at Rupesh with tears in my eyes, and I said, “Rupesh, I just don’t want to die without trying. I don’t want to die without trying to do physics.”

Debbie: He got up, and we didn’t have cellphones at the time, but he called his advisor who was the head of the physics department at Brandeis, Dr. Wardle, who was the professor in my astronomy class, and he said, “I have a student here who has a scholarship for only two years because she’s a transfer student, and I know that BA in physics takes normally four years to complete, but she’s really, really passionate. What can we do about it?” So, Dr. Wardle called me into his office, and we had a conversation, and he basically told me, me and Rupesh, who was there with me, he said, “Believe it or not, there’s somebody else who’s done this in the past at Brandeis. His name is Ed Witten. He is-”

Hugo: Wow …

Debbie: I know. For those people who know physics and know who he is, he’s basically the father of string theory, so he definitely qualifies as a genius, and so I thought he was pulling my leg, like okay, Ed Witten, there’s no way I could achieve this. But he said, "Ed switched at Brandeis from history to physics, and he did it in only two years," because I couldn’t ask my family to pay for another extra two years to stay there, and so what Dr. Wardle offered is he gave me a book called Div, Grad and Curl, which is vector calculus in three dimensions, and basically, he said to me, "If by the end of the summer you’re able to master this material," and Hugo, I didn’t even remember algebra at this point-

Hugo: And of course, there’s a whole bunch of linear algebra, which goes into this vector calculus. Right?

Debbie: Of course. There’s so much background you have to know to even get into studying this book. So, he said, "If in two months," because this was in the month of May, "you’re able to master this material, we’ll give you a test, and we’ll let you skip through the first two years of the physics major, so you can basically finish the whole BA in only two years." So, Rupesh looked at me, and he said, "We’re gonna do this," and he decided, incredibly, to devote his entire summer from mid June to end of August to teaching me and mentoring me, and basically covering all the subjects that I needed to master in order to enter the third year of physics in September.

Debbie: It was amazing because I was so incredibly hardworking and passionate that I didn’t move from my desk. Every day, Rupesh taught me from 9:00 in the morning till 9:00 p.m. We didn’t have much time, so it was just practical, knowing how to solve derivatives on Saturday. Sunday, we’ll do integrals. Monday, first three chapters of classical mechanics, and you get the idea. So, at the end of the summer I presented the test, and I passed. I tried to not burn too many capacitors in my first electronics lab at the time, and I remember how incredibly grateful I was to Rupesh, this person that absolutely changed the course of my life.

Debbie: I tell this story every time I have an opportunity because it’s incredible to me what Rupesh told me. I basically always wanted to pay him for all that he dedicated to me and all the effort he put into tutoring me, and he said to me that when he was growing up in India, in Darjeeling, there was an old man who used to climb up to his little town in this mountainous terrain, and used to teach him and his sisters the tabla, the musical instrument, math and English, and every time the family wanted to compensate this old man, he said, "No. The only way you could ever pay me back is if you do this with someone else in the world."

Debbie: That beautiful story is how my mission in life began, and Rupesh passed the torch of knowledge to me to inspire, help, and encourage other minorities who, like myself, dream of becoming scientists or engineers, but who for some reason lack the confidence or the skills at the time, and that has really informed my career. It has been the passion that connects everything that I’ve done, and I’m incredibly grateful to that pay-it-forward story. So, after graduating with highest honors from Brandeis is when I went to Stanford, and I reconnected with Rupesh only about seven years after that, because he had gone to the South Pole to be a submillimeter astronomer, and we connected, and he was incredibly proud that I managed to graduate and do my research with a Nobel Prize winner at Stanford, and it was a great story.

Critical Thinking and Data Science

Hugo: Firstly, Debbie, thank you so much for sharing that beautiful story. Secondly, I wish I had a box of tissues with me right now, and thirdly, I feel like I was sitting there under that tree with you and Rupesh solving all the vector calculus challenges, and I want to give Rupesh a big hug and a bunch of cash right now as well, but of course, I’ll do exactly what I’m trying to do and what we need to be doing, which is paying it forward, and I think that actually provides a great segue into talking about critical thinking and data science, how we think about critical thinking as educators, being critical of critical thinking, and maybe I want to frame this conversation by saying there’s just a lot of talk around the skills aspiring data scientists, data analysts, data-fluent, data-literate people need to know, and sometimes to me, anyway, the conversation around this seems to be a little bit superficial, and I was wondering, firstly, if that’s the case for you, and secondly, if it is, what seems superficial about it?

Debbie: Yes. I’m so glad you’re asking this question, Hugo. I can’t tell you how many times I have visited programs where I’ve been a mentor for high school students, and I’ll give you one example. One of these afternoon programs was receiving quite a bit of funding, and there were three groups of young girls from high school working in data science, and they had been taught SQL, so they were masters at it, much more than I was ever proficient at their age, so I was like, “Wow. These girls are really impressive.” There were three groups. They were working at a museum, and so one of them was working with a data set that was about birds in the museum, and they were trying to find patterns by looking at their demographics of the birds and their flying patterns and all this kind of information.

Debbie: Another group was looking at astronomical objects, and a third group was working with turtles because the museum had a whole bunch of turtles in an exhibit. So, I went to the third group that was working with turtles, and I looked at the data that they were working with, and one of the columns said weight, so the weight of the turtles, and so I said, “Oh, wow. So, just out of curiosity, how big are the turtles that you’re working with? Have you ever seen them?” They said, “Oh, yeah, we have. They’re about the size of the palm of my hand.” I said, “Oh, cute. I’d love to see those turtles.” I said, “Okay. So, is the weight here that you have in the column … You don’t have any units for it because you just have the number, and the numbers are around 150 and 200 and 300. So, is this weight in pounds? Is it in kilograms? What is this weight in? What are the units?”

Debbie: All of a sudden, these six girls in the group got all quiet, and none of them ventured to answer until one of them raised her hand and said, “Oh, I think it’s in pounds,” and I said, “Oh, wow. Let’s see. I’m about five-foot-three, and I weight probably about 120 pounds, so this is interesting because a turtle that’s the size of my hand, basically, you’re telling me it weighs double the amount of pounds that I do. Does that make sense?” Then they all laughed and said, “Oh, yeah. You’re right. It doesn’t make sense,” and we had this very nice conversation, and we went back and forth. It turns out, after an hour, we finally found a teacher who knew, and for certain, gave us the information that the weight was actually in grams.

Hugo: Wow.

Debbie: So, the girls were surprised, and that story really caught my attention because I had been visiting a lot of schools and programs that are trying to teach coding in a very kind of fast and superficial way, just to be able to say, "Our students know how to code," and I realized that in an effort to get more and more people to know the skills for data science and for data analysis in a world that’s going way too fast where we need to prepare our students for jobs in AI and machine learning and whatnot, we are forgetting what all of this is for. Coding and analyzing data has a purpose. It’s not an end in itself. The purpose is to be able to solve problems and to have insights about what the data is telling us.

Debbie: If we’re not taught to ask the right questions and to think critically about where the data comes from, why is it being used or collected in a certain way, what other data could help or hurt my dataset, what biases are being introduced by this dataset, if we’re not teaching our kids to think what’s behind these techniques, then we’re basically failing, because we’re just making them like robots who can only perform a simple task if, and only if, the next dataset they see is similar in scope and structure to the one that they’re learning to work with.

Debbie: It was a very moving, and in a way, also painful experience to see, because I realized how needed are those critical skills, and not only in the education at the high school level, but how many projects haven’t we seen at companies, at very large companies and advanced data science groups where there’s a significant bias being introduced because no one bothered to include a certain minority but important group in the statistical sample, or bias was introduced because people didn’t bother to check what some outliers in the dataset were describing et cetera. So, I’m very, very passionate about teaching the critical thinking skills that are behind our why for why we do data science.

Collecting Data

Hugo: You’ve spoken to so many essential points there. The overarching one is critical thinking, and what I like to think of, data thinking or data understanding before even … There’s a movement to put data into models and throw models at data before even looking at, as you say, units or important features, or really getting to know your data, getting to understand it, and performing that type of exploratory data analysis, and a related point that underlay a lot of what you were discussing there is thinking about the data collection process as well, and if you’re collecting data in a certain way, what are you leaving out? What are your instruments not picking up? Is your data censored for any of these reasons? Are you leaving out certain demographics because they don’t use a particular part of your service?

Debbie: Mm-hmm (affirmative). Exactly. Exactly, and I think I see a lot of companies that don’t really know what data science is about, because it has become this buzzword, and everyone wants to be in it, but nobody really knows exactly what you can get out of it, and what’s happening is a lot of companies are investing significant dollar amounts in big data and solving big problems because they have collected so much data, they just build a huge infrastructure and try to find insights, but without really knowing if, first of all, those insights are important for the company, and second of all, if they find them, would they be able to use them for something and enact policies that are actually going to be helpful for the goals of the company? I always remind them with this kind of simple example. One of my heroes in physics is Tycho Brahe, who was a very famous Danish astronomer. Basically, he was locked up in a tower on an island in Denmark, which I actually had the opportunity to visit last summer.

Hugo: Oh, really?

Debbie: Yes.

Hugo: Wow.

Debbie: He lived in the 1500s, an amazing man, but he also had a … Apparently, he was a nobleman. He had an awful personality, and he lost his nose in a duel.

Hugo: They say he replaced it with a golden bridge, I think.

Debbie: With a bronze-

Hugo: Bronze. Yeah. Okay, great.

Debbie: I think that has been discredited a bit. That’s what they told me in the museum. But anyways, yeah, this very interesting character, but the amazing thing about him is that he looked at the sky without any telescope. He basically had created these sophisticated instruments, but in the 1500s, it took him years, and he created a catalog of only about a thousand stars. That’s it. So, that’s a very, very small dataset by today’s standards, but from only those thousand data points (I think it was more like 1,800 or so, to be accurate) came the theories later developed by Kepler and Copernicus, from which the laws of planetary motion were derived.

Debbie: Basically, Kepler used that, and then Isaac Newton used it as the basis for the law of gravity. So, from those thousand data points came universal theories that we’re still using today, that are incredibly powerful and deep, and that is a good example to say that sometimes we can put a lot of investment into huge datasets, but when we’re talking about data literacy, large datasets also have a lot of noise, and you have to start by teaching that the most important thing is the insight that you’re going to derive from that dataset and not its size.

Big Data

Hugo: I’d like to speak to this idea of the focus on big data and the fact that a lot of us are collecting as much data as possible, thinking that all the information we need will be contained in there, even before asking critical questions, which is very dangerous, but before that, I just want to say tangentially, Tycho Brahe and Kepler’s story is so wild. I haven’t looked into it in a while, but if I recall correctly, Kepler wanted to unlock the secrets of planetary motion and figure out what was happening, and he realized that Tycho had the data. So, this is a story of someone realizing someone else has this data, and he went to work with him in Tycho Brahe’s, I think, final years, and Tycho didn’t even give him all the data at that point. He was actually very secretive about the data he had, and even when Brahe died, Kepler had to struggle with Brahe’s family in order to get the data. So, there were all types of data secrecy and data privacy issues at that point as well.

Debbie: Also, data ownership, because what-

Hugo: Exactly. That’s what I meant. Yeah.

Debbie: Most people know who Kepler was, but if you ask people about Tycho Brahe, very few non-science people know, and that’s because a lot of the credit went to Kepler, and some people argue that the one that did all the meticulous observations and had theories about it was Tycho, and so he deserved more credit. So, it was kind of a crazy time, and lots of fights about data were happening.

Hugo: Of course, we’re talking about a decoupling or a separation of, let’s say, humans into the people who are fantastic at collecting data and the people who are fantastic at analyzing it as well. This is a division in a lot of ways.

Debbie: Yeah, absolutely.

Hugo: But this focus on big data, the fact that even a lot of companies’ valuations are based around the fact that they have so much data, and it must be useful in the future, right? This is incredibly dangerous for practitioners, but also for society.

Debbie: Absolutely. I mean, we did have a tipping point in that we had the hope in the ’70s of AI and changing the landscape of our society, and it didn’t quite deliver in its promise because we didn’t have the capacity to analyze very, very large datasets like we do now, and there was a tipping point where now we are able to analyze these much, much larger datasets. I mean, I think every day in the world, we produce 10 to the 18 bytes of data, like 3 exabytes of data, something like that, that we generate. So, obviously these are enormous scales, but what’s important is not that we now have this capacity to analyze it, but are we really getting a significant marginal insight, or are the insights that we’re getting commensurate with the ones that we were getting when we didn’t have such large datasets?

Debbie: I think that question’s still out there. We haven’t been able to answer it because, as you know, the real important applications of AI are still being created and worked on. A lot of the AI things that we see out there are still simplistic in that they don’t use all of the incredible and deep capacities that AI has to solve problems. So, dimensionality of the data matters. It matters a lot, and probably for certain problems, it’s going to be hugely important. But my point is more about when you’re educating people or when you’re a company investing in certain technology, you have to be able to walk before you run, so start analyzing the smaller datasets, come up with strategies that are based more on critical thinking, and the questions that you’re trying to solve rather than the size of your dataset, and the size of the infrastructure that you’ve built.

Top 3 Critical Thinking Skills to Learn

Hugo: Great. So, I’ve got a thought experiment for you, which may happen all the time. I have no idea. But a student, an aspiring data scientist or data analyst, comes to you and says, "I need to learn some data thinking skills, some critical thinking skills to work with data. What are the top three critical thinking skills that you think I should learn, Debbie?"

Debbie: Thanks for that question, Hugo. I think the first one is you have to be a skeptic about data. You have to always … Just like when you read a scientific paper, you have to know who paid for this research. Was it the drug company that is sponsoring a paper that says their drug is the only and best drug in the world? Clearly, I’m not gonna trust that paper. So, a healthy skepticism about the team that collected the data, what biases could have been introduced, where was this data taken, how was it collected, what things were left out, what variables would be important in the future, et cetera. All those questions I think are super important. So, if you don’t ask them before even doing exploratory data analysis, it means you’re thinking about the data, and your relationship with the data is gonna be limited.

Debbie: The second one, and this one, I came up with it from another famous physicist, Richard Feynman, who said, "The ability to not fool oneself is one of the hardest and most important skills one can acquire in life," because it’s very easy … Sometimes we think, oh, I wouldn’t be fooled by anyone, not any marketing campaign, not any government is gonna fool me, but we fool ourselves much more often than the people interpreting the data out there. So, the ability to not fall in love with what we think our data should be telling us, that is what I call fooling yourself, that is super important.

Debbie: The third skill is connecting the code and the algorithms to the real world, like my example with high school girls that were working with the data. To be working with a database for three months and forgetting that behind the data are actual turtles, in this example, that’s a big mistake, the same way when Facebook is incredible at doing face recognition and analyzing relationships between groups and people, but if they’re forgetting that behind those connections are real people with real lives and real consequences, then we’re failing. We need to really connect our analysis to the world out there.

Hugo: I agree, and I just want to go through those again, because I’m sure our listeners are scribbling away trying to remember all of this. So, the first one was a healthy skepticism about data, the second, the ability to not fool yourself, and the third, connecting the code and the real world and all the stakeholders that actually exist on the ground.

Debbie: Correct. Thank you, Hugo.


Hugo: So, I just want to build slightly on the ability not to fool yourself. I mean, all of these are incredibly important, but there’s a paper called, I hope I get this right, Many Analysts, One Dataset, that we’ve discussed once or twice on the podcast before. It takes a whole bunch of statisticians and domain experts, separates them into teams, gives each team the same dataset, and asks … It’s a dataset of, I think, either yellow or red cards given to football players in football or soccer matches, and the question is, in these decisions to give cards, is there some sort of ethnic bias or a racial bias?

Hugo: The fact is, what happened was 70% of the teams said one thing, 30% said the other thing, either yes or no, and then when they got to see everyone else’s results, nearly all the teams were even more sure of their own techniques and their own results. There are a lot of reasons for this, but one of the points is that people go in with a certain bias already, and if you have a bias going into a dataset, you make all these micro-decisions as an analyst, which helps you get to the place that you already thought you were going, right?

Debbie: Yeah. You reminded me, funnily, of a paper that I discussed. I don’t even think you could consider it a scientific, sophisticated paper, but it was a paper done for an astrology, not astronomy, association in India years ago, and I talked about it at a conference. Their hypothesis was that, from astrological charts that tell you certain characteristics about some kids, these gurus and chart readers would be able to guess, I think, whether the students were intellectually gifted or just going to be average students in school, based just on their astrological chart. They gave themselves a pretty low bar. They said, "If we are able to guess 60% of the outcomes right, then that means we are gurus, and astrology is true, and we are able to predict this with very high confidence." That was their confidence level.

Debbie: The funny thing is even though they did slightly worse than a coin toss, that is they got 49% of them right, and anybody in their right mind would be able to say, "Well, clearly they did even worse than chance, a toss of a coin would’ve done better," but they themselves patted themselves on the back saying, "You see? We got 49% right. We can do this." So, it’s a very funny paper, and I encourage people to read it because it’s so easy to fool ourselves.

Hugo: Absolutely, and the best thing about doing worse than a coin toss is you could actually just switch all your decisions and do better. So, we’ve been talking about critical thinking at an individual and societal level. I’m wondering how you think about the needs for all these skills, critical thinking skills, how they should be spread through organizations, and what I mean is, what type of critical thinking and data thinking skills will be needed and are needed for people who don’t even work directly with data themselves, but in jobs impacted by data?

Debbie: Yes. That’s an excellent question because I think the more that our field of data science grows, the more that we get different dependencies in companies, different groups needing insights or even having contact with the data, and not everybody’s going to be a data scientist. We’re gonna have people just interpreting visualizations that come from the data, others using APIs and having to interpret what the algorithms come up with and whatnot. So, I think it’s essential that we spread the critical thinking message across organizations, and it has to start early in school, because the ability to ask the right questions in an industry setting is incredibly important, and I don’t think we’re putting enough emphasis on it. So, I think everybody in an organization has to be trained about things such as data ethics. How is the data being collected? Are we using it for the right purpose? Data ownership, data privacy, data security, all kinds of issues that impact the manipulation of data, and so that’s part of the critical thinking process.

Hugo: Hopefully, this aspect of understanding on the part of people in society and other working professionals who aren’t data scientists will result in less burden on the data scientists. What I really mean by that is … Well, there are a few ways to frame it. The first way is, I think it was probably Nate Silver who said this. Any quotation where I don’t know who said it, I’ll just attribute to Nate Silver, generally. But it was probably Nate Silver who said something like, "When a data scientist gets something right, they’re thought of as a god, and when they get something wrong, they’re thought of as having made the worst mistake ever," as opposed to a job in which sometimes you get it right, and sometimes you get it wrong.

Hugo: Another way to frame it is, it’s kind of viewed by people without data skills as, "I have no idea how to deal with this, so this is what you’re going to do, and you’re a prophet, or you’re the holder of divine knowledge, or the high priest of data science," as I like to call them. I wonder whether, as people who aren’t data scientists develop more data skills, this will actually help bridge that gap in a lot of ways. So, how do you think about these types of issues and challenges when building data science curricula at Metis and elsewhere?

Debbie: Yeah. It’s very important for me to learn … I’m not an expert in the field of learning science, but it’s very important to me to learn how to best build curriculum that optimizes these critical thinking principles and questions that I’m talking about, and so it really depends on the curriculum. So, for example, with a team including Cathy O’Neil, who I know you’ve interviewed before and who I love, and a group of others, seven executive women, with funding from Moody’s Analytics and the help of Girls, Inc., we developed the first data science curriculum for high school girls from under-served backgrounds, and we deployed it in several New York high schools.

Debbie: So, I think it was just this amazing experience because we try to emphasize focusing on the topic and what the consequences were of every single step in the process, from data collecting, to choosing the algorithm, to knowing how to measure the accuracy, the recall, the precision, everything that we were doing, where it comes from, how to choose the metric that was right for the problem at hand, et cetera, and so the intention was very conscious to be about how to get the most insight about the limitations and the successes of the challenge or the problem at hand.

Debbie: When I build curriculum for the Metis bootcamp currently in my position, I want the students to have a pretty broad set of tools with which they can crack really hard problems. So, I may not focus on getting every single clustering algorithm there is in the curriculum, but I will focus on how to analyze the results of the clustering algorithms that we will see, and how to know if we’re using the right algorithms for the problem at hand, and how to be able to ask that question of our colleagues, of our communities, et cetera, because we all have limitations to our knowledge.


Hugo: Yeah. There are two things there I want to focus on. The first is, as you said, at Metis, thinking about the actual problems, and thinking about the question at hand before even getting coding I think is incredibly important, and also, educating people through questions that really pertain to them and are interesting to them. So, students will ask me, "If I want to embark upon my first data science project, what would you suggest I do?" I say, "Well, what are you interested in," and if they have a fitness tracker, for example, I say, "Maybe you could analyze your own fitness data. If you’re a foodie, scrape Yelp reviews of restaurants and work with that type of stuff. If you love movies, if you’re a cinephile, the OMDB has a fantastic API."

Debbie: That’s exactly what we do at Metis. We have our students in the bootcamp use their own dataset, and they create their own project. So, it’s really cool. I encourage people to go to, and it’s a site where we have some of our greatest projects, and it’s incredible because you see people that had very basic math and programming skills coming in, and in three months they’re able to analyze contamination sources in the ocean, or some healthcare-related thing, or an app that helps you choose the best restaurant for crepes that evening, and stuff like … It’s really, really cool what you can do.

Hugo: Yeah, and I’ll build on that by saying I’ve been to several of Metis’s graduation presentations. What do you call them?

Debbie: Career Day.

Hugo: Yeah. They’re incredible, and seeing all the learner students there present the work they’ve done is amazing, and I know that … For example, you know I’ve had Emily Robinson on the podcast. I work with her now at DataCamp, and she completed Metis, and I think she went to Etsy straight from Metis. I could be wrong there.

Debbie: Yes. We love Emily.

Future of Data Science

Hugo: Yeah, incredible. So, we’re gonna wrap up in a few minutes, but I’d like to … We’ve talked about the state of play of critical thinking today, but I’d like to … It’s a prediction problem. So, what does the future of data science look like to you, Debbie?

Debbie: To me, it’s going to merge with the industry of IOT or the internet of things. That is, as we see the ubiquitous sensors, that these sensors are simply everywhere, from medical devices, to buildings that are smart buildings testing our comfort level, to apps that match our behavior, it’s-

Hugo: I mean, you’re right. We wear them, and we carry them in our pockets, right?

Debbie: Exactly, and just like the personal computer came to revolutionize the information technology field, in the same way, IOT is going to revolutionize it, and we’re gonna see a new paradigm where we’re going to collect substantially larger amounts of data about ourselves, our behaviors, our connections, and so issues that have to do with data privacy, data ownership, security, analysis, and insights are going to become ever more important. So, what I predict is that with more automation, we’re gonna have more need for people that are not necessarily the data scientists working with the data, but are working in the field to analyze the ethical consequences of it, to act as peer-review committees, to see if there should be policies or regulations that should be enforced around certain applications, et cetera. So, that’s what I see for the future: more and more need for adjacent professions that help with the data analysis process.

Hugo: Yeah, I think you’re right in terms of defining it anyway or describing it as a merging between data science and IOT and automation. I can’t quite remember, did you give a talk on the internet of things at the NYR… Jared’s conference, a few years ago?

Debbie: Yes, I did at the R… Yep. Yeah.

What is your favorite data science technique?

Hugo: Okay, great. Well, I loved that talk, and Jared puts all those talks up online, so I’ll find a link for that and put that in the show notes as well, if anyone’s interested. So, I want to get a bit technical. I’m wondering what one of your favorite data sciencey techniques or methodologies is, just something you love to do.

Debbie: I actually really, really love singular value decomposition, SVD. I’ve always loved linear algebra, and just the thought of being able to reduce the dimensionality of a problem is so sexy to me. In physics, we deal with it all the time, and my first encounter with it was when I worked briefly with David Botstein. This is many, many years ago at Stanford. He’s one of the creators of Genentech, the biotech company, and we were analyzing the data coming from DNA microarrays, which basically compare a sample of healthy DNA with a sample that came from a patient in order to conclude whether the patient had cancer, and in the case of a positive answer, what type of breast cancer it was.

Debbie: So, it was really, really interesting because, obviously, there are so many genes in our genome that the dimension of the problem was humongous, and so to apply SVD and be able to reduce it to the dimensions that were most important enabled them to come up with pretty customized drugs that I have heard, because I have since stopped working in that topic, but I’ve heard are working quite well for different types of breast cancer. So, the applications of SVD are incredible, and so I don’t know, I just really like that conceptually, and anything that has to do with that, even NLP and, I don’t know, just seeing what you can get by sacrificing a bit of information is just really interesting to me.

Hugo: Well, I’m sold. I mean, you’ve motivated it through linear algebra, which I also love, and then you gave some incredibly important examples of its use, and for those of you out there who know of PCA, I’d definitely suggest you to check out SVD as well.

Debbie: Yeah.
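For readers who want to see the idea Debbie describes in action, here is a minimal NumPy sketch of SVD-based dimensionality reduction on a synthetic "patients x genes" matrix. This is purely illustrative; the matrix, sizes, and rank are invented stand-ins, not her microarray analysis.

```python
import numpy as np

# Hypothetical "patients x genes" matrix -- a stand-in for microarray data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1000))

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k strongest singular directions (dimensionality reduction).
k = 5
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Relative error of the rank-k approximation: the "bit of information
# sacrificed" in exchange for a 4x smaller representation.
err = float(np.linalg.norm(X - X_k) / np.linalg.norm(X))
print(X_k.shape, round(err, 3))
```

The same truncated factorization underlies PCA, latent semantic analysis in NLP, and many recommender systems.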

Call to Action

Hugo: I’ve got one final question for you. Do you have a final call to action for our listeners out there?

Debbie: Yes, I do. I’ll repeat, Hugo, what I said in my Grace Hopper Celebration keynote speech a little over a year ago. Think deeply, be bold, and help others.

Hugo: I think that’s fantastic, Debbie, and what we’ll do is we’ll link to your Grace Hopper talk as well, because I think the way you explained in that talk all of these things, why it’s important to think deeply, be bold, and help others, which you’ve kind of gone through this talk as well, I think that talk can provide more context there also.

Debbie: Wonderful. This has been such an awesome conversation, Hugo. Thank you.

Hugo: Thank you so much, Debbie. It’s been an absolute pleasure having you on the show.

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.


The AI Black Box Explanation Problem

Introducing Black Box AI, a system for automated decision making often based on machine learning over big data, which maps a user’s features into a class predicting the behavioural traits of the individuals.


R vs Python for Data Visualization

This article demonstrates creating similar plots in R and Python using two of the most prominent data visualization packages on the market, namely ggplot2 and Seaborn.


Building a Raspberry Pi security camera with OpenCV

In this tutorial, you will learn how to build a Raspberry Pi security camera using OpenCV and computer vision. The Pi security camera will be IoT capable, making it possible for our Raspberry Pi to send TXT/MMS message notifications, images, and video clips when the security camera is triggered.

Back in my undergrad years, I had an obsession with hummus. Hummus and pita/vegetables were my lunch of choice.

I loved it.

I lived on it.

And I was very protective of my hummus — college kids are notorious for raiding each other’s fridges and stealing each other’s food. No one was to touch my hummus.

But — I was a victim of such hummus theft on more than one occasion…and I never forgot it!

I never figured out who stole my hummus, and even though my wife and I are the only ones who live in our house, I often hide the hummus in the back of the fridge (where no one will look) or under fruits and vegetables (which most people wouldn’t want to eat).

Of course, back then I wasn’t as familiar with computer vision and OpenCV as I am now. Had I known then what I know now, I would have built a Raspberry Pi security camera to capture the hummus heist in action!

Today I’m channeling my inner undergrad-self and laying to rest the chickpea bandit. And if he ever returns again, beware, my fridge is monitored!

To learn how to build a security camera with a Raspberry Pi and OpenCV, just keep reading!

Looking for the source code to this post?
Jump right to the downloads section.

Building a Raspberry Pi security camera with OpenCV

In the first part of this tutorial, we’ll briefly review how we are going to build an IoT-capable security camera with the Raspberry Pi.

Next, we’ll review our project/directory structure and install the libraries/packages to successfully build the project.

We’ll also briefly review both Amazon AWS/S3 and Twilio, two services that when used together will enable us to:

  1. Upload an image/video clip when the security camera is triggered.
  2. Send the image/video clip directly to our smartphone via text message.

From there we’ll implement the source code for the project.

And finally, we’ll put all the pieces together and put our Raspberry Pi security camera into action!

An IoT security camera with the Raspberry Pi

Figure 1: Raspberry Pi + Internet of Things (IoT). Our project today will use two cloud services: Twilio and AWS S3. Twilio is an SMS/MMS messaging service. S3 is a file storage service to help facilitate the video messages.

We’ll be building a very simple IoT security camera with the Raspberry Pi and OpenCV.

The security camera will be capable of recording a video clip when the camera is triggered, uploading the video clip to the cloud, and then sending a TXT/MMS message which includes the video itself.

We’ll be building this project specifically with the goal of detecting when a refrigerator is opened and when the fridge is closed — everything in between will be captured and recorded.

Therefore, this security camera will work best in an environment where there is a large difference in light between the “open” and “closed” states. For example, you could also deploy it inside a mailbox that opens and closes.

You can easily extend this method to work with other forms of detection, including simple motion detection and home surveillance, object detection, and more. I’ll leave that as an exercise for you, the reader, to implement — in that case, you can use this project as a “template” for implementing any additional computer vision functionality.
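As a starting point for that exercise, a bare-bones frame-differencing check might look like the following. This is an illustrative sketch only, not the post's code, and it uses NumPy arrays in place of real camera frames; the threshold names and values are assumptions you would tune for your own scene.

```python
import numpy as np

def motion_detected(prev_gray, curr_gray, pixel_thresh=25, frac_thresh=0.01):
    """Return True if enough pixels changed between two grayscale frames.

    pixel_thresh: per-pixel intensity change that counts as motion.
    frac_thresh:  fraction of changed pixels required to report motion.
    """
    diff = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
    changed = (diff > pixel_thresh).mean()
    return bool(changed > frac_thresh)

# Two synthetic 240x320 frames: the second has a bright "intruder" region.
prev_frame = np.zeros((240, 320), dtype=np.uint8)
curr_frame = prev_frame.copy()
curr_frame[100:140, 100:160] = 255  # simulated moving object (~3% of pixels)

print(motion_detected(prev_frame, curr_frame))  # True
```

With real camera input you would convert each frame to grayscale (e.g. with OpenCV) and blur it first to suppress sensor noise.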

Project structure

Go ahead and grab the “Downloads” for today’s blog post.

Once you’ve unzipped the files, you’ll be presented with the following directory structure:

$ tree --dirsfirst
├── config
│   └── config.json
├── pyimagesearch
│   ├── notifications
│   │   ├──
│   │   └──
│   ├── utils
│   │   ├──
│   │   └──
│   └──

4 directories, 7 files

Today we’ll be reviewing four files:

  • config/config.json: This commented JSON file holds our configuration. I’m providing you with this file, but you’ll need to insert your API keys for both Twilio and S3.
  • pyimagesearch/notifications/: Contains the class for sending SMS/MMS messages. This is the same exact class I use for sending text, picture, and video messages with Python inside my upcoming Raspberry Pi book.
  • pyimagesearch/utils/: The class responsible for loading the commented JSON configuration.
  • The driver script: The heart of today’s project is contained in this script. It watches for significant light change, starts recording video, and alerts me when someone steals my hummus or anything else I’m hiding in the fridge.

Now that we understand the directory structure and files therein, let’s move on to configuring our machine and learning about S3 + Twilio. From there, we’ll begin reviewing the four key files in today’s project.

Installing package/library prerequisites

Today’s project requires that you install a handful of Python libraries on your Raspberry Pi.

In my upcoming book, all of these packages will be preinstalled in a custom Raspbian image. All you’ll have to do is download the Raspbian .img file, flash it to your micro-SD card, and boot! From there you’ll have a pre-configured dev environment with all the computer vision + deep learning libraries you need!

Note: If you want my custom Raspbian images right now (with both OpenCV 3 and OpenCV 4), you should grab a copy of either the Quickstart Bundle or Hardcopy Bundle of Practical Python and OpenCV + Case Studies which includes the Raspbian .img file.

This introductory book will also teach you OpenCV fundamentals so that you can learn how to confidently build your own projects. These fundamentals and concepts will go a long way if you’re planning to grab my upcoming Raspberry Pi for Computer Vision book.

In the meantime, you can get by with this minimal installation of packages to replicate today’s project:

  • opencv-contrib-python: The OpenCV library.
  • imutils: My package of convenience functions and classes.
  • twilio: The Twilio package allows you to send text/picture/video messages.
  • boto3: Communicates with the Amazon S3 file storage service. Our videos will be stored in S3.
  • json-minify: Allows for commented JSON files (because we all love documentation!)

To install these packages, I recommend that you follow my pip install opencv guide to set up a Python virtual environment.

You can then pip install all required packages:

$ workon <env_name> # insert your environment name such as cv or py3cv4
$ pip install opencv-contrib-python
$ pip install imutils
$ pip install twilio
$ pip install boto3
$ pip install json-minify

Now that our environment is configured, each time you want to activate it, simply use the workon command shown above.
Let’s review S3, boto3, and Twilio!

What is Amazon AWS and S3?

Figure 2: Amazon’s Simple Storage Service (S3) will be used to store videos captured from our IoT Raspberry Pi. We will use the boto3 Python package to work with S3.

Amazon Web Services (AWS) has a service called Simple Storage Service, commonly known as S3.

The S3 service is highly popular for storing files. I actually use it to host some larger files such as GIFs on this blog.

Today we’ll be using S3 to host our video files generated by the Raspberry Pi Security camera.

S3 is organized by “buckets”. A bucket contains files and folders. It also can be set up with custom permissions and security settings.

A package called boto3 will help us to transfer the files from our Internet of Things Raspberry Pi to AWS S3.

Before we dive into boto3, we need to set up an S3 bucket.

Let’s go ahead and create a bucket, resource group, and user. We’ll give the resource group permissions to access the bucket and then we’ll add the user to the resource group.

Step #1: Create a bucket

Amazon has great documentation on how to create an S3 bucket here.

Step #2: Create a resource group + user. Add the user to the resource group.

After you create your bucket, you’ll need to create an IAM user + resource group and define permissions.

  • Visit the resource groups page to create a group. I named my example “s3pi”.
  • Visit the users page to create a user. I named my example “raspberrypisecurity”.

Step #3: Grab your access keys. You’ll need to paste them into today’s config file.

Watch these slides to walk you through Steps 1-3, but refer to the documentation as well because slides become out of date rapidly:

Figure 3: The steps to gain API access to Amazon S3. We’ll use boto3 along with the access keys in our Raspberry Pi IoT project.
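To give a flavor of what the upload step could look like, here is a hedged sketch built around boto3's upload_file call. The key-naming helper, bucket name, and returned URL format are illustrative assumptions, not the post's actual code, and the resulting URL is only reachable if your bucket permissions allow it.

```python
import datetime

def make_s3_key(prefix="fridge", ts=None):
    """Build a timestamped object key like 'fridge/2019-03-26-14-05-09.mp4'."""
    ts = ts or datetime.datetime.now()
    return "{}/{}.mp4".format(prefix, ts.strftime("%Y-%m-%d-%H-%M-%S"))

def upload_video(path, bucket, access_key, secret_key):
    """Upload a local video file to S3 and return a virtual-hosted-style URL."""
    import boto3  # imported lazily so make_s3_key works without boto3 installed
    s3 = boto3.client("s3",
                      aws_access_key_id=access_key,
                      aws_secret_access_key=secret_key)
    key = make_s3_key()
    s3.upload_file(path, bucket, key)
    # URL is only accessible if the bucket/object permissions permit it.
    return "https://{}.s3.amazonaws.com/{}".format(bucket, key)

# In the real project these values come from config.json:
# upload_video("output.mp4", "YOUR_AWS_S3_BUCKET", "KEY_ID", "SECRET")
print(make_s3_key(ts=datetime.datetime(2019, 3, 26, 14, 5, 9)))  # fridge/2019-03-26-14-05-09.mp4
```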

Obtaining your Twilio API keys

Figure 4: Twilio is a popular SMS/MMS platform with a great API.

Twilio, a phone number service with an API, allows for voice, SMS, MMS, and more.

Twilio will serve as the bridge between our Raspberry Pi and our cell phone. I want to know exactly when the chickpea bandit is opening my fridge so that I can take countermeasures.

Let’s set up Twilio now.

Step #1: Create an account and get a free number.

Go ahead and sign up for Twilio and you’ll be assigned a temporary trial number. You can purchase a number + quota later if you choose to do so.

Step #2: Grab your API keys.

Now we need to obtain our API keys. Here’s a screenshot showing where to create one and copy it:

Figure 5: The Twilio API keys are necessary to send text messages with Python.

A final note about Twilio: it also supports the popular WhatsApp messaging platform. Support for WhatsApp is welcomed by the international community; however, it is currently in beta. Today we’ll be demonstrating standard SMS/MMS only. I’ll leave it up to you to explore Twilio in conjunction with WhatsApp.
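As a rough illustration of the Twilio side, here is a sketch of sending an alert with the official twilio package. The message wording and helper names are my own assumptions, not the post's notification class; the config keys match the config.json values discussed in the next section.

```python
def build_alert(seconds_open, video_url=None):
    """Compose the SMS body for a fridge-open alert (wording is illustrative)."""
    msg = "Fridge alert: door was open for {} seconds.".format(seconds_open)
    if video_url is not None:
        msg += " Video: {}".format(video_url)
    return msg

def send_alert(cfg, body, media_url=None):
    """Send an SMS/MMS via Twilio. cfg is a dict of the config.json values."""
    from twilio.rest import Client  # lazy import; requires `pip install twilio`
    client = Client(cfg["twilio_sid"], cfg["twilio_auth"])
    kwargs = {"body": body, "from_": cfg["twilio_from"], "to": cfg["twilio_to"]}
    if media_url is not None:
        kwargs["media_url"] = [media_url]  # attaching media makes it an MMS
    return client.messages.create(**kwargs)

print(build_alert(75))  # Fridge alert: door was open for 75 seconds.
```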

Our JSON configuration file

There are a number of variables that need to be specified for this project, and instead of hardcoding them, I decided to keep our code more modular and organized by putting them in a dedicated JSON configuration file.

Since JSON doesn’t natively support comments, our Conf class will take advantage of JSON-minify to parse out the comments. If JSON isn’t your config file format of choice, you can try YAML or XML as well.

Let’s take a look at the commented JSON file now:

	// two constants, first threshold for detecting if the
	// refrigerator is open, and a second threshold for the number of
	// seconds the refrigerator is open
	"thresh": 50,
	"open_threshold_seconds": 60,

Lines 5 and 6 contain two settings. The first is the light threshold for determining when the refrigerator is open. The second is a threshold for the number of seconds until it is determined that someone left the door open.

Now let’s handle AWS + S3 configs:

// variables to store your aws account credentials
	"aws_access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
	"aws_secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
	"s3_bucket": "YOUR_AWS_S3_BUCKET",

Each of the values on Lines 9-11 is available in your AWS console (we just generated them in the “What is Amazon AWS and S3?” section above).

And finally our Twilio configs:

// variables to store your twilio account credentials
	"twilio_sid": "YOUR_TWILIO_SID",
	"twilio_auth": "YOUR_TWILIO_AUTH_ID",
	"twilio_to": "YOUR_PHONE_NUMBER",
	"twilio_from": "YOUR_TWILIO_PHONE_NUMBER"

Twilio security settings are on Lines 14 and 15. The twilio_from value must match one of your Twilio phone numbers. If you’re using the trial, you only have one number. If you use the wrong number, are out of quota, etc., Twilio will likely send an error message to your email address.

Phone numbers can be formatted like this in the U.S.:

	"+15555555555"

That is E.164 format: a plus sign, then the country code and the 10-digit number with no spaces or dashes.
Loading the JSON configuration file

Our configuration file includes comments (for documentation purposes), which unfortunately means we cannot use Python’s built-in json package directly, as it cannot load files with comments.

Instead, we’ll use a combination of JSON-minify and a custom Conf class to load our JSON file as a Python dictionary.

Let’s take a look at how to implement the Conf class now:

# import the necessary packages
from json_minify import json_minify
import json

class Conf:
	def __init__(self, confPath):
		# load and store the configuration and update the object's
		# dictionary
		conf = json.loads(json_minify(open(confPath).read()))
		self.__dict__.update(conf)

	def __getitem__(self, k):
		# return the value associated with the supplied key
		return self.__dict__.get(k, None)

This class is relatively straightforward. Notice that in the constructor, we use json_minify (Line 9) to parse out the comments prior to passing the file contents to json.loads. The parsed values are then stored in the object’s dictionary.

The __getitem__ method will grab any value from the configuration with dictionary syntax. In other words, we won’t call this method directly — rather, we’ll simply use dictionary syntax in Python to grab a value associated with a given key.

Uploading key video clips and sending them via text message

Once our security camera is triggered we’ll need methods to:

  • Upload the images/video to the cloud (since the Twilio API cannot directly serve “attachments”).
  • Utilize the Twilio API to actually send the text message.
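Because the upload and API call are slow network operations, they are pushed onto a background thread. That fire-and-forget pattern can be sketched in plain Python (the sleep stands in for network I/O; the names are illustrative):

```python
import time
from threading import Thread

results = []

def _send(msg):
    # stand-in for the slow network work (S3 upload + Twilio API call)
    time.sleep(0.1)
    results.append(msg)

def send(msg):
    # fire-and-forget: the caller returns immediately
    t = Thread(target=_send, args=(msg,))
    t.daemon = True
    t.start()
    return t

t = send("Intruder!")
t.join()  # only for this demo; the real caller never blocks
print(results)  # ['Intruder!']
```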

To keep our code neat and organized we’ll be encapsulating this functionality inside a class named TwilioNotifier — let’s review this class now:

# import the necessary packages
from twilio.rest import Client
import boto3
from threading import Thread

class TwilioNotifier:
	def __init__(self, conf):
		# store the configuration object
		self.conf = conf

	def send(self, msg, tempVideo):
		# start a thread to upload the file and send it
		t = Thread(target=self._send, args=(msg, tempVideo,))
		t.daemon = True
		t.start()

On Lines 2-4, we import the Twilio Client, Amazon’s boto3 package, and Python’s built-in Thread.

From there, our TwilioNotifier class and constructor are defined on Lines 6-9. Our constructor accepts a single parameter, the configuration, which we presume has been loaded from disk via the Conf class.

This project only demonstrates sending messages. We’ll be demonstrating receiving messages with Twilio in an upcoming blog post as well as in the Raspberry Pi Computer Vision book.


The send method is defined on Lines 11-14. This method accepts two key parameters:

  • The message string, msg.
  • The video file, tempVideo. Once the video is successfully stored in S3, it will be removed from the Pi to save space. Hence it is a temporary video.


The send method kicks off a Thread to actually send the message, ensuring the main thread of execution is not blocked.

Thus, the core text message sending logic is in the next method, _send:

	def _send(self, msg, tempVideo):
		# create a s3 client object
		s3 = boto3.client("s3",
			aws_access_key_id=self.conf["aws_access_key_id"],
			aws_secret_access_key=self.conf["aws_secret_access_key"])

		# get the filename and upload the video in public read mode
		filename = tempVideo.path[tempVideo.path.rfind("/") + 1:]
		s3.upload_file(tempVideo.path, self.conf["s3_bucket"],
			filename, ExtraArgs={"ACL": "public-read",
			"ContentType": "video/mp4"})


The _send method is defined on Line 16. It operates as an independent thread so as not to impact the driver script flow.

The parameters (msg and tempVideo) are passed in when the thread is launched.

The _send method first uploads the video to AWS S3 via:

  • Initializing the s3 client with the access key and secret access key (Lines 18-21).
  • Uploading the file (Lines 25-27).

Line 24 simply extracts the filename from the video path since we’ll need it later.
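As a quick aside, the slice-on-rfind idiom used to pull out the filename works like this (the path below is made up):

```python
import os

# a hypothetical temporary video path (the real one comes from TempFile)
path = "/tmp/storage/video_kxq42.mp4"

# everything after the last "/" is the filename
filename = path[path.rfind("/") + 1:]
print(filename)  # video_kxq42.mp4

# the standard-library equivalent
print(os.path.basename(path))  # video_kxq42.mp4
```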

Let’s go ahead and send the message:

# get the bucket location and build the url
		location = s3.get_bucket_location(
			Bucket=self.conf["s3_bucket"])["LocationConstraint"]
		url = "https://s3-{}.amazonaws.com/{}/{}".format(location,
			self.conf["s3_bucket"], filename)

		# initialize the twilio client and send the message
		client = Client(self.conf["twilio_sid"],
			self.conf["twilio_auth"])
		client.messages.create(to=self.conf["twilio_to"],
			from_=self.conf["twilio_from"], body=msg, media_url=url)

		# delete the temporary file
		tempVideo.cleanup()

To send the message and have the video show up in a cell phone messaging app, we need to send the actual text string along with a URL to the video file in S3.

Note: This must be a publicly accessible URL, so ensure that your S3 settings are correct.

The URL is generated on Lines 30-33.
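Assuming the region-style S3 URL scheme (the region, bucket, and filename below are all hypothetical), the string assembly looks like this:

```python
# hypothetical values: the region normally comes from get_bucket_location,
# the bucket from the config, and the filename from the uploaded video
location = "us-west-2"
bucket = "s3pi"
filename = "video_kxq42.mp4"

# region-style public S3 URL
url = "https://s3-{}.amazonaws.com/{}/{}".format(location, bucket, filename)
print(url)  # https://s3-us-west-2.amazonaws.com/s3pi/video_kxq42.mp4
```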

From there, we’ll create a Twilio Client (not to be confused with our boto3 s3 client) on Lines 36 and 37.

Lines 38 and 39 actually send the message. Notice the to, from_, body, and media_url keyword arguments.

Finally, we’ll remove the temporary video file to save some precious space (Line 42). If we don’t clean up, the Pi may eventually run out of space if your disk space is already low.

The Raspberry Pi security camera driver script

Now that we have (1) our configuration file, (2) a method to load the config, and (3) a class to interact with the S3 and Twilio APIs, let’s create the main driver script for the Raspberry Pi security camera.

The way this script works is relatively simple:

  • It monitors the average amount of light seen by the camera.
  • When the refrigerator door opens, the light comes on, the Pi detects the light, and the Pi starts recording.
  • When the refrigerator door is closed, the light turns off, the Pi detects the absence of light, and the Pi stops recording + sends me or you a video message.
  • If someone leaves the refrigerator open for longer than the specified seconds in the config file, I’ll receive a separate text message indicating that the door was left open.

Let’s go ahead and implement these features.
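Before we do, the core open/close logic above can be dry-run with simulated brightness values (the numbers here are invented; a real frame’s mean comes from the camera):

```python
# simulated per-frame brightness means: dark (door closed), bright
# (door open), then dark again
means = [10, 12, 160, 170, 165, 158, 11, 9]
THRESH = 50

events = []
fridgeOpen = False
for mean in means:
    # remember the previous state before computing the current one
    fridgePrevOpen = fridgeOpen
    fridgeOpen = mean > THRESH

    if fridgeOpen and not fridgePrevOpen:
        events.append("start_recording")
    elif fridgePrevOpen and not fridgeOpen:
        events.append("stop_and_notify")

print(events)  # ['start_recording', 'stop_and_notify']
```

Only the open-to-closed and closed-to-open transitions trigger actions; frames where the state is unchanged do nothing.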

Open up the driver script file and insert the following code:

# import the necessary packages
from __future__ import print_function
from pyimagesearch.notifications import TwilioNotifier
from pyimagesearch.utils import Conf
from imutils.video import VideoStream
from pyimagesearch.utils import TempFile
from datetime import datetime
from datetime import date
import numpy as np
import argparse
import imutils
import signal
import time
import cv2
import sys

Lines 2-15 import our necessary packages. Notably, we’ll be using our TwilioNotifier and Conf classes, imutils, and OpenCV.

Let’s define an interrupt signal handler and parse for our config file path argument:

# function to handle keyboard interrupt
def signal_handler(sig, frame):
	print("[INFO] You pressed `ctrl + c`! Closing refrigerator monitor" \
		" application...")
	sys.exit(0)

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-c", "--conf", required=True, 
	help="Path to the input configuration file")
args = vars(ap.parse_args())

Our script will run headless because we don’t need an HDMI screen inside the fridge.

On Lines 18-21, we define a signal_handler function to capture “ctrl + c” events from the keyboard gracefully. It isn’t always necessary to do this, but if you need anything to execute before the script exits (such as someone disabling your security camera!), you can put it in this function.
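A minimal, self-contained version of the signal-trapping pattern looks like this:

```python
import signal
import sys

def signal_handler(sig, frame):
    # any cleanup (releasing the writer, stopping the stream) goes here
    print("[INFO] interrupt received, exiting...")
    sys.exit(0)

# register the handler for ctrl + c (SIGINT)
signal.signal(signal.SIGINT, signal_handler)
print(signal.getsignal(signal.SIGINT) is signal_handler)  # True
```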

We have a single command line argument to parse. The --conf flag (the path to the config file) can be provided directly in the terminal or in a launch-on-reboot script. You may learn more about command line arguments here.
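If you want to poke at the parser in isolation, you can feed argparse a hypothetical argument list instead of the real sys.argv:

```python
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-c", "--conf", required=True,
    help="Path to the input configuration file")

# parse a hypothetical argument list instead of the real command line
args = vars(ap.parse_args(["--conf", "config/config.json"]))
print(args["conf"])  # config/config.json
```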

Let’s perform our initializations:

# load the configuration file and initialize the Twilio notifier
conf = Conf(args["conf"])
tn = TwilioNotifier(conf)

# initialize the flags for fridge open and notification sent
fridgeOpen = False
notifSent = False

# initialize the video stream and allow the camera sensor to warmup
print("[INFO] warming up camera...")
# vs = VideoStream(src=0).start()
vs = VideoStream(usePiCamera=True).start()

# signal trap to handle keyboard interrupt
signal.signal(signal.SIGINT, signal_handler)
print("[INFO] Press `ctrl + c` to exit, or 'q' to quit if you have" \
	" the display option on...")

# initialize the video writer and the frame dimensions (we'll set
# them as soon as we read the first frame from the video)
writer = None
W = None
H = None

Our initializations take place on Lines 30-52. Let’s review them:

  • Lines 30 and 31 instantiate our Conf object and TwilioNotifier.
  • Two status variables are initialized to determine when the fridge is open and when a notification has been sent (Lines 34 and 35).
  • We’ll start our VideoStream on Lines 39-41. I’ve elected to use a PiCamera, so Line 39 (USB webcam) is commented out. You can easily swap these if you are using a USB webcam.
  • Line 44 registers our signal_handler to trap keyboard interrupts.
  • Our video writer and frame dimensions are initialized on Lines 50-52.

It’s time to begin looping over frames:

# loop over the frames of the stream
while True:
	# grab both the next frame from the stream and the previous
	# refrigerator status
	frame =
	fridgePrevOpen = fridgeOpen

	# quit if there was a problem grabbing a frame
	if frame is None:
		break

	# resize the frame and convert the frame to grayscale
	frame = imutils.resize(frame, width=200)
	gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
	# if the frame dimensions are empty, set them
	if W is None or H is None:
		(H, W) = frame.shape[:2]


The while loop begins on Line 55. We proceed to grab the next frame from our video stream (Line 58). The frame undergoes a sanity check on Lines 62 and 63 to determine if we have a legitimate image from our camera.

Line 59 sets our fridgePrevOpen flag. The previous value must always be set at the beginning of the loop since it is based on the current value, which is determined later.

The frame is resized to a dimension that will look reasonable on a smartphone and also make for a smaller file size for our MMS video (Line 66).

On Line 67, we create a grayscale image from the frame — we’ll need this soon to determine the average amount of light in the frame.

Our dimensions are set via Lines 70 and 71 during the first iteration of the loop.

Now let’s determine if the refrigerator is open:

# calculate the average of all pixels where a higher mean
	# indicates that there is more light coming into the refrigerator
	mean = np.mean(gray)

	# determine if the refrigerator is currently open
	fridgeOpen = mean > conf["thresh"]

Determining if the refrigerator is open is a dead-simple, two-step process:

  1. Average all pixel intensities of our grayscale image (Line 75).
  2. Compare the average to the threshold value in our configuration (Line 78). I’m confident that a value of 50 (the thresh setting in the config file) will be an appropriate threshold for most refrigerators with a light that turns on and off as the door is opened and closed. That said, you may want to experiment with tweaking that value yourself.

The fridgeOpen variable is simply a boolean indicating whether the refrigerator is open or not.
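Averaging pixel intensities is just an arithmetic mean. With two made-up 3x3 grayscale patches (plain lists standing in for NumPy arrays), the idea looks like this:

```python
# two made-up 3x3 grayscale patches (0 = black, 255 = white); np.mean
# over a real frame reduces to the same plain average
dark = [[12, 8, 10], [9, 14, 11], [10, 7, 13]]
bright = [[180, 200, 190], [210, 170, 195], [185, 205, 178]]

def is_fridge_open(gray, thresh=50):
    # flatten the rows and compare the mean intensity to the threshold
    pixels = [p for row in gray for p in row]
    return sum(pixels) / len(pixels) > thresh

print(is_fridge_open(dark))    # False
print(is_fridge_open(bright))  # True
```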

Let’s now determine if we need to start capturing a video:

# if the fridge is open and previously it was closed, it means
	# the fridge has been just opened
	if fridgeOpen and not fridgePrevOpen:
		# record the start time
		startTime =

		# create a temporary video file and initialize the video
		# writer object
		tempVideo = TempFile(ext=".mp4")
		writer = cv2.VideoWriter(tempVideo.path, 0x21, 30, (W, H),
			True)

As shown by the conditional on Line 82, so long as the refrigerator was just opened (i.e., it was not previously open), we will initialize our video writer.

We’ll go ahead and grab the startTime, create a TempFile, and initialize our video writer with the temporary file path (Lines 84-90).

Now we’ll handle the case where the refrigerator was previously open:

# if the fridge is open then there are 2 possibilities,
	# 1) it's left open for more than the *threshold* seconds. 
	# 2) it's closed in less than or equal to the *threshold* seconds.
	elif fridgePrevOpen:
		# calculate the time different between the current time and
		# start time
		timeDiff = ( - startTime).seconds

		# if the fridge is open and the time difference is greater
		# than threshold, then send a notification
		if fridgeOpen and timeDiff > conf["open_threshold_seconds"]:
			# if a notification has not been sent yet, then send a 
			# notification
			if not notifSent:
				# build the message and send a notification
				msg = "Intruder has left your fridge open!!!"

				# release the video writer pointer and reset the
				# writer object
				writer.release()
				writer = None
				# send the message and the video to the owner and
				# set the notification sent flag
				tn.send(msg, tempVideo)
				notifSent = True

If the refrigerator was previously open, let’s check to ensure it wasn’t left open long enough to trigger an “Intruder has left your fridge open!” alert.

Kids can leave the refrigerator open by accident, or maybe after a holiday, you have a lot of food preventing the refrigerator door from closing all the way. You don’t want your food to spoil, so you may want these alerts!

For this message to be sent, the timeDiff must be greater than the threshold set in the config (Lines 98-102).

This message and the video will be sent to you, as shown on Lines 107-117. The msg is defined, the writer is released, and the notifSent flag is set.

Let’s now take care of the most common scenario where the refrigerator was previously open, but now it is closed (i.e. some thief stole your food, or maybe it was you when you became hungry):

# check to see if the fridge is closed
		elif not fridgeOpen:
			# if a notification has already been sent, then just set 
			# the notifSent to false for the next iteration
			if notifSent:
				notifSent = False

			# if a notification has not been sent, then send a 
			# notification
			else:
				# record the end time and calculate the total time in
				# seconds
				endTime =
				totalSeconds = (endTime - startTime).seconds
				dateOpened ="%A, %B %d %Y")

				# build the message and send a notification
				msg = "Your fridge was opened on {} at {} " \
					"for {} seconds.".format(dateOpened,
					startTime.strftime("%I:%M%p"), totalSeconds)

				# release the video writer pointer and reset the
				# writer object
				writer.release()
				writer = None

				# send the message and the video to the owner
				tn.send(msg, tempVideo)

The case beginning on Line 120 will send a video message indicating, “Your fridge was opened on {{ day }} at {{ time }} for {{ seconds }}.”

On Lines 123 and 124, our notifSent flag is reset if needed. If the notification was already sent, we set this value to False, effectively resetting it for the next iteration of the loop.

Otherwise, if the notification has not been sent, we’ll calculate the totalSeconds the refrigerator was open (Lines 131 and 132). We’ll also record the date the door was opened (Line 133).


The msg string is populated with these values (Lines 136-138).
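Here’s the same message-building logic run on hypothetical timestamps:

```python
from datetime import datetime

# hypothetical open/close timestamps
startTime = datetime(2019, 3, 26, 14, 5, 0)
endTime = datetime(2019, 3, 26, 14, 5, 42)

totalSeconds = (endTime - startTime).seconds
dateOpened = startTime.strftime("%A, %B %d %Y")

# same format string as the driver script's message
msg = "Your fridge was opened on {} at {} " \
    "for {} seconds.".format(dateOpened,
    startTime.strftime("%I:%M%p"), totalSeconds)
print(msg)
```

The result reads along the lines of "Your fridge was opened on Tuesday, March 26 2019 at 02:05PM for 42 seconds."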

Then the video writer is released, and the message and video are sent (Lines 142-147).

Our final block finishes out the loop and performs cleanup:

# check to see if we should write the frame to disk
	if writer is not None:
		writer.write(frame)

# check to see if we need to release the video writer pointer
if writer is not None:
	writer.release()

# cleanup the camera and close any open windows
cv2.destroyAllWindows()
vs.stop()
To finish the loop, we’ll write the frame to the video writer object and then go back to the top to grab the next frame.

When the loop exits, the writer is released, and the video stream is stopped.

Great job! You made it through a simple IoT project using a Raspberry Pi and camera.

It’s now time to place the bait. I know my thief likes hummus as much as I do, so I ran to the store and came back to put it in the fridge.

RPi security camera results

Figure 6: My refrigerator is armed with an Internet of Things (IoT) Raspberry Pi, PiCamera, and Battery Pack. And of course, I’ve placed some hummus in there for me and the thief. I’ll also know if someone takes a New Belgium Dayblazer beer of mine.

When deploying the Raspberry Pi security camera in your refrigerator to catch the hummus bandit, you’ll need to ensure that it will continue to run without a wireless connection to your laptop.

There are two great options for deployment:

  1. Run the computer vision Python script on reboot.
  2. Leave a screen session running with the Python computer vision script executing within.

Be sure to visit the first link if you just want your Pi to run the script when you plug in power.

While this blog post isn’t the right place for a full screen demo, here are the basics:

  • Install screen via:
    sudo apt-get install screen
  • Open an SSH connection to your Pi and run screen to start a session.
  • If the connection from your laptop to your Pi ever dies or is closed, don’t panic! The screen session is still running. You can reconnect by SSH’ing into the Pi again and then running screen -r. You’ll be back in your virtual window.
  • Keyboard shortcuts for screen:
    • “ctrl + a, c”: Creates a new “window”.
    • “ctrl + a, p” and “ctrl + a, n”: Cycles through “previous” and “next” windows, respectively.
  • For a more in-depth review of screen, see the documentation. Here’s a screen keyboard shortcut cheat sheet.

Once you’re comfortable with starting a script on reboot or working with screen, grab a USB battery pack that can source enough current. Shown in Figure 6, we’re using a RavPower 2200mAh battery pack connected to the Pi power input. The product specs claim to charge an iPhone 6+ times, and it seems to run a Raspberry Pi for roughly 10 hours (depending on the algorithm) as well.

Go ahead and plug in the battery pack, connect, and deploy the script (if you didn’t set it up to start on boot).

The commands are:

$ screen
# wait for screen to start
$ source ~/.profile
$ workon <env_name> # insert the name of your virtual environment
$ python --conf config/config.json # substitute the filename of your driver script

If you aren’t familiar with command line arguments, please read this tutorial. The command line argument is also required if you are deploying the script upon reboot.

Let’s see it in action!

Figure 7: Me testing the Pi Security Camera notifications with my iPhone.

I’ve included a full demo of the Raspberry Pi security camera below:

Interested in building more projects with the Raspberry Pi, OpenCV, and computer vision?

Figure 8: Catching a furry little raccoon with an infrared light/camera connected to the Raspberry Pi.

Are you interested in using your Raspberry Pi to build practical, real-world computer vision and deep learning applications, including:

  • Computer vision and IoT projects on the Pi
  • Servos, PID, and controlling the Pi with computer vision
  • Human activity, home surveillance, and facial applications
  • Deep learning on the Raspberry Pi
  • Fast, efficient deep learning with the Movidius NCS and OpenVINO toolkit
  • Self-driving car applications on the Raspberry Pi
  • Tips, suggestions, and best practices when performing computer vision and deep learning with the Raspberry Pi

If so, you’ll definitely want to check out my upcoming book, Raspberry Pi for Computer Vision. To learn more about the book (including release date information), just click the link below and enter your email address:

From there I’ll ensure you’re kept in the know on the RPi + Computer Vision book, including updates, behind the scenes looks, and release date information.


In this tutorial, you learned how to build a Raspberry Pi security camera from scratch using OpenCV and computer vision.

Specifically, you learned how to:

  • Access the Raspberry Pi camera module or USB webcam.
  • Setup your Amazon AWS/S3 account so you can upload images/video when your security camera is triggered (other services such as Dropbox, Box, Google Drive, etc. will work as well, provided you can obtain a public-facing URL of the media).
  • Obtain Twilio API keys used to send text messages with the uploaded images/video.
  • Create a Raspberry Pi security camera using OpenCV and computer vision.

Finally, we put all the pieces together and deployed the security camera to monitor a refrigerator:

  • Each time the door was opened we started recording
  • After the door was closed the recording stopped
  • The recording was then uploaded to the cloud
  • And finally, a text message was sent to our phone showing the activity

You can extend the security camera to include other components as well. My first suggestion would be to take a look at how to build a home surveillance system using a Raspberry Pi where we use a more advanced motion detection technique. It would be fun to implement Twilio SMS/MMS notifications into the home surveillance project as well.

I hope you enjoyed this tutorial!

To download the source code to this post, and be notified when future tutorials are published here on PyImageSearch, just enter your email address in the form below!



The post Building a Raspberry Pi security camera with OpenCV appeared first on PyImageSearch.

Continue Reading…


Read More

Feature Reduction using Genetic Algorithm with Python

This tutorial discusses how to use the genetic algorithm (GA) for reducing the feature vector extracted from the Fruits360 dataset in Python mainly using NumPy and Sklearn.

Continue Reading…


Read More

Mister P for surveys in epidemiology — using Stan!

Jon Zelner points us to this new article in the American Journal of Epidemiology, “Multilevel Regression and Poststratification: A Modelling Approach to Estimating Population Quantities From Highly Selected Survey Samples,” by Marnie Downes, Lyle Gurrin, Dallas English, Jane Pirkis, Dianne Currier, Matthew Spittal, and John Carlin, which begins:

Large-scale population health studies face increasing difficulties in recruiting representative samples of participants. Non-participation, item non-response and attrition, when follow-up is involved, often result in highly selected samples even in well-designed studies. We aimed to assess the potential value of multilevel regression and poststratification, a method previously used to successfully forecast US presidential election results, for addressing biases due to non-participation in the estimation of population descriptive quantities in large cohort studies. The investigation was performed as an extensive case study using a large national health survey of Australian males, the Ten to Men study. Analyses were performed in the Bayesian computational package RStan. Results showed greater consistency and precision across population subsets of varying sizes, when compared with estimates obtained using conventional survey sampling weights. Estimates for smaller population subsets exhibited a greater degree of shrinkage towards the national estimate. Multilevel regression and poststratification provides a promising analytic approach to addressing potential participation bias in the estimation of population descriptive quantities from large-scale health surveys and cohort studies.

It makes me so happy to see our methods used in new problems like this!

I’ve been dealing with all sorts of crap during the past week or so, so it’s good to be reminded of how our work can make a difference.

Continue Reading…


Read More

quantmod_0.4-14 on CRAN

(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)

I just pushed a new release of quantmod to CRAN! I’m most excited about the update to getSymbols() so it doesn’t throw an error and stop processing if there’s a problem with one ticker symbol. Now getSymbols() will import all the data it can, and provide an informative error message for any ticker symbols it could not import.

Another cool feature is that getQuote() can now import quotes from Tiingo. But don’t thank me; thank Ethan Smith for the feature request [#247] and pull request [#250].

There are also several bug fixes in this release. The most noticeable are fixes to getDividends() and getSplits(). Yahoo! Finance continues to have stability issues. getDividends() now returns raw dividends instead of split-adjusted dividends (thanks to Douglas Barnard for the report [#253]), and getSplits() now returns the actual split adjustment ratio instead of the inverse (e.g. now 1/2 instead of 2/1). I suggest using a different data provider. See my post: Yahoo! Finance Alternatives for some suggestions.

See the news file for the other bug fixes. Please let me know what you think about these changes.  I need your feedback and input to make quantmod even better!


Continue Reading…


Read More

Four short links: 25 March 2019

Hiring for Neurodiversity, Reprogrammable Molecular Computing, Retro UUCP, and Industrial Go

  1. Dell's Neurodiversity Program -- excellent work from Dell making themselves an attractive destination for folks on the autistic spectrum.
  2. Reprogrammable Molecular Computing System (Caltech) -- The researchers were able to experimentally demonstrate 6-bit molecular algorithms for a diverse set of tasks. In mathematics, their circuits tested inputs to assess if they were multiples of three, performed equality checks, and counted to 63. Other circuits drew "pictures" on the DNA "scarves," such as a zigzag, a double helix, and irregularly spaced diamonds. Probabilistic behaviors were also demonstrated, including random walks as well as a clever algorithm (originally developed by computer pioneer John von Neumann) for obtaining a fair 50/50 random choice from a biased coin. Paper.
  3. Dataforge UUCP -- it's like Cory Doctorow guest-wrote our timeline: UUCP over SSH to give decentralized comms for freedom fighters.
  4. Go for Industrial Programming (Peter Bourgon) -- I’m speaking today about programming in an industrial context. By that I mean: in a startup or corporate environment; within a team where engineers come and go; on code that outlives any single engineer; and serving highly mutable business requirements. [...] I’ve tried to select for areas that have routinely tripped up new and intermediate Gophers in organizations I’ve been a part of, and particularly those things that may have nonobvious or subtle implications. (via ceej)

Continue reading Four short links: 25 March 2019.

Continue Reading…


Read More

Data for 200M traffic stop records

The Stanford Open Policing Project just released a dataset for police traffic stops across the country:

Currently, a comprehensive, national repository detailing interactions between police and the public doesn’t exist. That’s why the Stanford Open Policing Project is collecting and standardizing data on vehicle and pedestrian stops from law enforcement departments across the country — and we’re making that information freely available. We’ve already gathered over 200 million records from dozens of state and local police departments across the country.

You can download the data as CSV or RDS, and there are fields for stop date, stop time, location, driver demographics, and reasons for the stop. As you might imagine, the data from various municipalities comes at varying degrees of detail and timespans. I imagine there’s a lot to learn here both from the data and from working with the data.


Continue Reading…


Read More

Distilled News

Evaluating Machine Learning Models Fairness and Bias.

Evaluating machine learning models for bias is becoming an increasingly common focus for different industries and data researchers. Model Fairness is a relatively new subfield in Machine Learning. In the past, the study of discrimination emerged from analyzing human-driven decisions and the rationale behind those decisions. Since we started to rely on predictive ML models to make decisions for different industries such as insurance and banking, we need to implement strategies to ensure the fairness of those models and detect any discriminative behaviour during predictions.

Generating Synthetic Classification Data using Scikit

Data generators help us create data with different distributions and profiles to experiment on. If you are testing various algorithms available to you and you want to find which one works in what cases, then these data generators can help you generate case-specific data and then test the algorithm. For example, you want to check whether gradient boosting trees can do well given just 100 data points and 2 features? Now either you can search for a 100-data-point dataset, or you can use your own dataset that you are working on. But how would you know if the classifier was a good choice, given that you have so little data, and doing cross validation and testing still leaves a fair chance of overfitting? Or rather, you could use generated data and see what usually works well for such a case, a boosting algorithm or a linear model.

Speed up predictions on low-power devices using Neural Compute Stick and OpenVINO

The Neural Compute Stick, by Intel, is able to accelerate Tensorflow neural network inferences on the edge, improving performances by 10x factor.
In this article, we will explore the procedure required to:
1. Convert a Tensorflow model to NCS compatible one, using OpenVINO Toolkit by Intel
2. Install a light version of OpenVINO on Raspberry, to run inferences onboard
3. Test and deploy the converted model on Raspberry

The analytics translator: The Must-Have Role for AI-Driven Organizations.

» Many organizations have not seen return on their investment after developing their data and AI capabilities.
» It’s imperative to account for all of the phases of an AI solution life-cycle. Find the right business problems to solve in the Ideation phase, discover if there is a viable business model during an Experimentation phase, and scale up in an Industrialization phase.
» Actively involving the business in every step of the process and putting them in the driver’s seat is a critical element to success with data and AI.
» The analytics translator enables the execution of your company’s AI strategy by finding the right use cases, liaising between business and data experts, and embedding AI solutions into your organization.
» To be successful, an analytics translator needs deep business understanding.

Towards Automatic Text Summarization: Extractive Methods

For those who have done academic writing, summarization – the task of producing a concise and fluent summary while preserving key information content and overall meaning – was, if not a nightmare, then a constant challenge close to guesswork: detecting what the professor would find important. Though the basic idea looks simple – find the gist, cut off all opinions and detail, and write a couple of perfect sentences – the task inevitably ended up in toil and turmoil.

Clean a complex dataset for modelling with recommendation algorithms

Recently I wanted to learn something new and challenged myself to carry out an end-to-end Market Basket Analysis. To continue to challenge myself, I’ve decided to put the results of my efforts before the eyes of the data science community. And what better forum for my first ever series of posts than one of my favourite data science blogs!

Unit Tests in R

I am collecting here some notes on testing in R. There seems to be a general (false) impression among non-R-core developers that to run tests, R package developers need a test management system such as RUnit or testthat, and a further false impression that testthat is the only R test management system. This is in fact not true, as R itself has a capable testing facility in ‘R CMD check’ (a command triggering R checks from outside of any given integrated development environment).

How to Automatically Determine the Number of Clusters in your Data – and more

Determining the number of clusters when performing unsupervised clustering is a tricky problem. Many data sets don’t exhibit well-separated clusters, and two human beings asked to visually tell the number of clusters by looking at a chart are likely to provide two different answers. Sometimes clusters overlap, and large clusters contain sub-clusters, making the decision difficult. For instance, how many clusters do you see in the picture below? What is the optimal number of clusters? No one can tell with certainty: not AI, not a human being, not an algorithm.
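The article's own method isn't shown in this excerpt; as a rough illustration of one common heuristic it contrasts with, the "elbow" method, here is a hedged sketch with a naive k-means (the data and all names are invented): within-cluster inertia drops sharply until k reaches the true number of clusters, then flattens.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Naive Lloyd's algorithm, enough for a small demo."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest center, then move the centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    inertia = ((X - centers[labels]) ** 2).sum()
    return labels, inertia

def best_inertia(X, k, restarts=10):
    # Keep the best of several random initializations
    return min(kmeans(X, k, seed=s)[1] for s in range(restarts))

# Three well-separated Gaussian blobs, so the "true" answer is 3
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in ((0.0, 0.0), (5.0, 5.0), (0.0, 5.0))])

# Inertia as a function of k; the sharp-drop-then-flatten point is the elbow
inertias = {k: best_inertia(X, k) for k in range(1, 7)}
```

On data without well-separated clusters, the curve flattens gradually and the elbow is ambiguous, which is exactly the difficulty the article describes.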

Visually explore Probability Distributions with vistributions

We are happy to introduce the vistributions package, a set of tools for visually exploring probability distributions.

Zotero hacks: unlimited synced storage and its smooth use with rmarkdown

Here is a slightly refreshed translation of my 2015 blog post, initially published on a Russian blog platform. The post shows how to organize a personal academic library of unlimited size for free. This is a funny case of a self-written manual that I came back to multiple times myself, and referred my friends to many more times, even non-Russian speakers who had to use Google Translate and infer the rest from the screenshots. Finally, I decided to translate it, adding some basic information on how to use Zotero with rmarkdown.

Variance decomposition and price segmentation in Insurance


On the poor performance of classifiers in insurance models

Each time we have a case study in my actuarial courses (with real data), students are surprised to have a hard time getting a ‘good’ model, and they are always surprised to get a low AUC when trying to model the probability of claiming a loss, dying, committing fraud, etc. And each time, I keep saying, ‘yes, I know, and that’s what we expect, because there is a lot of randomness in insurance’. To be more specific, I decided to run some simulations and compute AUCs to see what’s going on. And because I don’t want to waste time fitting models, we will assume each time that we have a perfect model. I want to show that the upper bound of the AUC is actually quite low! So it’s not a modeling issue, it is a fundamental issue in insurance!
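A minimal numpy simulation in the spirit the author describes (the distributions and sample size here are my own invented choices, not the post's): score each policyholder with the true claim probability, which is the best any model could possibly do, and the AUC still lands far below 1 because the outcomes themselves are Bernoulli-random.

```python
import numpy as np

def auc(scores, y):
    """AUC via the Mann-Whitney statistic: the probability that a random
    positive case outranks a random negative case."""
    pos, neg = scores[y == 1], scores[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
n = 5000
# True claim probabilities, low on average as in insurance (invented choice)
p = rng.uniform(0.0, 0.2, size=n)
# Actual outcomes are random draws even when p is known exactly
y = rng.binomial(1, p)

# The "perfect model" scores each policyholder with the true probability
perfect_auc = auc(p, y)
print(perfect_auc)
```

Even with the true probabilities as scores, the AUC sits well below 1 here, which is the point: a low AUC in insurance need not signal a bad model.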

5 Amazing Deep Learning Frameworks Every Data Scientist Must Know! (with Illustrated Infographic)

Table of Contents
1. What is a Deep Learning Framework?
2. TensorFlow
3. Keras
4. PyTorch
5. Caffe
6. Deeplearning4j
7. Comparing these Deep Learning Frameworks

Better Parallelization with Numba

Based on a geocoordinate problem posed on stackoverflow, I implemented solutions utilizing Numba: 500x faster on multiple cores, 7500x faster on GPU (RTX 2070)
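The post's actual code isn't reproduced here; as a hedged, numpy-only sketch of the kind of geocoordinate kernel involved (the haversine great-circle distance), which Numba's @njit(parallel=True) could then compile and parallelize across cores:

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in km between arrays of coordinates (degrees)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# Distance from Paris to Berlin, roughly 878 km
d = haversine(np.array([48.8566]), np.array([2.3522]),
              np.array([52.52]), np.array([13.405]))
```

Because the function is already expressed as elementwise array math, decorating a scalar version of it with Numba's @njit and looping in a prange is a natural next step for the speedups the post reports.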

Why Does Artificial Intelligence Need to Breathe on Blockchain?

Consider placing an AI bot on a blockchain and initiating a phase of deep learning: what would the end result be? Would it be detrimental to the survival of the human race, or would it lead to a never-ending loop that removes third parties from transactions, making it easier for everyone to procure goods and services? In theory, the combination of blockchain and AI fuses into a foundation that can foster change in current methods of transacting. Its adoption rate is sluggish, though; the determinants of that rate have more to do with human adaptability within the financial culture, along with the complexity of the conventional way of transacting.

Continue Reading…


Read More

Document worth reading: “Learning Deep Representations for Semantic Image Parsing: a Comprehensive Overview”

Semantic image parsing, which refers to the process of decomposing images into semantic regions and constructing the structure representation of the input, has recently aroused widespread interest in the field of computer vision. The recent application of deep representation learning has driven this field into a new stage of development. In this paper, we summarize three aspects of the progress of research on semantic image parsing, i.e., category-level semantic segmentation, instance-level semantic segmentation, and beyond segmentation. Specifically, we first review the general frameworks for each task and introduce the relevant variants. The advantages and limitations of each method are also discussed. Moreover, we present a comprehensive comparison of different benchmark datasets and evaluation metrics. Finally, we explore the future trends and challenges of semantic image parsing. Learning Deep Representations for Semantic Image Parsing: a Comprehensive Overview

Continue Reading…


Read More

Book Memo: “Hands-On Unsupervised Learning Using Python”

How to Build Applied Machine Learning Solutions from Unlabeled Data
Many industry experts consider unsupervised learning the next AI frontier, one that may hold the key to general artificial intelligence. Armed with the conceptual knowledge in this book, data scientists and machine learning practitioners will learn hands-on how to apply unsupervised learning to large unlabeled datasets using Python tools. You’ll uncover hidden patterns, gain deeper business insight, detect anomalies, perform automatic feature engineering and selection, and generate synthetic datasets. Author Ankur Patel-an applied machine-learning researcher and data scientist with expertise in financial markets-provides the concepts, intuition, and tools necessary for you to apply this technology to problems you tackle every day. Through the course of this book, you’ll learn how to build production-ready systems with Python.

Continue Reading…


Read More

Play with the cyphr package

(This article was first published on Shige's Research Blog, and kindly contributed to R-bloggers)

The cyphr package seems to provide a good choice for a small research group that shares sensitive data over the internet (e.g., Dropbox). I did a simple experiment myself and made sure it can actually serve my purpose.

I did my experiment on two computers (using openssl): I created the test data on my Linux workstation running Manjaro, then tried to access the data on a Windows 7 laptop.

For creating the data (Linux workstation):


# Create the test data directory
data_dir <- file.path("~/Dropbox/temp_files", "data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Initialise the directory for encrypted data (admin step)
cyphr::data_admin_init(data_dir)

# Encrypt the test data
key <- cyphr::data_key(data_dir)

filename <- file.path(data_dir, "iris.rds")

cyphr::encrypt(saveRDS(iris, filename), key)

# The file cannot be read without decrypting it
# readRDS(filename)  # fails

# Read the decrypted version of the data
head(cyphr::decrypt(readRDS(filename), key))

For accessing the data (Windows laptop):


# Trying to get the key fails at first, because this machine is not yet authorised
key <- data_key("C:/Users/Ssong/Dropbox/temp_files/data", path_user = "C:/Users/Ssong/.ssh")

# Make a data access request
data_request_access("C:/Users/Ssong/Dropbox/temp_files/data",
                    path_user = "C:/Users/Ssong/.ssh")

On Windows 7, the system cannot locate the public key located in "~/.ssh", which is pretty dumb.

Going back to the Linux workstation to approve the data access request:

# Review the request and approve (to share with other users)
req <- data_admin_list_requests(data_dir)
data_admin_authorise(data_dir, yes = TRUE)

Now I can access the data on my Windows laptop:

key <- data_key("C:/Users/Ssong/Dropbox/temp_files/data", path_user = "C:/Users/Ssong/.ssh")

d <- decrypt(readRDS("C:/Users/Ssong/Dropbox/temp_files/data/iris.rds"), key)


Continue Reading…


Read More

Jonathan (another one) does Veronica Geng does Robert Mueller

Frequent commenter Jonathan (another one) writes:

I realize that so many people bitch about the seminar showdown that you might need at least one thank you. This year, I managed to re-read the bulk of Geng, and for that I thank you. I have not yet read any Sattouf, but it clearly has made an impression on you, so it’s on my list.

In thanks, here is my first brief foray into pseudo-Gengiana. I think I’ve got the tone roughly right, but I’m way short on whimsy; this is what I managed in a sustained fifteen-minute effort. Thanks again.

My fellow Americans:

As you are no doubt aware, I have completed my investigation and report. I write this to inform you of an unfortunate mishap from Friday. Many news outlets have reported that my final report was taken by a security guard from my offices to the Justice Department. That is not true. In an attempt to maintain my obsessive secrecy, that was a dummy report, actually containing the text of an unpublished novel by David Foster Wallace that we found in Michael Cohen’s safe. We couldn’t understand it—maybe Bill Barr will have better luck.

The real one was handed to my intern, Jeff, in an ordinary interoffice envelope, and Jeff was told to drop it off at Justice on his way home. He lives nearby with six other interns. Not knowing what he had, he stopped off at the Friday Trivia Happy Hour at the Death and Taxes Pub, drank a little too much, and left the report there. We’ve gone back to look and nobody can find it.

So why not just print out another one? Or for that matter, why didn’t I just email the first report? As you’ve no doubt gleaned by now, computers and email aren’t my thing. As my successor at the FBI, Mr. Comey, demonstrated, email baffles just about all of us. And I don’t use a computer. So there isn’t another copy of the real report. I’ve got all my notes, though, so I ought to be able to cobble together a new report in a couple of months.

Apologies for the delay,
Robert Mueller

PS: Jeff has been chastised. We haven’t fired him, but in asking him about this he let slip that his parents didn’t pay taxes on the nanny who raised him and they may have strongly implied that he played on a high school curling team to get into college. His parents are going to jail and the nanny’s immigration status is being investigated. This requires a short re-opening of the investigation.

The mention of “Jeff” seems particularly Geng-like to me. Perhaps I’m reminded of “Ed.” Thinking of Geng makes me a bit sad, though, not just for her but because it reminds me of the passage of time. I associate Geng, Bill James, and Spy magazine with the mid-1980s. Ahhh, lost youth!

Continue Reading…


Read More

R Packages worth a look

Robust Re-Scaling to Better Recover Latent Effects in Data (rrscale)
Non-linear transformations of data to better discover latent effects. Applies a sequence of three transformations (1) a Gaussianizing transformation, ( …

Create Reproducible Research Projects (rosr)
Creates reproducible academic projects with integrated academic elements, including datasets, references, codes, images, manuscripts, dissertations, sl …

Group Sequential Design Class for Clinical Trials (seqmon)
S4 class object for creating and managing group sequential designs. It calculates the efficacy and futility boundaries at each look. It allows modifyin …

Tools for Managing SSH and Git Credentials (credentials)
Setup and retrieve HTTPS and SSH credentials for use with ‘git’ and other services. For HTTPS remotes the package interfaces the ‘git-credential’ utili …

Clustering and Classification Inference with U-Statistics (uclust)
Clustering and classification inference for high dimension low sample size (HDLSS) data with U-statistics. The package contains implementations of nonp …

Portfolio Safeguard: Optimization, Statistics and Risk Management (PSGExpress)
Solves optimization, advanced statistics, and risk management problems. Popular nonlinear functions in financial, statistical, and logistics applicatio …

Continue Reading…


Read More

Male journalists dominate the news

Two-thirds of bylines in American reporting credit men

Continue Reading…


Read More

Summer Interns 2019

We received almost 400 applications for our 2019 internship program from students with very diverse backgrounds. After interviewing several dozen people and making some very difficult decisions, we are pleased to announce that these twelve interns have accepted positions with us for this summer:

  • Therese Anders: Calibrated Peer Review. Prototype tools to conduct experiments to see whether calibrated peer review is a useful and feasible feedback strategy in introductory data science classes and industry workshops. (mentor: Mine Çetinkaya-Rundel)

  • Malcolm Barrett: R Markdown Enhancements. Tidy up and refactor the R Markdown code base. (mentor: Rich Iannone)

  • Julia Blum: RStudio Community Sustainability. Study, enhance documentation and processes, and onboard new users. (mentor: Curtis Kephart)

  • Joyce Cahoon: Object Scrubbers. Help write a set of methods to scrub different types of objects to reduce their size on disk. (mentors: Max Kuhn and Davis Vaughan)

  • Daniel Chen: Grader Enhancements. Enhance grader to identify students’ mistakes when doing automated tutorials. (mentor: Garrett Grolemund)

  • Marly Cormar: Production Testing Tools for Data Science Pipelines. Build on applicability domain methods from computational chemistry to create functions that can be included in a dplyr pipeline to perform statistical checks on data in production. (mentor: Max Kuhn)

  • Desiree De Leon: Teaching and Learning with RStudio. Create a one-stop guide to teaching with RStudio similar to Teaching and Learning with Jupyter. (mentor: Alison Hill)

  • Dewey Dunnington: ggplot2 Enhancements. Contribute to ggplot2 or an associated package (like scales) by writing R code for graphics and helping to manage a large, popular open source project. (mentor: Hadley Wickham)

  • Maya Gans: Tidy Blocks. Prototype and evaluate a block-based version of the tidyverse so that young students can do simple analysis using an interface like Scratch. (mentor: Greg Wilson)

  • Leslie Huang: Shiny Enhancements. Enhance Shiny’s UI, improve performance bottlenecks, fix bugs, and create a set of higher-order reactives for more sophisticated programming. (mentor: Barret Schloerke)

  • Grace Lawley: Tidy Practice. Develop practice projects so learners can practice tidyverse skills using interesting real-world data. (mentor: Alison Hill)

  • Yim Register: Data Science Training for Software Engineers. Develop course materials to teach basic data analysis to programmers using software engineering problems and data sets. (mentor: Greg Wilson)

We are very excited to welcome them all to the RStudio family, and we hope you’ll enjoy following their progress over the summer.

Continue Reading…


Read More

March 24, 2019

If you did not already know

Relational Forward Model (RFM) google
The behavioral dynamics of multi-agent systems have a rich and orderly structure, which can be leveraged to understand these systems, and to improve how artificial agents learn to operate in them. Here we introduce Relational Forward Models (RFM) for multi-agent learning, networks that can learn to make accurate predictions of agents’ future behavior in multi-agent environments. Because these models operate on the discrete entities and relations present in the environment, they produce interpretable intermediate representations which offer insights into what drives agents’ behavior, and what events mediate the intensity and valence of social interactions. Furthermore, we show that embedding RFM modules inside agents results in faster learning systems compared to non-augmented baselines. As more and more of the autonomous systems we develop and interact with become multi-agent in nature, developing richer analysis tools for characterizing how and why agents make decisions is increasingly necessary. Moreover, developing artificial agents that quickly and safely learn to coordinate with one another, and with humans in shared environments, is crucial. …

Auto-Encoding Variational Bayes google
How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contribution is two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
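The "reparameterization of the variational lower bound" mentioned above is the now-standard reparameterization trick; a tiny numpy sketch with invented toy values (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Instead of sampling z ~ N(mu, sigma^2) directly (which is not
# differentiable w.r.t. mu and sigma), sample eps ~ N(0, 1) and set
# z = mu + sigma * eps, so gradients can flow through mu and sigma.
mu, log_var = 1.5, np.log(0.25)   # toy "encoder outputs"
sigma = np.exp(0.5 * log_var)     # sigma = 0.5
eps = rng.standard_normal(100_000)
z = mu + sigma * eps

# z has the intended distribution N(1.5, 0.5^2)
print(z.mean(), z.std())
```

In a VAE, mu and log_var come from the encoder network, and this substitution is what makes the lower bound optimizable with ordinary stochastic gradient methods.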

CrossE google
Knowledge graph embedding aims to learn distributed representations for entities and relations, and is proven to be effective in many applications. Crossover interactions — bi-directional effects between entities and relations — help select related information when predicting a new triple, but haven’t been formally discussed before. In this paper, we propose CrossE, a novel knowledge graph embedding which explicitly simulates crossover interactions. It not only learns one general embedding for each entity and relation as most previous methods do, but also generates multiple triple specific embeddings for both of them, named interaction embeddings. We evaluate embeddings on typical link prediction tasks and find that CrossE achieves state-of-the-art results on complex and more challenging datasets. Furthermore, we evaluate embeddings from a new perspective — giving explanations for predicted triples, which is important for real applications. In this work, an explanation for a triple is regarded as a reliable closed-path between the head and the tail entity. Compared to other baselines, we show experimentally that CrossE, benefiting from interaction embeddings, is more capable of generating reliable explanations to support its predictions. …

Continue Reading…


Read More

What’s new on arXiv

Deep Text-to-Speech System with Seq2Seq Model

Recent trends in neural network based text-to-speech/speech synthesis pipelines have employed recurrent Seq2seq architectures that can synthesize realistic sounding speech directly from text characters. These systems, however, have complex architectures and take a substantial amount of time to train. We introduce several modifications to these Seq2seq architectures that allow for faster training time, and also allow us to reduce the complexity of the model architecture at the same time. We show that our proposed model can achieve attention alignment much faster than previous architectures and that good audio quality can be achieved with a model that’s much smaller in size. Sample audio available at https://…/tts-samples-for-cmpt-419.

An ‘On The Fly’ Framework for Efficiently Generating Synthetic Big Data Sets

Collecting, analyzing and gaining insight from large volumes of data is now the norm in an ever increasing number of industries. Data analytics techniques, such as machine learning, are powerful tools used to analyze these large volumes of data. Synthetic data sets are routinely relied upon to train and develop such data analytics methods for several reasons: to generate larger data sets than are available, to generate diverse data sets, to preserve anonymity in data sets with sensitive information, etc. Processing, transmitting and storing data is a key issue faced when handling large data sets. This paper presents an ‘On the fly’ framework for generating big synthetic data sets, suitable for these data analytics methods, that is both computationally efficient and applicable to a diverse set of problems. An example application of the proposed framework is presented along with a mathematical analysis of its computational efficiency, demonstrating its effectiveness.

Self-Organization and Artificial Life

Self-organization can be broadly defined as the ability of a system to display ordered spatio-temporal patterns solely as the result of the interactions among the system components. Processes of this kind characterize both living and artificial systems, making self-organization a concept that is at the basis of several disciplines, from physics to biology to engineering. Placed at the frontiers between disciplines, Artificial Life (ALife) has heavily borrowed concepts and tools from the study of self-organization, providing mechanistic interpretations of life-like phenomena as well as useful constructivist approaches to artificial system design. Despite its broad usage within ALife, the concept of self-organization has been often excessively stretched or misinterpreted, calling for a clarification that could help with tracing the borders between what can and cannot be considered self-organization. In this review, we discuss the fundamental aspects of self-organization and list the main usages within three primary ALife domains, namely ‘soft’ (mathematical/computational modeling), ‘hard’ (physical robots), and ‘wet’ (chemical/biological systems) ALife. Finally, we discuss the usefulness of self-organization within ALife studies, point to perspectives for future research, and list open questions.

Dying ReLU and Initialization: Theory and Numerical Examples

The dying ReLU refers to the problem when ReLU neurons become inactive and only output 0 for any input. There are many empirical and heuristic explanations of why ReLU neurons die. However, little is known about its theoretical analysis. In this paper, we rigorously prove that a deep ReLU network will eventually die in probability as the depth goes to infinity. Several methods have been proposed to alleviate the dying ReLU. Perhaps one of the simplest treatments is to modify the initialization procedure. One common way of initializing weights and biases uses symmetric probability distributions, which suffers from the dying ReLU. We thus propose a new initialization procedure, namely, a randomized asymmetric initialization. We prove that the new initialization can effectively prevent the dying ReLU. All parameters required for the new initialization are theoretically designed. Numerical examples are provided to demonstrate the effectiveness of the new initialization procedure.
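To make "dying ReLU" concrete, here is an invented, deterministic one-unit toy example (not from the paper): once the pre-activation is negative on the whole dataset, both the output and the gradient vanish everywhere, so gradient descent can never revive the unit.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

# A "dead" unit: the weight and bias keep the pre-activation below zero
# for every input in the dataset.
X = np.linspace(-1.0, 1.0, 5)   # toy 1-d dataset
w, b = 0.5, -1.0                # |w * x| <= 0.5 < |b|, so w*x + b < 0 always

pre = w * X + b
out = relu(pre)                       # 0 for every input
grad_mask = (pre > 0).astype(float)   # ReLU derivative: 0 for every input
```

Since the gradient mask is zero on the whole dataset, w and b receive no updates, which is the failure mode the proposed asymmetric initialization is designed to avoid.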

Probabilistic Temperature Forecasting with a Heteroscedastic Autoregressive Ensemble Postprocessing model

Weather prediction today is performed with numerical weather prediction (NWP) models. These are deterministic simulation models describing the dynamics of the atmosphere, and evolving the current conditions forward in time to obtain a prediction for future atmospheric states. To account for uncertainty in NWP models it has become common practice to employ ensembles of NWP forecasts. However, NWP ensembles often exhibit forecast biases and dispersion errors, thus require statistical postprocessing to improve reliability of the ensemble forecasts. This work proposes an extension of a recently developed postprocessing model utilizing autoregressive information present in the forecast error of the raw ensemble members. The original approach is modified to let the variance parameter depend on the ensemble spread, yielding a two-fold heteroscedastic model. Furthermore, an additional high-resolution forecast is included into the postprocessing model, yielding improved predictive performance. Finally, it is outlined how the autoregressive model can be utilized to postprocess ensemble forecasts with higher forecast horizons, without the necessity of making fundamental changes to the original model. We accompany the new methodology by an implementation within the R package ensAR to make our method available for other researchers working in this area. To illustrate the performance of the heteroscedastic extension of the autoregressive model, and its use for higher forecast horizons we present a case study for a data set containing 12 years of temperature forecasts and observations over Germany. The case study indicates that the autoregressive model yields particularly strong improvements for forecast horizons beyond 24 hours.

A Data Mining Approach to Flight Arrival Delay Prediction for American Airlines

In the present scenario of domestic flights in the USA, there have been numerous instances of flight delays and cancellations. American Airlines, Inc. has been one of the most trusted airlines and the world’s largest in terms of number of destinations served. But when it comes to domestic flights, AA has not lived up to expectations in terms of punctuality or on-time performance. Flight delays also result in airline companies operating commercial flights incurring huge losses, so they are trying their best to prevent or avoid delays and cancellations by taking certain measures. This study aims at analyzing flight information for US domestic flights operated by American Airlines, covering the top 5 busiest US airports, and predicting possible arrival delay of a flight using data mining and machine learning approaches. A Gradient Boosting Classifier model is deployed, trained, and hyper-parameter tuned, achieving a maximum accuracy of 85.73%. Such an intelligent system is essential in foretelling flights’ on-time performance.

Algorithms for Verifying Deep Neural Networks

Deep neural networks are widely used for nonlinear function approximation with applications ranging from computer vision to control. Although these networks involve the composition of simple arithmetic operations, it can be very challenging to verify whether a particular network satisfies certain input-output properties. This article surveys methods that have emerged recently for soundly verifying such properties. These methods borrow insights from reachability analysis, optimization, and search. We discuss fundamental differences and connections between existing algorithms. In addition, we provide pedagogical implementations of existing methods and compare them on a set of benchmark problems.
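As a flavor of the reachability-style methods such surveys cover, here is a hedged sketch of interval bound propagation, one simple, sound-but-incomplete way to bound a ReLU network's outputs over an input box (the network weights are invented toy values, not from the article):

```python
import numpy as np

def interval_forward(Ws, bs, lo, hi):
    """Propagate an input box [lo, hi] through affine+ReLU layers using
    interval arithmetic; the result is a sound (possibly loose) output box."""
    for W, b in zip(Ws, bs):
        Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
        new_lo = Wp @ lo + Wn @ hi + b   # lower bound of the affine map
        new_hi = Wp @ hi + Wn @ lo + b   # upper bound of the affine map
        # ReLU is monotone, so it maps bounds to bounds
        lo, hi = np.maximum(new_lo, 0.0), np.maximum(new_hi, 0.0)
    return lo, hi

def forward(Ws, bs, x):
    """Exact forward pass, for checking that outputs stay inside the box."""
    for W, b in zip(Ws, bs):
        x = np.maximum(W @ x + b, 0.0)
    return x

# Toy 2-layer ReLU network with invented weights
Ws = [np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([[1.0, 1.0]])]
bs = [np.array([0.0, -1.0]), np.array([0.5])]
lo, hi = interval_forward(Ws, bs, np.array([-1.0, -1.0]), np.array([1.0, 1.0]))
```

If the computed box already satisfies the desired input-output property, the network is verified; if not, the bounds may simply be too loose, which is where the optimization- and search-based refinements the survey discusses come in.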

Spherical Principal Component Analysis

Principal Component Analysis (PCA) is one of the most important methods to handle high dimensional data. However, most of the studies on PCA aim to minimize the loss after projection, which usually measures the Euclidean distance, though in some fields, angle distance is known to be more important and critical for analysis. In this paper, we propose a method by adding constraints on factors to unify the Euclidean distance and angle distance. However, due to the nonconvexity of the objective and constraints, the optimized solution is not easy to obtain. We propose an alternating linearized minimization method to solve it with provable convergence rate and guarantee. Experiments on synthetic data and real-world datasets have validated the effectiveness of our method and demonstrated its advantages over state-of-art clustering methods.

Ontology Based Global and Collective Motion Patterns for Event Classification in Basketball Videos

In multi-person videos, especially team sport videos, a semantic event is usually represented as a confrontation between two teams of players, which can be represented as collective motion. In broadcast basketball videos, specific camera motions are used to present specific events. Therefore, a semantic event in broadcast basketball videos is closely related to both the global motion (camera motion) and the collective motion. A semantic event in basketball videos can be generally divided into three stages: pre-event, event occurrence (event-occ), and post-event. In this paper, we propose an ontology-based global and collective motion pattern (On_GCMP) algorithm for basketball event classification. First, a two-stage GCMP based event classification scheme is proposed. The GCMP is extracted using optical flow. The two-stage scheme progressively combines a five-class event classification algorithm on event-occs and a two-class event classification algorithm on pre-events. Both algorithms utilize sequential convolutional neural networks (CNNs) and long short-term memory (LSTM) networks to extract the spatial and temporal features of GCMP for event classification. Second, we utilize post-event segments to predict success/failure using deep features of images in the video frames (RGB_DF_VF) based algorithms. Finally the event classification results and success/failure classification results are integrated to obtain the final results. To evaluate the proposed scheme, we collected a new dataset called NCAA+, which is automatically obtained from the NCAA dataset by extending the fixed length of video clips forward and backward of the corresponding semantic events. The experimental results demonstrate that the proposed scheme achieves the mean average precision of 59.22% on NCAA+. It is higher by 7.62% than state-of-the-art on NCAA.

Minimizing Age of Information in Cognitive Radio-Based IoT Networks: Underlay or Overlay?

We consider a cognitive radio-based Internet-of-Things (CR-IoT) network consisting of one primary IoT (PIoT) system and one secondary IoT (SIoT) system. The IoT devices of both the PIoT and the SIoT respectively monitor one physical process and send randomly generated status updates to their associated access points (APs). The timeliness of the status updates is important as the systems are interested in the latest condition (e.g., temperature, speed and position) of the IoT device. In this context, two natural questions arise: (1) How to characterize the timeliness of the status updates in CR-IoT systems? (2) Which scheme, overlay or underlay, is better in terms of the timeliness of the status updates. To answer these two questions, we adopt a new performance metric, named the age of information (AoI). We analyze the average peak AoI of the PIoT and the SIoT for overlay and underlay schemes, respectively. Simple asymptotic expressions of the average peak AoI are also derived when the PIoT operates at high signal-to-noise ratio (SNR). Based on the asymptotic expressions, we characterize a critical generation rate of the PIoT system, which can determine the superiority of overlay and underlay schemes in terms of the average peak AoI of the SIoT. Numerical results validate the theoretical analysis and uncover that the overlay and underlay schemes can outperform each other in terms of the average peak AoI of the SIoT for different system setups.

Emotion Action Detection and Emotion Inference: the Task and Dataset

Many Natural Language Processing works on emotion analysis focus only on simple emotion classification, without exploring the potential of putting emotion into 'event context', and ignore the analysis of emotion-related events. One main reason is the lack of such a corpus. Here we present the Cause-Emotion-Action Corpus, which manually annotates not only emotions but also cause events and action events. We propose two new tasks based on the dataset: emotion causality and emotion inference. The first task is to extract a triple (cause, emotion, action). The second task is to infer the probable emotion. We release the dataset with 10,603 samples and 15,892 events, basic statistical analysis, and baselines for both the emotion causality and emotion inference tasks. Baseline performance demonstrates that there is much room for improvement on both tasks.

A Deep Look into Neural Ranking Models for Information Retrieval

Ranking models lie at the heart of research on information retrieval (IR). Over the past decades, different techniques have been proposed for constructing ranking models, from traditional heuristic and probabilistic methods to modern machine learning methods. Recently, with the advance of deep learning technology, we have witnessed a growing body of work applying shallow or deep neural networks to the ranking problem in IR, referred to as neural ranking models in this paper. The power of neural ranking models lies in their ability to learn from raw text inputs for the ranking problem, avoiding many limitations of hand-crafted features. Neural networks have sufficient capacity to model complicated tasks, which is needed to handle the complexity of relevance estimation in ranking. Since a large variety of neural ranking models have been proposed, we believe it is the right time to summarize the current status, learn from existing methodologies, and gain some insights for future development. In contrast to existing reviews, in this survey we take a deep look into neural ranking models along different dimensions to analyze their underlying assumptions, major design principles, and learning strategies. We compare these models on benchmark tasks to obtain a comprehensive empirical understanding of the existing techniques. We also discuss what is missing in the current literature and what the promising future directions are.

Spatiotemporal Feature Learning for Event-Based Vision

Unlike conventional frame-based sensors, event-based visual sensors output information through spikes at a high temporal resolution. By encoding only changes in pixel intensity, they offer a low-power, low-latency approach to visual information sensing. To use this information for higher sensory tasks like object recognition and tracking, an essential simplification step is the extraction and learning of features. An ideal feature descriptor must be robust to changes involving (i) local transformations and (ii) re-appearances of a local event pattern. To that end, we propose a novel spatiotemporal feature representation learning algorithm based on slow feature analysis (SFA). Using SFA, smoothly varying linear projections are learnt that are robust to local visual transformations. To determine whether the features can learn to be invariant to various visual transformations, feature point tracking tasks are used for evaluation. Extensive experiments across two datasets demonstrate the adaptability of the spatiotemporal feature learner to translation, scaling, and rotational transformations of the feature points. More importantly, we find that the obtained feature representations are able to exploit the high temporal resolution of such event-based cameras to generate better feature tracks.
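
The linear core of SFA is compact enough to sketch: whiten the input, then keep the projections whose outputs have the smallest temporal-derivative variance. The signal, mixing, and dimensions below are invented for illustration; the paper's event-based spatiotemporal variant is more involved.

```python
import numpy as np

def linear_sfa(X, n_components=1):
    """Minimal linear slow feature analysis: whiten, then keep the
    directions whose outputs have the least temporal-derivative
    variance (i.e. change most slowly)."""
    X = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    W = evecs / np.sqrt(evals)           # whitening matrix
    Z = X @ W
    dZ = np.diff(Z, axis=0)              # discrete time derivative
    devals, devecs = np.linalg.eigh(np.cov(dZ, rowvar=False))
    P = devecs[:, :n_components]         # smallest derivative variance
    return Z @ P, W @ P

# usage: recover a slow sine from a mixture with a fast oscillation
t = np.linspace(0, 4 * np.pi, 2000)
slow, fast = np.sin(t), np.sin(37 * t)
X = np.column_stack([slow + 0.5 * fast, 0.5 * slow - fast])
y, projection = linear_sfa(X)
```

On this toy mixture the slowest recovered component matches the slow source up to sign, which is the invariance property the abstract appeals to.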

swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

This paper reports our efforts on swCaffe, a highly efficient parallel framework for accelerating deep neural network (DNN) training on Sunway TaihuLight, the current fastest supercomputer in the world, which adopts a unique many-core heterogeneous architecture with 40,960 SW26010 processors connected through a customized communication network. First, we point out some insightful principles to fully exploit the performance of the innovative many-core architecture. Second, we propose a set of optimization strategies for redesigning a variety of neural network layers based on Caffe. Third, we put forward a topology-aware parameter synchronization scheme to scale the synchronous Stochastic Gradient Descent (SGD) method to multiple processors efficiently. We evaluate our framework by training a variety of widely used neural networks with the ImageNet dataset. On a single node, swCaffe achieves 23%–119% of the overall performance of Caffe running on a K40m GPU. Compared with Caffe on CPU, swCaffe runs 3.04–7.84× faster on all the networks. Finally, we present the scalability of swCaffe for the training of ResNet-50 and AlexNet on the scale of 1024 nodes.

Learning to find order in disorder

We introduce the use of neural networks as classifiers on classical disordered systems with no spatial ordering. In this study, we implement a convolutional neural network trained to identify the spin-glass state in the three-dimensional Edwards-Anderson Ising spin-glass model from an input of Monte Carlo sampled configurations at a given temperature. The neural network is designed to be flexible with the input size and can accurately perform inference over a small sample of the instances in the test set. Using the neural network to classify instances of the three-dimensional Edwards-Anderson Ising spin-glass in a (random) field we show that the inferred phase boundary is consistent with the absence of an Almeida-Thouless line.

Visual Query Answering by Entity-Attribute Graph Matching and Reasoning

Visual Query Answering (VQA) is of great significance in offering people convenience: one can raise a question about details of objects, or high-level understanding of the scene, over an image. This paper proposes a novel method to address the VQA problem. In contrast to prior works, our method, which targets single-scene VQA, relies on graph-based techniques and involves reasoning. In a nutshell, our approach is centered on three graphs. The first graph, referred to as the inference graph GI, is constructed via learning over labeled data. The other two graphs, referred to as the query graph Q and the entity-attribute graph GEA, are generated from the natural language query Qnl and the image Img, respectively, that are issued by users. As GEA often does not contain sufficient information to answer Q, we develop techniques to infer the missing information of GEA with GI. Based on GEA and Q, we provide techniques to find matches of Q in GEA as the answer to Qnl over Img. Unlike commonly used VQA methods that are based on end-to-end neural networks, our graph-based method shows well-designed reasoning capability, and thus is highly interpretable. We also create a dataset on soccer matches (Soccer-VQA) with rich annotations. The experimental results show that our approach outperforms the state-of-the-art method and has high potential for future investigation.

Practical Distributed Learning: Secure Machine Learning with Communication-Efficient Local Updates

Federated learning on edge devices poses new challenges arising from workers that misbehave, privacy needs, etc. We propose a new robust federated optimization algorithm, with provable convergence and robustness under non-IID settings. Empirical results show that the proposed algorithm stabilizes the convergence and tolerates data poisoning on a small number of workers.

Model-Based Task Transfer Learning

A model-based task transfer learning (MBTTL) method is presented. We consider a constrained nonlinear dynamical system and assume that a dataset of state and input pairs that solve a task T1 is available. Our objective is to find a feasible state-feedback policy for a second task, T2, by using the stored data from T1. Our approach applies to tasks T2 which are composed of the same subtasks as T1, but in a different order. In this paper we formally introduce the definition of a subtask and the MBTTL problem, and provide examples of MBTTL in the fields of autonomous cars and manipulators. Then, a computationally efficient approach to solving the MBTTL problem is presented, along with proofs of feasibility for constrained linear dynamical systems. Simulation results show the effectiveness of the proposed method.

Change Point Detection in the Mean of High-Dimensional Time Series Data under Dependence

High-dimensional time series are characterized by a large number of measurements and complex dependence, and often involve abrupt change points. We propose a new procedure to detect change points in the mean of high-dimensional time series data. The proposed procedure incorporates the spatial and temporal dependence of the data and is able to test for and estimate change points occurring on the boundary of the time series. We study its asymptotic properties under mild conditions. Simulation studies demonstrate its robust performance through comparison with other existing methods. Our procedure is applied to an fMRI dataset.
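
As a point of orientation only (the proposed procedure is high-dimensional and dependence-aware, which this sketch is not), a classical univariate CUSUM statistic locates a single mean shift as follows; the data and shift point are invented for the example:

```python
import numpy as np

def cusum_changepoint(x):
    """Estimate a single mean-shift location by maximizing the
    standardized CUSUM statistic over candidate split points."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = np.arange(1, n)                  # candidate split points
    s = np.cumsum(x)[:-1]                # partial sums S_1 .. S_{n-1}
    stat = np.abs(s - k * x.mean()) / np.sqrt(k * (n - k) / n)
    return int(np.argmax(stat)) + 1, float(stat.max())

# usage: the mean shifts from 0 to 2 after the 300th observation
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(2.0, 1.0, 200)])
khat, _ = cusum_changepoint(x)
```

The maximizing split point lands near the true change; boundary change points (the case the paper handles) are exactly where this simple statistic loses power, since k(n − k)/n degenerates there.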

Deep Feature Selection using a Teacher-Student Network

High-dimensional data in many machine learning applications leads to computational and analytical complexities. Feature selection provides an effective way for solving these problems by removing irrelevant and redundant features, thus reducing model complexity and improving accuracy and generalization capability of the model. In this paper, we present a novel teacher-student feature selection (TSFS) method in which a ‘teacher’ (a deep neural network or a complicated dimension reduction method) is first employed to learn the best representation of data in low dimension. Then a ‘student’ network (a simple neural network) is used to perform feature selection by minimizing the reconstruction error of low dimensional representation. Although the teacher-student scheme is not new, to the best of our knowledge, it is the first time that this scheme is employed for feature selection. The proposed TSFS can be used for both supervised and unsupervised feature selection. This method is evaluated on different datasets and is compared with state-of-the-art existing feature selection methods. The results show that TSFS performs better in terms of classification and clustering accuracies and reconstruction error. Moreover, experimental evaluations demonstrate a low degree of sensitivity to parameter selection in the proposed method.

DSPG: Decentralized Simultaneous Perturbations Gradient Descent Scheme

In this paper, we present an easy-to-implement asynchronous approximate gradient method called DSPG (Decentralized Simultaneous Perturbation Stochastic Approximations, with Constant Sensitivity Parameters). It is obtained by modifying SPSA (Simultaneous Perturbation Stochastic Approximations) to allow for decentralized optimization in multi-agent learning and distributed control scenarios. SPSA is a popular approximate gradient method developed by Spall that is used in robotics and learning. In the multi-agent learning setup considered herein, the agents are assumed to be asynchronous (agents abide by their local clocks) and communicate via a wireless medium that is prone to losses and delays. We analyze the gradient estimation bias that arises from setting the sensitivity parameters to a single value, and the bias that arises from communication losses and delays. Specifically, we show that these biases can be countered through better and more frequent communication and/or by choosing a small fixed value for the sensitivity parameters. We also discuss the variance of the gradient estimator and its effect on the rate of convergence. Finally, we present numerical results supporting DSPG and the aforementioned theories and discussions.
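
For readers unfamiliar with SPSA: it estimates the entire gradient from just two function evaluations by perturbing all coordinates simultaneously with a random sign vector. A minimal single-agent sketch, with an invented quadratic objective and gain schedule (not the DSPG configuration), keeping the sensitivity parameter c fixed as in the constant-sensitivity setting:

```python
import numpy as np

def spsa_gradient(f, x, c, rng):
    """One SPSA estimate: perturb every coordinate at once with a
    random sign vector; only two evaluations of f are needed."""
    delta = rng.choice([-1.0, 1.0], size=x.shape)
    return (f(x + c * delta) - f(x - c * delta)) / (2.0 * c * delta)

def spsa_minimize(f, x0, steps=2000, a=0.1, c=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(steps):
        g = spsa_gradient(f, x, c, rng)
        x = x - (a / (1 + k) ** 0.602) * g   # decaying gain sequence
        # c stays fixed throughout: a "constant sensitivity parameter"
    return x

# usage: minimize a simple quadratic with minimum at (1, -2)
f = lambda x: float(np.sum((x - np.array([1.0, -2.0])) ** 2))
x_star = spsa_minimize(f, np.zeros(2))
```

Keeping c fixed simplifies coordination between decentralized agents at the cost of the estimation bias the paper quantifies.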

Adversarial Attacks on Deep Neural Networks for Time Series Classification

Time Series Classification (TSC) problems are encountered in many real-life data mining tasks, ranging from medicine and security to human activity recognition and food safety. With the recent success of deep neural networks in domains such as computer vision and natural language processing, researchers have started adopting these techniques for time series data mining problems. However, to the best of our knowledge, no previous work has considered the vulnerability of deep learning models to adversarial time series examples, which could potentially make them unreliable in situations where the decision taken by the classifier is crucial, such as in medicine and security. For computer vision problems, such attacks have been shown to be very easy to perform: adding an imperceptible amount of noise to the image tricks the network into wrongly classifying the input. Following this line of work, we propose to leverage existing adversarial attack mechanisms to add specially crafted noise to the input time series in order to decrease the network's confidence when classifying instances at test time. Our results reveal that current state-of-the-art deep learning time series classifiers are vulnerable to adversarial attacks, which can have major consequences in domains such as food safety and quality assurance.
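
The attack mechanism alluded to is FGSM-style: perturb the input by a small step in the sign of the loss gradient. A toy sketch on an invented linear "classifier" over a length-50 series (the model, weights, and epsilon are all made up for illustration; the paper attacks deep networks):

```python
import numpy as np

def fgsm_perturb(x, grad_wrt_x, epsilon):
    """FGSM-style attack: step the input in the sign of the loss
    gradient, a perturbation bounded by epsilon per time step."""
    return x + epsilon * np.sign(grad_wrt_x)

# usage on a toy linear classifier p(y=1|x) = sigmoid(w @ x)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(0)
w = rng.normal(size=50)                  # hypothetical trained weights
x = w / np.linalg.norm(w)                # a series confidently in class 1
p = sigmoid(w @ x)
grad = -(1.0 - p) * w                    # gradient of -log p(y=1|x) w.r.t. x
x_adv = fgsm_perturb(x, grad, epsilon=0.3)
p_adv = sigmoid(w @ x_adv)               # confidence collapses
```

Even this linear toy shows the mechanism: a per-step perturbation of fixed small magnitude accumulates across the series into a large change in the decision score.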

Learning Competitive and Discriminative Reconstructions for Anomaly Detection

Most of the existing methods for anomaly detection use only positive data to learn the data distribution, so they usually need a pre-defined threshold at the detection stage to determine whether a test instance is an outlier. Unfortunately, a good threshold is vital for performance and it is really hard to find an optimal one. In this paper, we take the discriminative information implied in unlabeled data into consideration and propose a new method for anomaly detection that can learn the labels of unlabeled data directly. Our proposed method has an end-to-end architecture with one encoder and two decoders that are trained to model the data distributions of inliers and outliers in a competitive way. This architecture works in a discriminative manner without suffering from overfitting, and the training algorithm of our model is adapted from SGD, so it is efficient and scalable even for large-scale datasets. Empirical studies on 7 datasets, including KDD99, MNIST, Caltech-256, and ImageNet, show that our model outperforms the state-of-the-art methods.

Time Series Predict DB

In this work, we are motivated to make predictive functionalities native to database systems, with a focus on time series data. We propose a system architecture, Time Series Predict DB, that enables predictive queries in any existing time series database by building an additional 'prediction index' for time series data. To be effective, such an index needs to be built incrementally while keeping up with database throughput, able to scale with the volume of data, provide accurate predictions for heterogeneous data, and allow for 'predictive' querying with latency comparable to traditional database queries. Building upon a recently developed model-agnostic time series algorithm by making it incremental and scalable, we build such a system on top of PostgreSQL. Using extensive experimentation, we show that our incremental prediction index updates faster than PostgreSQL (1 μs per data point for the prediction index vs 4 μs per data point for PostgreSQL) and thus does not affect the throughput of the database. Across a variety of time series data, we find that our incremental, model-agnostic algorithm provides better accuracy than the best state-of-the-art time series libraries (median improvement in the range 3.29–4.19× over Prophet of Facebook, 1.27–1.48× over AMELIA in R). The latency of predictive queries with respect to SELECT queries (0.5 ms) is < 1.9× (0.8 ms) for imputation and < 7.6× (3 ms) for forecasting across machine platforms. As a by-product, we find that the incremental, scalable variant we propose improves the accuracy of the batch prediction algorithm, which may be of interest in its own right.
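
As a stand-in for the kind of O(1)-per-insert incremental state such a prediction index must maintain, here is a deliberately simple exponential-smoothing forecaster. This is illustrative only, not the paper's model-agnostic algorithm; the class name and parameters are invented:

```python
class IncrementalForecaster:
    """Exponential smoothing with O(1) work and O(1) state per inserted
    point: a stand-in for an incrementally maintained prediction index
    (illustrative only, not the paper's model-agnostic algorithm)."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.level = None

    def update(self, y):                 # called once per inserted row
        if self.level is None:
            self.level = float(y)
        else:
            self.level = self.alpha * y + (1 - self.alpha) * self.level

    def forecast(self):                  # answers a predictive query
        return self.level

# usage: stream points in, then ask for the one-step-ahead forecast
f = IncrementalForecaster(alpha=0.5)
for y in [10.0, 12.0, 11.0, 13.0]:
    f.update(y)
```

The constant per-point cost is what lets an index like this keep pace with the database's ingest path, which is the design constraint the abstract emphasizes.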

readPTU: a Python Library to Analyse Time Tagged Time Resolved Data

readPTU is a Python package designed to analyze time-correlated single-photon counting data. The library promotes storing the complete arrival-time information of the photons and full flexibility in post-processing the data for analysis. It supports the computation of time-resolved signals with external triggers, and second-order autocorrelation function analysis can be performed using multiple algorithms that offer the user different trade-offs between speed and accuracy. Additionally, a thresholding algorithm to perform time post-selection is also available. The library has been designed with performance and extensibility in mind, to allow future users to implement support for additional file extensions and algorithms without having to deal with low-level details. We demonstrate the performance of readPTU by analyzing the second-order autocorrelation function of the resonance fluorescence from a single quantum dot in a two-dimensional semiconductor.
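
At its core, second-order autocorrelation analysis reduces to histogramming inter-channel click delays. The sketch below is written from scratch to show that computation and does not use readPTU's actual API:

```python
import numpy as np

def g2_histogram(t1, t2, window, bins):
    """Histogram of click delays t2 - t1 between two detector
    channels, restricted to |delay| <= window: the core of a
    second-order autocorrelation (g2) estimate."""
    delays = []
    j0 = 0
    for t in t1:
        while j0 < len(t2) and t2[j0] < t - window:
            j0 += 1                      # skip clicks left of the window
        j = j0
        while j < len(t2) and t2[j] <= t + window:
            delays.append(t2[j] - t)
            j += 1
    return np.histogram(delays, bins=bins, range=(-window, window))

# usage: two channels whose clicks are offset by one time unit
t1 = np.array([0.0, 10.0])
t2 = np.array([1.0, 11.0])
hist, edges = g2_histogram(t1, t2, window=2.0, bins=4)
```

Because both timestamp streams are sorted, the sliding pointer j0 keeps the scan linear in the number of clicks, which is the kind of speed/accuracy trade-off the library's multiple algorithms negotiate.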

Training Over-parameterized Deep ResNet Is almost as Easy as Training a Two-layer Network

It has been proved that gradient descent converges linearly to a global minimum when training a deep neural network in the over-parameterized regime. However, according to Allen-Zhu et al. (2018), the width of each layer should grow at least polynomially with the depth (the number of layers) for a residual network (ResNet) in order to guarantee the linear convergence of gradient descent, which shows no obvious advantage over a feedforward network. In this paper, we successfully remove the dependence of the width on the depth of the network for ResNet and reach the conclusion that training a deep residual network can be as easy as training a two-layer network. This theoretically justifies the benefit of skip connections in terms of facilitating the convergence of gradient descent. Our experiments also confirm that the width of ResNet required to guarantee successful training is much smaller than that of a deep feedforward neural network.
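
The structural property at stake is the identity skip connection: a residual block computes y = x + f(x), so even a tiny (or zero) residual branch leaves the overall map close to the identity, which is what eases optimization. A minimal numpy illustration (shapes and weights invented):

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + W2 @ relu(W1 @ x): the skip term carries the input
    through unchanged, whatever the residual branch does."""
    return x + W2 @ np.maximum(W1 @ x, 0.0)

# usage: with a zero residual branch the block is exactly the identity
x = np.array([1.0, -2.0, 3.0])
W1 = np.zeros((8, 3))
W2 = np.zeros((3, 8))
y = residual_block(x, W1, W2)
```

Stacking many such blocks never destroys the signal the way a deep feedforward stack with small weights can, which is the intuition the convergence analysis formalizes.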

Topic-Guided Variational Autoencoders for Text Generation

We propose a topic-guided variational autoencoder (TGVAE) model for text generation. Distinct from existing variational autoencoder (VAE) based approaches, which assume a simple Gaussian prior for the latent code, our model specifies the prior as a Gaussian mixture model (GMM) parametrized by a neural topic module. Each mixture component corresponds to a latent topic, which provides guidance for generating sentences under that topic. The neural topic module and the VAE-based neural sequence module in our model are learned jointly. In particular, a sequence of invertible Householder transformations is applied to endow the approximate posterior of the latent code with high flexibility during model inference. Experimental results show that our TGVAE outperforms alternative approaches on both unconditional and conditional text generation, and can generate semantically meaningful sentences on various topics.
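
A Householder transformation is the orthogonal reflection H = I - 2 v v^T / (v^T v); chaining several of them yields a flexible yet trivially invertible map, which is why they are a convenient family for enriching an approximate posterior. A minimal sketch (dimensions and vectors invented for the example):

```python
import numpy as np

def householder(v):
    """Reflection H = I - 2 v v^T / (v^T v); H is orthogonal, so a
    chain of such maps is invertible with |det| = 1."""
    v = np.asarray(v, dtype=float)
    return np.eye(len(v)) - 2.0 * np.outer(v, v) / (v @ v)

# usage: compose a few reflections, as in a Householder flow
rng = np.random.default_rng(1)
H = np.eye(4)
for _ in range(3):
    H = householder(rng.normal(size=4)) @ H
```

Orthogonality means the log-determinant of the composed map is zero, so the flow's density correction in the variational bound is free to compute.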


nice student project

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

In all of my undergraduate classes, I require a term project, done in groups of 3-4 students. Though the topic is specified, it is largely open-ended, a level of “freedom” that many students are unaccustomed to. However, some adapt quite well. The topic this quarter was to choose a CRAN package that does not use any C/C++, and try to increase speed by converting some of the code to C/C++.

Some of the project submissions were really excellent. I decided to place one on the course Web page, and chose this one. Nice usage of Rcpp and devtools (neither of which was covered in class), very nicely presented.

To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.


Let’s get it right

Article: Are We Being Programmed?

A short read on the psychological impacts of app design, data science, and the human predisposition for being conditioned. Before I jump in, let me ask you some questions. How often are you on your smartphone? Do you sometimes find yourself opening Facebook, YouTube, or Instagram for a spare second between tasks and end up spending more time than you meant to? Have you ever found yourself reloading the feeds to see if something better just got posted? Have you ever closed a Facebook tab only to open it a few minutes later? Be honest about those questions, since only you need to know, and I didn't even ask about Tinder or porn. According to announcements by Facebook, users on average spend 50 minutes of their day on the app (that's out of 2 billion people across the globe). And it's not just Facebook. YouTube, Snapchat, Instagram, and Twitter all take giant slices out of their users' day, and a lot of those users overlap.

Paper: Theories of Parenting and their Application to Artificial Intelligence

As machine learning (ML) systems have advanced, they have acquired more power over humans’ lives, and questions about what values are embedded in them have become more complex and fraught. It is conceivable that in the coming decades, humans may succeed in creating artificial general intelligence (AGI) that thinks and acts with an open-endedness and autonomy comparable to that of humans. The implications would be profound for our species; they are now widely debated not just in science fiction and speculative research agendas but increasingly in serious technical and policy conversations. Much work is underway to try to weave ethics into advancing ML research. We think it useful to add the lens of parenting to these efforts, and specifically radical, queer theories of parenting that consciously set out to nurture agents whose experiences, objectives and understanding of the world will necessarily be very different from their parents’. We propose a spectrum of principles which might underpin such an effort; some are relevant to current ML research, while others will become more important if AGI becomes more likely. These principles may encourage new thinking about the development, design, training, and release into the world of increasingly autonomous agents.

Paper: Online Explanation Generation for Human-Robot Teaming

As Artificial Intelligence (AI) becomes an integral part of our life, the development of explainable AI, embodied in the decision-making process of an AI or robotic agent, becomes imperative. For a robotic teammate, the ability to generate explanations of its behavior is one of the key requirements of an explainable agency. Prior work on explanation generation focuses on supporting the reasoning behind the robot's behavior. These approaches, however, fail to consider the cognitive effort needed to understand the received explanation. In particular, the human teammate is expected to understand any explanation provided before the task execution, no matter how much information is presented in the explanation. In this work, we argue that an explanation, especially a complex one, should be made in an online fashion during the execution, which helps spread out the information to be explained and thus reduces the cognitive load on humans. However, a challenge here is that the different parts of an explanation are dependent on each other, which must be taken into account when generating online explanations. To this end, a general formulation of online explanation generation is presented. We base our explanation generation method on the model reconciliation setting introduced in our prior work. Our approach is evaluated both with human subjects in a standard planning competition (IPC) domain, using the NASA Task Load Index (TLX), and in simulation with four different problems.

Paper: Applying Probabilistic Programming to Affective Computing

Affective Computing is a rapidly growing field spurred by advancements in artificial intelligence, but it is often held back by the inability to translate psychological theories of emotion into tractable computational models. To address this, we propose a probabilistic programming approach to affective computing, which models psychologically grounded theories as generative models of emotion, and implements them as stochastic, executable computer programs. We first review probabilistic approaches that integrate reasoning about emotions with reasoning about other latent mental states (e.g., beliefs, desires) in context. Recently developed probabilistic programming languages offer several key desiderata over previous approaches, such as: (i) flexibility in representing emotions and emotional processes; (ii) modularity and compositionality; (iii) integration with deep learning libraries that facilitate efficient inference and learning from large, naturalistic data; and (iv) ease of adoption. Furthermore, using a probabilistic programming framework allows a standardized platform for theory-building and experimentation: competing theories (e.g., of appraisal or other emotional processes) can be easily compared via modular substitution of code followed by model comparison. To jumpstart adoption, we illustrate our points with executable code that researchers can easily modify for their own models. We end with a discussion of applications and future directions of the probabilistic programming approach.

Article: Artificial Intelligence and Society

Press the pause button! Artificial Intelligence (AI) continues to be a growing focus in the media. An agenda gathering momentum like the cloud did, particularly in the business world. On a global path of technology innovation, AI may seem the next logical step towards progress. Computing power, storage, and processor speed have rapidly improved, and it’s now the turn of the algorithms. But what is progress? What is the cost? And is this what humanity really needs or wants? Who decides? A good place to begin is to define what AI actually is. For the purpose of this post, AI is software that, when executed, can demonstrate an element of decision-making where a programmed result may be unknown, and would typically require human intelligence to perform the decision-making task. AI usually includes an aspect of automated processing that engages one or more of the human senses i.e. sight, speech, sound, taste or smell. Recent discussions in the media, online articles and radio broadcasts sometimes blur the lines between two identifiable AI spaces:
• Near term: machines to perform faster, identify patterns, make unaided decisions, and undertake relatively complex tasks with the goal of reducing any human requirement to perform the same tasks.
• Long term: machines to potentially possess the characteristic of ‘consciousness’ – this is a different space.
Conversations can wander between the impact of these two very different visions. I recently listened to a radio discussion where a caller spoke about a cull on jobs and the knock-on effects within society, but then the caller leapt to a possibility that machines could wipe out humankind. The distinction between the two is important.

Paper: Responses to a Critique of Artificial Moral Agents

The field of machine ethics is concerned with the question of how to embed ethical behaviors, or a means to determine ethical behaviors, into artificial intelligence (AI) systems. The goal is to produce artificial moral agents (AMAs) that are either implicitly ethical (designed to avoid unethical consequences) or explicitly ethical (designed to behave ethically). Van Wynsberghe and Robbins’ (2018) paper Critiquing the Reasons for Making Artificial Moral Agents critically addresses the reasons offered by machine ethicists for pursuing AMA research; this paper, co-authored by machine ethicists and commentators, aims to contribute to the machine ethics conversation by responding to that critique. The reasons for developing AMAs discussed in van Wynsberghe and Robbins (2018) are: it is inevitable that they will be developed; the prevention of harm; the necessity for public trust; the prevention of immoral use; such machines are better moral reasoners than humans, and building these machines would lead to a better understanding of human morality. In this paper, each co-author addresses those reasons in turn. In so doing, this paper demonstrates that the reasons critiqued are not shared by all co-authors; each machine ethicist has their own reasons for researching AMAs. But while we express a diverse range of views on each of the six reasons in van Wynsberghe and Robbins’ critique, we nevertheless share the opinion that the scientific study of AMAs has considerable value.

Paper: Responsible and Representative Multimodal Data Acquisition and Analysis: On Auditability, Benchmarking, Confidence, Data-Reliance & Explainability

The ethical decisions behind the acquisition and analysis of audio, video or physiological human data, harnessed for (deep) machine learning algorithms, is an increasing concern for the Artificial Intelligence (AI) community. In this regard, herein we highlight the growing need for responsible, and representative data collection and analysis, through a discussion of modality diversification. Factors such as Auditability, Benchmarking, Confidence, Data-reliance, and Explainability (ABCDE), have been touched upon within the machine learning community, and here we lay out these ABCDE sub-categories in relation to the acquisition and analysis of multimodal data, to weave through the high priority ethical concerns currently under discussion for AI. To this end, we propose how these five subcategories can be included in early planning of such acquisition paradigms.

Paper: Machine Learning: A Dark Side of Cancer Computing

Cancer analysis and prediction is a research field of the utmost importance for the well-being of humankind. Cancer data are analyzed, and outcomes predicted, using machine learning algorithms. Most researchers claim the accuracy of their predicted results to be within 99%. However, we show that machine learning algorithms can easily reach an accuracy of 100% on the Wisconsin Diagnostic Breast Cancer dataset. We show that this method of gaining accuracy is an unethical approach by which we can easily mislead the algorithms. In this paper, we exploit the weaknesses of machine learning algorithms. We perform extensive experiments to establish the correctness of our results, and the methods are rigorously evaluated to validate our claim. In addition, this paper focuses on the correctness of accuracy. The paper reports three key outcomes of the experiments, namely, correctness of accuracies, significance of minimum accuracy, and correctness of machine learning algorithms.


Magister Dixit

“Within 10 years, data science will be so enmeshed within industry-specific applications and broad productivity tools that we may no longer think of it as a hot career. Just as generations of math and statistics students have gone on to fill all manner of roles in business and academia without thinking of themselves as mathematicians or statisticians, the newly minted data scientist grads will be tomorrow’s manufacturing engineers, marketing leaders and medical researchers.” Nate Oostendorp ( Mar 1, 2019 )

Continue Reading…


Read More

Should we talk less about bad social science research and more about bad medical research?

Paul Alper pointed me to this news story, “Harvard Calls for Retraction of Dozens of Studies by Noted Cardiac Researcher: Some 31 studies by Dr. Piero Anversa contain fabricated or falsified data, officials concluded. Dr. Anversa popularized the idea of stem cell treatment for damaged hearts.”

I replied: Ahhh, Harvard . . . the reporter should’ve asked Marc Hauser for a quote.

Alper responded:

Marc Hauser’s research involved “cotton-top tamarin monkeys” while Piero Anversa was falsifying and spawning research on damaged hearts:

The cardiologist rocketed to fame in 2001 with a flashy paper claiming that, contrary to scientific consensus, heart muscle could be regenerated. If true, the research would have had enormous significance for patients worldwide.

I, and I suspect virtually all of the other contributors to your blog, know nothing** about cotton-top tamarin monkeys, but we are fascinated and interested in stem cells and heart regeneration. Consequently, are Hauser and Anversa separated by a chasm, or should they be lumped together in the Hall of Shame? Put another way, do we have yet another instance of crime and appropriate punishment?

**Your blog audience is so broad that there well may be cotton-top tamarin monkey mavens out there dying to hit the enter key.

Good point. It’s not up to me at all: I don’t administer punishment of any sort; as a blogger I function as a very small news organization, and my only role is to sometimes look into these cases, bring them to others’ notice, and host discussions. If it were up to me, David Weakliem and Jay Livingston would be regular New York Times columnists, and Mark Palko and Joseph Delaney would be the must-read bloggers that everyone would check each morning. Also, if it were up to me, everyone would have to post all their data and code—at least, that would be the default policy; researchers would have to give very good reasons to get out of this requirement. (Not that I always or even usually post my data and code; but I should do better too.) But none of these things are up to me.

From Harvard’s point of view, perhaps the question is whether they should go easy on people like Hauser, a person who is basically an entertainer, and whose main crime was to fake some of his entertainment—a sort of Doris Kearns Goodwin, if you will—and be tougher on people such as Anversa, whose misdeeds can cost lives. (I don’t know where you should put someone like John Yoo who advocated for actual torture, but I suppose that someone who agreed with Yoo politically would make a similar argument against, say, old-style apologists for the Soviet Union.)

One argument for not taking people like Hauser, Wansink, etc., seriously, even in their misdeeds, is that after the flaws in their methods were revealed—after it turned out that their blithe confidence (in Wansink’s case) or attacks on whistleblowers (in Hauser’s case) were not borne out by the data—these guys just continued to say their original claims were valid. So, for them, it was never about the data at all; it was always about their stunning ideas. Or, to put it another way, the data were there to modify the details of their existing hypotheses, or to allow them to gently develop and extend their models, in a way comparable to how Philip K. Dick used the I Ching to decide what would happen next in his books. (Actually, that analogy is pretty good, as one could just as well say that Dick used randomness not so much to “decide what would happen” but rather to “discover what would happen” next.)

Anyway, to get back to the noise-miners: The supposed empirical support was just there for them to satisfy the conventions of modern-day science. So when it turned out that the promised data had never been there . . . so what, really? The data never mattered in the first place, as these researchers implicitly admitted by not giving up on any of their substantive claims. So maybe these profs should just move into the Department of Imaginative Literature and the universities can call it a day. The medical researchers who misreport their data: That’s a bigger problem.

And what about the news media, myself included? Should I spend more time blogging about medical research and less time blogging about social science research? It’s a tough call. Social science is my own area of expertise, so I think I’m making more of a contribution by leveraging that expertise than by opining on medical research that I don’t really understand.

A related issue is accessibility: people send me more items on social science, and it takes me less effort to evaluate social science claims.

Also, I think social science is important. It does not seem that there’s any good evidence that elections are determined by shark attacks or the outcomes of college football games, or that subliminal smiley faces cause large swings in opinion, or that women’s political preferences vary greatly based on time of the month—but if any (or, lord help us, all) of these claims were true, then this would be consequential: it would “punch a big hole in democratic theory,” in the memorable words of Larry Bartels.

Monkey language and bottomless soup bowls: I don’t care about those so much. So why have I devoted so much blog space to those silly cases? Partly it’s from a fascination with people who refuse to admit error even when it’s staring them in the face, partly because it can give insights into general issues in statistics and science, and partly because I think people can miss the point in these cases by focusing on the drama and missing out on the statistics; see for example here and here. But mostly I write more about social science because social science is my “thing.” Just like I write more about football and baseball than about rugby and cricket.

P.S. One more thing: Don’t forget that in all these fields, social science, medical science, whatever, the problem is not just with bad research, cheaters, or even incompetents. No, there are big problems even with solid research done by honest researchers who are doing their best but are still using methods that misrepresent what we learn from the data. For example, the ORBITA study of heart stents, where p=0.20 (actually p=0.09 when the data were analyzed more appropriately) was widely reported as implying no effect. Honesty and transparency—and even skill and competence in the use of standard methods—are not enough. Sometimes, as in the above post, it makes sense to talk about flat-out bad research and the prominent people who do it, but that’s only one part of the story.

Continue Reading…


Read More

ShinyProxy 2.2.0

(This article was first published on Open Analytics, and kindly contributed to R-bloggers)

ShinyProxy is a novel, open source platform to deploy Shiny apps for the enterprise
or larger organizations.

Secured Embedding of Shiny Apps

Since version 2.0.1 ShinyProxy provides a REST API to manage (launch, shut down) Shiny apps and consume the content programmatically inside broader web applications or portals. This makes it possible to cleanly separate the responsibility for the Shiny apps (data science teams) from that for the broader applications (IT teams), while still achieving seamless integration between the two from the user’s perspective. With this release we go one step further and support the industry standard for protecting REST APIs, namely OAuth 2.0.

In practice this means the following: when users of the portal log on, they typically authenticate with an OAuth2 provider (e.g. Auth0). This then allows the web application to access the ShinyProxy API on their behalf and launch the Shiny apps over the ShinyProxy API. We leave out the details on authorization codes and access tokens, but the core message is that you can now embed Shiny apps in virtually any other web application in a secure way. If you want an actual example,
please head to our Github page with ShinyProxy configuration examples, where a sample Node.js application is made available to demonstrate the full scenario.


Miscellaneous improvements

More generally, users are deploying ShinyProxy on a wide array of cloud platforms and using a great variety of authentication technologies. A lot of the experience gained can now be found in updated documentation with additional examples on e.g. AWS Cognito (here) or Microsoft Azure AD B2C (here), next to Google and Auth0 (here). These production deployments also called for more extensive documentation on logging with ShinyProxy and at that level we also introduced a new setting logging.requestdump to enable full request dump for advanced debugging. Then, for user convenience we introduced user-friendly URLs to access an app either via the standard ShinyProxy interface (/app/) or directly (/app_direct/) if needed.
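For orientation, these options live in ShinyProxy's Spring Boot `application.yml`. A minimal sketch is shown below; the spec id and container image are placeholders, and key names should be checked against the documentation for your ShinyProxy version:

```yaml
proxy:
  title: My ShinyProxy
  port: 8080
  specs:
    - id: my-app                         # served at /app/my-app or /app_direct/my-app
      container-image: openanalytics/shinyproxy-demo

logging:
  requestdump: true                      # new in 2.2.0: full request dumps for debugging
```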

Full release notes can be found on the downloads page, along with updated documentation. As always, community support is available for this new release.

Don’t hesitate to send in questions or suggestions and have fun with ShinyProxy!


Continue Reading…


Read More

What’s new on arXiv

Distance Preserving Grid Layouts

Distance preserving visualization techniques have emerged as one of the fundamental tools for data analysis. One example is the class of techniques that arrange data instances into two-dimensional grids so that the pairwise distances among the instances are preserved in the produced layouts. Currently, the state-of-the-art approaches produce such grids by solving assignment problems or using permutations to optimize cost functions. Although precise, such strategies are computationally expensive, limited to small datasets, or dependent on specialized hardware to speed up the process. In this paper, we present a new technique, called Distance-preserving Grid (DGrid), that employs a binary space partitioning process in combination with multidimensional projections to create orthogonal regular grid layouts. Our results show that DGrid is as precise as the existing state-of-the-art techniques while requiring only a fraction of the running time and computational resources.
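The binary space partitioning at the heart of DGrid can be illustrated in a few lines (a simplified sketch, not the published algorithm, which pairs the partitioning with a multidimensional projection step and a more careful split rule): recursively bisect the projected 2D points and the grid in tandem, so that every point ends up in its own cell.

```python
def dgrid(points, rows, cols):
    """Assign each 2D point a unique (row, col) cell of a rows x cols grid
    by recursively bisecting the point set and the grid together."""
    assert len(points) <= rows * cols, "need at least one cell per point"
    cells = {}

    def split(pts, r0, c0, r, c):
        if not pts:
            return
        if r == 1 and c == 1:
            cells[pts[0][2]] = (r0, c0)   # one point left: it owns this cell
            return
        if r >= c:
            # horizontal cut: sort by y, fill the top r//2 rows first
            pts = sorted(pts, key=lambda p: p[1])
            half = r // 2
            k = min(len(pts), half * c)   # capacity of the top sub-grid
            split(pts[:k], r0, c0, half, c)
            split(pts[k:], r0 + half, c0, r - half, c)
        else:
            # vertical cut: sort by x, fill the left c//2 columns first
            pts = sorted(pts, key=lambda p: p[0])
            half = c // 2
            k = min(len(pts), half * r)
            split(pts[:k], r0, c0, r, half)
            split(pts[k:], r0, c0 + half, r, c - half)

    split([(x, y, i) for i, (x, y) in enumerate(points)], 0, 0, rows, cols)
    return cells

# four projected points land in four distinct cells of a 2x2 grid
print(dgrid([(0.1, 0.9), (0.8, 0.2), (0.5, 0.5), (0.9, 0.9)], 2, 2))
```

Because each split only sorts and slices, the whole assignment runs in O(n log² n), which is the source of the speedup over assignment-problem solvers.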

Estimating Dynamic Conditional Spread Densities to Optimise Daily Storage Trading of Electricity

This paper formulates dynamic density functions, based upon skewed-t and similar representations, to model and forecast electricity price spreads between different hours of the day. This supports an optimal day ahead storage and discharge schedule, and thereby facilitates a bidding strategy for a merchant arbitrage facility into the day-ahead auctions for wholesale electricity. The four latent moments of the density functions are dynamic and conditional upon exogenous drivers, thereby permitting the mean, variance, skewness and kurtosis of the densities to respond hourly to such factors as weather and demand forecasts. The best specification for each spread is selected based on the Pinball Loss function, following the closed form analytical solutions of the cumulative density functions. Those analytical properties also allow the calculation of risk associated with the spread arbitrages. From these spread densities, the optimal daily operation of a battery storage facility is determined.
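The Pinball Loss used for model selection above has a simple closed form; a minimal sketch (the generic quantile-loss formula, not the paper's full evaluation pipeline):

```python
def pinball_loss(y, q, tau):
    """Pinball (quantile) loss of a forecast quantile q at level tau,
    given the realized value y; lower is better when averaged over a test set."""
    return max(tau * (y - q), (tau - 1) * (y - q))

# at tau = 0.9, under-forecasting the 90% quantile is penalized 9x more heavily
print(round(pinball_loss(10.0, 8.0, 0.9), 6))   # 1.8  (forecast too low)
print(round(pinball_loss(10.0, 12.0, 0.9), 6))  # 0.2  (forecast too high)
```

Averaged over hours and days, this score rewards density forecasts whose tau-quantiles actually track the spread distribution, which is why it is a natural selection criterion here.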

Global Fire Season Severity Analysis and Forecasting

Global fire activity has a huge impact on human lives. In recent years, many fire models have been developed to forecast fire activity. They present good results for some regions but require complex parametrizations and input variables that are not easily obtained or estimated. In this paper, we evaluate the possibility of using historical data from 2003 to 2017 of active fire detections (NASA’s MODIS MCD14ML C6) and time series forecasting methods to estimate global fire season severity (FSS), defined here as the accumulated fire detections in a season. We used a hexagonal grid to divide the globe, and we extracted time series of daily fire counts from each cell. We propose a straightforward method to estimate fire season lengths. Our results show that in 99% of the cells, the fire seasons are shorter than seven months. Given this result, we extracted the fire seasons defined as time windows of seven months centered on the months with the highest fire occurrence. A trend analysis suggests a global decrease in length and severity. Since the FSS time series are short, we used the monthly-accumulated fire counts (MA-FC) to train and test the seven forecasting models. Results show low forecasting errors in some areas. We therefore conclude that many regions present predictable variations in FSS.
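Concretely, the severity computation described above reduces to summing detections over a seven-month window centered on the peak month, wrapping around the year (a stdlib sketch with made-up monthly counts; the paper works from gridded daily MODIS detections):

```python
def fire_season_severity(monthly_counts):
    """Given 12 monthly fire-detection counts (Jan..Dec) for one grid cell,
    return (peak_month_index, severity), where severity is the sum over the
    7-month window centered on the month with the most detections
    (wrapping around the year boundary)."""
    peak = max(range(12), key=lambda m: monthly_counts[m])
    window = [(peak + off) % 12 for off in range(-3, 4)]
    return peak, sum(monthly_counts[m] for m in window)

# hypothetical cell with an August peak: severity sums May through November
print(fire_season_severity([0, 0, 1, 2, 5, 9, 20, 30, 18, 6, 2, 0]))  # (7, 90)
```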

On confidence intervals centered on bootstrap smoothed estimators

We assess the performance, in terms of coverage probability and expected length, of confidence intervals centered on the bootstrap smoothed (bagged) estimator, for two nested linear regression models, with unknown error variance, and model selection using a preliminary t test.

Episodic Memory Reader: Learning What to Remember for Question Answering from Streaming Data

We consider a novel question answering (QA) task where the machine needs to read from large streaming data (long documents or videos) without knowing when the questions will be given, in which case the existing QA methods fail due to lack of scalability. To tackle this problem, we propose a novel end-to-end reading comprehension method, which we refer to as the Episodic Memory Reader (EMR), that sequentially reads the input contexts into an external memory, while replacing memories that are less important for answering unseen questions. Specifically, we train an RL agent to replace a memory entry when the memory is full, in order to maximize its QA accuracy at a future timepoint, while encoding the external memory using the transformer architecture to learn representations that consider the relative importance between the memory entries. We validate our model on a real-world large-scale textual QA task (TriviaQA) and a video QA task (TVQA), on which it achieves significant improvements over rule-based memory scheduling policies and an RL-based baseline that learns the query-specific importance of each memory independently.

Improving Neural Architecture Search Image Classifiers via Ensemble Learning

Finding the best neural network architecture requires significant time, resources, and human expertise. These challenges are partially addressed by neural architecture search (NAS), which is able to find the best convolutional layer or cell that is then used as a building block for the network. However, once a good building block is found, manual design is still required to assemble the final architecture as a combination of multiple blocks under a predefined parameter budget constraint. A common solution is to stack these blocks into a single tower and adjust the width and depth to fill the parameter budget, but such single-tower architectures may not be optimal. Instead, in this paper we present the AdaNAS algorithm, which uses ensemble techniques to compose a neural network as an ensemble of smaller networks automatically. Additionally, we introduce a novel technique based on knowledge distillation to iteratively train the smaller networks using the previous ensemble as a teacher. Our experiments demonstrate that ensembles of networks improve accuracy over a single neural network while keeping the same number of parameters. Our models achieve results comparable with the state-of-the-art on CIFAR-10 and set a new state-of-the-art on CIFAR-100.

Show, Translate and Tell

Humans have an incredible ability to process and understand information from multiple sources such as images, video, text, and speech. The recent success of deep neural networks has enabled us to develop algorithms which give machines the ability to understand and interpret this information. There is a need to both broaden their applicability and develop methods which correlate visual information with semantic content. We propose a unified model which jointly trains on images and captions, and learns to generate new captions given either an image or a caption query. We evaluate our model on three different tasks, namely cross-modal retrieval, image captioning, and sentence paraphrasing. Our model gains insight into cross-modal vector embeddings, generalizes well on multiple tasks, and is competitive with state-of-the-art methods on retrieval.

Inference Without Compatibility

We consider hypothesis testing problems for a single covariate in the context of a linear model with Gaussian design when p>n. Under minimal sparsity conditions of their type and without any compatibility condition, we construct an asymptotically Gaussian estimator with variance equal to that of the oracle least-squares. The estimator is based on a weighted average of all models of a given sparsity level, in the spirit of exponential weighting. We adapt this procedure to estimate the signal strength and provide a few applications. We support our results using numerical simulations based on an algorithm that approximates the theoretical estimator, and provide a comparison with the de-biased lasso.

Distributed Constrained Online Learning

In this paper, we consider groups of agents in a network that select actions in order to satisfy a set of constraints that vary arbitrarily over time and minimize a time-varying function of which they have only local observations. The selection of actions, also called a strategy, is causal and decentralized, i.e., the dynamical system that determines the actions of a given agent depends only on the constraints at the current time and on its own actions and those of its neighbors. To determine such a strategy, we propose a decentralized saddle point algorithm and show that the corresponding global fit and regret are bounded by functions of the order of \sqrt{T}. Specifically, we define the global fit of a strategy as a vector that integrates over time the global constraint violations as seen by a given node. The fit is a performance loss associated with online operation as opposed to offline clairvoyant operation which can always select an action if one exists, that satisfies the constraints at all times. If this fit grows sublinearly with the time horizon it suggests that the strategy approaches the feasible set of actions. Likewise, we define the regret of a strategy as the difference between its accumulated cost and that of the best fixed action that one could select knowing beforehand the time evolution of the objective function. Numerical examples support the theoretical conclusions.
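A single-agent, stdlib-only sketch of such a saddle point strategy follows (the decentralized version additionally exchanges information with neighbors; the static toy objective, constraint, and step size here are illustrative, not from the paper): the primal variable descends the Lagrangian gradient while the dual variable ascends the constraint violation, projected onto the nonnegative orthant.

```python
def online_saddle_point(grad_f, g, grad_g, x0, T, eta):
    """Primal-dual (saddle point) updates for min f_t(x) s.t. g_t(x) <= 0.
    Returns the final primal/dual iterates and the accumulated fit
    (the integrated constraint violation)."""
    x, lam = x0, 0.0
    fit = 0.0
    for t in range(T):
        gx = g(x, t)
        x = x - eta * (grad_f(x, t) + lam * grad_g(x, t))  # primal descent
        lam = max(0.0, lam + eta * gx)                     # dual ascent, lam >= 0
        fit += gx
    return x, lam, fit

x, lam, fit = online_saddle_point(
    grad_f=lambda x, t: 2 * (x - 2),  # f_t(x) = (x - 2)^2, static for simplicity
    g=lambda x, t: x - 1,             # constraint g_t(x) = x - 1 <= 0
    grad_g=lambda x, t: 1.0,
    x0=0.0, T=2000, eta=0.05,
)
# x approaches the constrained optimum 1.0; lam approaches the multiplier 2.0
```

In the time-varying, multi-agent setting the same two updates run at every node, and the sublinear growth of fit and regret is exactly the guarantee the paper establishes.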

On Target Shift in Adversarial Domain Adaptation

Discrepancy between training and testing domains is a fundamental problem in the generalization of machine learning techniques. Recently, several approaches have been proposed to learn domain invariant feature representations through adversarial deep learning. However, label shift, where the percentage of data in each class is different between domains, has received less attention. Label shift naturally arises in many contexts, especially in behavioral studies where the behaviors are freely chosen. In this work, we propose a method called Domain Adversarial nets for Target Shift (DATS) to address label shift while learning a domain invariant representation. This is accomplished by using distribution matching to estimate label proportions in a blind test set. We extend this framework to handle multiple domains by developing a scheme to upweight source domains most similar to the target domain. Empirical results show that this framework performs well under large label shift in synthetic and real experiments, demonstrating the practical importance.

Online Explanation Generation for Human-Robot Teaming

As Artificial Intelligence (AI) becomes an integral part of our life, the development of explainable AI, embodied in the decision-making process of an AI or robotic agent, becomes imperative. For a robotic teammate, the ability to generate explanations to explain its behavior is one of the key requirements of an explainable agency. Prior work on explanation generation focuses on supporting the reasoning behind the robot’s behavior. These approaches, however, fail to consider the cognitive effort needed to understand the received explanation. In particular, the human teammate is expected to understand any explanation provided before the task execution, no matter how much information is presented in the explanation. In this work, we argue that an explanation, especially complex ones, should be made in an online fashion during the execution, which helps to spread out the information to be explained and thus reducing the cognitive load of humans. However, a challenge here is that the different parts of an explanation are dependent on each other, which must be taken into account when generating online explanations. To this end, a general formulation of online explanation generation is presented. We base our explanation generation method in a model reconciliation setting introduced in our prior work. Our approach is evaluated both with human subjects in a standard planning competition (IPC) domain, using NASA Task Load Index (TLX), as well as in simulation with four different problems.

Cloud-Edge Coordinated Processing: Low-Latency Multicasting Transmission

Recently, edge caching and multicasting have arisen as two promising technologies to support high-data-rate and low-latency delivery in wireless communication networks. In this paper, we design three transmission schemes aiming to minimize the delivery latency for cache-enabled multigroup multicasting networks. In particular, a full caching bulk transmission scheme is first designed as a performance benchmark for the ideal situation where the caching capability of each enhanced remote radio head (eRRH) is sufficiently large to cache all files. For the practical situation where the caching capability of each eRRH is limited, we further design two transmission schemes, namely the partial caching bulk transmission (PCBT) and the partial caching pipelined transmission (PCPT) schemes. In the PCBT scheme, eRRHs first fetch the uncached requested files from the baseband unit (BBU) and then all requested files are simultaneously transmitted to the users. In the PCPT scheme, eRRHs transmit the cached requested files while fetching the uncached requested files from the BBU. Then, the remaining cached requested files and fetched uncached requested files are simultaneously transmitted to the users. The design goal of the three transmission schemes is to minimize the delivery latency, subject to some practical constraints. Efficient algorithms are developed for the low-latency cloud-edge coordinated transmission strategies. Numerical results are provided to evaluate the performance of the proposed transmission schemes and show that the PCPT scheme outperforms the PCBT scheme in terms of the delivery latency criterion.

Applying Probabilistic Programming to Affective Computing

Affective Computing is a rapidly growing field spurred by advancements in artificial intelligence, but it is often held back by the inability to translate psychological theories of emotion into tractable computational models. To address this, we propose a probabilistic programming approach to affective computing, which models psychologically grounded theories as generative models of emotion, and implements them as stochastic, executable computer programs. We first review probabilistic approaches that integrate reasoning about emotions with reasoning about other latent mental states (e.g., beliefs, desires) in context. Recently developed probabilistic programming languages offer several key desiderata over previous approaches, such as: (i) flexibility in representing emotions and emotional processes; (ii) modularity and compositionality; (iii) integration with deep learning libraries that facilitate efficient inference and learning from large, naturalistic data; and (iv) ease of adoption. Furthermore, using a probabilistic programming framework allows a standardized platform for theory-building and experimentation: competing theories (e.g., of appraisal or other emotional processes) can be easily compared via modular substitution of code followed by model comparison. To jumpstart adoption, we illustrate our points with executable code that researchers can easily modify for their own models. We end with a discussion of applications and future directions of the probabilistic programming approach.

A Context-Aware Citation Recommendation Model with BERT and Graph Convolutional Networks

With the tremendous growth in the number of scientific papers being published, searching for references while writing a scientific paper is a time-consuming process. A technique that could add a reference citation at the appropriate place in a sentence would be beneficial. From this perspective, context-aware citation recommendation has been researched for around two decades. Many researchers have utilized the text data called the context sentence, which surrounds the citation tag, and the metadata of the target paper to find the appropriate cited research. However, the lack of well-organized benchmarking datasets and of models that can attain high performance has made the research difficult. In this paper, we propose a deep learning based model and a well-organized dataset for context-aware paper citation recommendation. Our model comprises a document encoder and a context encoder, which use a Graph Convolutional Networks (GCN) layer and Bidirectional Encoder Representations from Transformers (BERT), a pre-trained model for textual data. By modifying the related PeerRead dataset, we propose a new dataset called FullTextPeerRead containing context sentences, cited references, and paper metadata. To the best of our knowledge, this dataset is the first well-organized dataset for context-aware paper recommendation. The results indicate that the proposed model with the proposed datasets can attain state-of-the-art performance and achieves a more than 28% improvement in mean average precision (MAP) and recall@k.

Multimodal Deep Learning for Finance: Integrating and Forecasting International Stock Markets

Stock prices are influenced by numerous factors. We present a method to combine these factors and we validate the method by taking the international stock market as a case study. In today’s increasingly international economy, return and volatility spillover effects across international equity markets are major macroeconomic drivers of stock dynamics. Thus, foreign market information is one of the most important factors in forecasting domestic stock prices. However, the cross-correlation between domestic and foreign markets is so complex that it would be extremely difficult to express it explicitly with a dynamical equation. In this study, we develop stock return prediction models that can jointly consider international markets, using multimodal deep learning. Our contributions are three-fold: (1) we visualize the transfer information between South Korea and US stock markets using scatter plots; (2) we incorporate the information into stock prediction using multimodal deep learning; (3) we conclusively show that both early and late fusion models achieve a significant performance boost in comparison with single modality models. Our study indicates that considering international stock markets jointly can improve prediction accuracy, and deep neural networks are very effective for such tasks.

Neuromorphic Hardware learns to learn

Hyperparameters and learning algorithms for neuromorphic hardware are usually chosen by hand. In contrast, the hyperparameters and learning algorithms of networks of neurons in the brain, which they aim to emulate, have been optimized through extensive evolutionary and developmental processes for specific ranges of computing and learning tasks. Occasionally this process has been emulated through genetic algorithms, but these themselves require hand-design of their details and tend to provide a limited range of improvements. We instead employ other powerful gradient-free optimization tools, such as cross-entropy methods and evolutionary strategies, in order to port the function of biological optimization processes to neuromorphic hardware. As an example, we show that this method produces neuromorphic agents that learn very efficiently from rewards. In particular, meta-plasticity, i.e., the optimization of the learning rule which they use, substantially enhances the reward-based learning capability of the hardware. In addition, we demonstrate for the first time Learning-to-Learn benefits from such hardware, in particular the capability to extract abstract knowledge from prior learning experiences that speeds up the learning of new but related tasks. Learning-to-Learn is especially suited for accelerated neuromorphic hardware, since it makes it feasible to carry out the required very large number of network computations.
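The cross-entropy method mentioned above fits in a few lines (a generic, stdlib-only sketch; on real neuromorphic hardware the score function would be the measured reward of a run with the sampled hyperparameters, and one would typically search several parameters jointly):

```python
import random
import statistics

def cross_entropy_method(score, mu, sigma, iters=30, pop=50, elite=10, seed=0):
    """Gradient-free cross-entropy method for one hyperparameter:
    sample candidates from a Gaussian, keep the elite fraction by score,
    refit the Gaussian to the elites, repeat."""
    rng = random.Random(seed)
    for _ in range(iters):
        samples = [rng.gauss(mu, sigma) for _ in range(pop)]
        elites = sorted(samples, key=score, reverse=True)[:elite]
        mu = statistics.fmean(elites)
        sigma = statistics.stdev(elites) + 1e-6  # keep a little exploration
    return mu

# toy 'reward' peaked at a learning rate of 0.01 (a stand-in for hardware runs)
best = cross_entropy_method(score=lambda lr: -(lr - 0.01) ** 2, mu=1.0, sigma=1.0)
# best ends up close to the optimum 0.01
```

Because the method only needs score evaluations, it makes no assumptions about differentiability, which is what makes it applicable to hardware-in-the-loop optimization.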

Content Differences in Syntactic and Semantic Representations

Syntactic analysis plays an important role in semantic parsing, but the nature of this role remains a topic of ongoing debate. The debate has been constrained by the scarcity of empirical comparative studies between syntactic and semantic schemes, which hinders the development of parsing methods informed by the details of target schemes and constructions. We target this gap, and take Universal Dependencies (UD) and UCCA as a test case. After abstracting away from differences of convention or formalism, we find that most content divergences can be ascribed to: (1) UCCA’s distinction between a Scene and a non-Scene; (2) UCCA’s distinction between primary relations, secondary ones and participants; (3) different treatment of multi-word expressions, and (4) different treatment of inter-clause linkage. We further discuss the long tail of cases where the two schemes take markedly different approaches. Finally, we show that the proposed comparison methodology can be used for fine-grained evaluation of UCCA parsing, highlighting both challenges and potential sources for improvement. The substantial differences between the schemes suggest that semantic parsers are likely to benefit downstream text understanding applications beyond their syntactic counterparts.

MFAS: Multimodal Fusion Architecture Search

We tackle the problem of finding good architectures for multimodal classification problems. We propose a novel and generic search space that spans a large number of possible fusion architectures. In order to find an optimal architecture for a given dataset in the proposed search space, we leverage an efficient sequential model-based exploration approach that is tailored for the problem. We demonstrate the value of posing multimodal fusion as a neural architecture search problem by extensive experimentation on a toy dataset and two other real multimodal datasets. We discover fusion architectures that exhibit state-of-the-art performance for problems with different domain and dataset size, including the NTU RGB+D dataset, the largest multi-modal action recognition dataset available.

Role of Bloom Filter in Big Data Research: A Survey

Big Data is one of the most popular emerging trends; it has become a blessing for humankind and a necessity of day-to-day life, Facebook being one example. Every person is involved in producing data, either directly or indirectly. Thus, Big Data is a high volume of data, growing at an exponential rate, that consists of a variety of data types. Big Data touches all fields, including the government sector, the IT industry, business, economics, engineering, bioinformatics, and other basic sciences, and thus forms a data silo. Most of the data are duplicates and unstructured. To deal with such a data silo, the Bloom Filter is a precious resource for filtering out the duplicate data. The Bloom Filter is also indispensable in a Big Data storage system for optimizing memory consumption: it uses a tiny amount of memory space to filter a very large volume of data, storing information about a large set. Although its functionality is limited to membership filtering, it can be adapted to various applications, and it is deployed in diverse fields and interdisciplinary research areas, bioinformatics for instance. In this article, we expose the usefulness of the Bloom Filter in Big Data research.
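The membership-filter role described above fits in a few lines of Python (a minimal sketch; production systems size the bit array m and hash count k from the expected item count and target false-positive rate, and pack the bits):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k hash functions derived
    from salted SHA-256. add() sets k bits; membership tests report
    "possibly present" (no false negatives, small false-positive rate)."""

    def __init__(self, m=1024, k=5):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for clarity

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

seen = BloomFilter()
seen.add("user:42/click")
print("user:42/click" in seen)  # True: added items are always found
# absent items are reported absent, except with a small, tunable probability
```

This one-sided error is exactly what makes the structure suitable for deduplication: a "no" is definitive, so only the (rare) "yes" answers need a slower exact check.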

Selective Kernel Networks

In standard Convolutional Neural Networks (CNNs), the receptive fields of artificial neurons in each layer are designed to share the same size. It is well-known in the neuroscience community that the receptive field size of visual cortical neurons is modulated by the stimulus, which has rarely been considered in constructing CNNs. We propose a dynamic selection mechanism in CNNs that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information. A building block called the Selective Kernel (SK) unit is designed, in which multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches. Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer. Multiple SK units are stacked into a deep network termed Selective Kernel Networks (SKNets). On the ImageNet and CIFAR benchmarks, we empirically show that SKNet outperforms the existing state-of-the-art architectures with lower model complexity. Detailed analyses show that the neurons in SKNet can capture target objects with different scales, which verifies the capability of neurons for adaptively adjusting their receptive field sizes according to the input. The code and models are available at https://…/SKNet.

Deep Neural Network Ensembles for Time Series Classification

Deep neural networks have revolutionized many fields such as computer vision and natural language processing. Inspired by this recent success, deep learning has started to show promising results for Time Series Classification (TSC). However, neural networks are still behind the state-of-the-art TSC algorithms, which are currently ensembles of 37 non-deep-learning classifiers. We attribute this gap in performance to the lack of neural network ensembles for TSC. Therefore, in this paper we show how an ensemble of 60 deep learning models can significantly improve upon the current state-of-the-art performance of neural networks for TSC when evaluated over the UCR/UEA archive, the largest publicly available benchmark for time series analysis. Finally, we show how our proposed Neural Network Ensemble (NNE) is the first time series classifier to outperform COTE while reaching similar performance to the current state-of-the-art ensemble HIVE-COTE.
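
The ensembling mechanism itself is straightforward: average the class-probability outputs of the individual networks and take the argmax. A toy sketch with hypothetical model outputs (not the paper's 60 networks):

```python
def ensemble_predict(prob_lists):
    """Average per-class probabilities across models and pick the argmax."""
    num_models = len(prob_lists)
    num_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / num_models for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: avg[c]), avg

# Three hypothetical models scoring one time series over 3 classes
model_outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.4, 0.5, 0.1],
]
label, avg = ensemble_predict(model_outputs)
print(label)  # class index with the highest averaged probability
```

Averaging probabilities (rather than hard votes) lets confident models outweigh uncertain ones, which is the usual design choice for deep ensembles.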

Matching Entities Across Different Knowledge Graphs with Graph Embeddings

This paper explores the problem of matching entities across different knowledge graphs. Given a query entity in one knowledge graph, we wish to find the corresponding real-world entity in another knowledge graph. We formalize this problem and present two large-scale datasets for this task based on existing cross-ontology links between DBpedia and Wikidata, focused on several hundred thousand ambiguous entities. Using a classification-based approach, we find that a simple multi-layered perceptron based on representations derived from RDF2Vec graph embeddings of entities in each knowledge graph is sufficient to achieve high accuracy, with only small amounts of training data. The contributions of our work are datasets for examining this problem and strong baselines on which future work can be based.

On Evaluation of Adversarial Perturbations for Sequence-to-Sequence Models

Adversarial examples — perturbations to the input of a model that elicit large changes in the output — have been shown to be an effective way of assessing the robustness of sequence-to-sequence (seq2seq) models. However, these perturbations only indicate weaknesses in the model if they do not change the input so significantly that it legitimately results in changes in the expected output. This fact has largely been ignored in the evaluations of the growing body of related literature. Using the example of untargeted attacks on machine translation (MT), we propose a new evaluation framework for adversarial attacks on seq2seq models that takes the semantic equivalence of the pre- and post-perturbation input into account. Using this framework, we demonstrate that existing methods may not preserve meaning in general, breaking the aforementioned assumption that source side perturbations should not result in changes in the expected output. We further use this framework to demonstrate that adding additional constraints on attacks allows for adversarial perturbations that are more meaning-preserving, but nonetheless largely change the output sequence. Finally, we show that performing untargeted adversarial training with meaning-preserving attacks is beneficial to the model in terms of adversarial robustness, without hurting test performance. A toolkit implementing our evaluation framework is released at https://…/teapot-nlp.

Improving Strong-Scaling of CNN Training by Exploiting Finer-Grained Parallelism

Scaling CNN training is necessary to keep up with growing datasets and reduce training time. We also see an emerging need to handle datasets with very large samples, where memory requirements for training are large. Existing training frameworks use a data-parallel approach that partitions samples within a mini-batch, but limits on scaling the mini-batch size and memory consumption make this untenable for large samples. We describe and implement new approaches to convolution, which parallelize using spatial decomposition or a combination of sample and spatial decomposition. This introduces many performance knobs for a network, so we develop a performance model for CNNs and present a method for using it to automatically determine efficient parallelization strategies. We evaluate our algorithms with microbenchmarks and image classification with ResNet-50. Our algorithms allow us to prototype a model for a mesh-tangling dataset, where sample sizes are very large. We show that our parallelization achieves excellent strong and weak scaling and enables training for previously unreachable datasets.

Tuning Hyperparameters without Grad Students: Scalable and Robust Bayesian Optimisation with Dragonfly

Bayesian Optimisation (BO) refers to a suite of techniques for global optimisation of expensive black box functions, which use introspective Bayesian models of the function to efficiently find the optimum. While BO has been applied successfully in many applications, modern optimisation tasks usher in new challenges where conventional methods fail spectacularly. In this work, we present Dragonfly, an open source Python library for scalable and robust BO. Dragonfly incorporates multiple recently developed methods that allow BO to be applied in challenging real world settings; these include better methods for handling higher dimensional domains, methods for handling multi-fidelity evaluations when cheap approximations of an expensive function are available, methods for optimising over structured combinatorial spaces, such as the space of neural network architectures, and methods for handling parallel evaluations. Additionally, we develop new methodological improvements in BO for selecting the Bayesian model, selecting the acquisition function, and optimising over complex domains with different variable types and additional constraints. We compare Dragonfly to a suite of other packages and algorithms for global optimisation and demonstrate that when the above methods are integrated, they enable significant improvements in the performance of BO. The Dragonfly library is available at


Using R and H2O to identify product anomalies during the manufacturing process.

(This article was first published on R-Analytics, and kindly contributed to R-bloggers)


We will identify anomalous products on the production line by using measurements from testing stations and deep learning models. Anomalous products are not failures; these anomalies are products close to the measurement limits. By detecting them we can display warnings before the process starts making failed products, so that the stations get maintenance in time.

Before starting, we need the following software installed and working:

– R language installed.
– All the R packages mentioned in the R sources.
– Testing station data; I suggest going station by station.
– H2O open source framework.
– Java 8 (for H2O), e.g. OpenJDK.
– RStudio.

Get your data.

About the data: since I cannot use my real data, for this article I am using the SECOM Data Set from the UCI Machine Learning Repository.

How many records?
– Training data set: in my real project I use 100 thousand test-passed records, which is around a month of production data.
– Testing data set: I use the last 24 hours of testing station data.

Let the fun begin.
Deep Learning Model Creation And Testing.

# Load libraries
library( h2o )

h2o.init( nthreads = -1, max_mem_size = "5G", port = 6666 )

h2o.removeAll() ## Removes the data from the h2o cluster in preparation for our final model.

# Reading the SECOM data file
allData = read.csv( "", sep = " ", header = FALSE, encoding = "UTF-8" )

# Fixing the data set: there are a lot of NaN records
if( dim( na.omit( allData ) )[1] == 0 ){
  for( colNum in 1:dim( allData )[2] ){

    # Get valid values from the current column
    ValidColumnValues = allData[, colNum][ !is.nan( allData[, colNum] ) ]

    # Check each value in the current column
    for( rowNum in 1:dim( allData )[1] ){

      cat( "Processing row:", rowNum, ", Column:", colNum, "Data:", allData[rowNum, colNum], "\n" )

      if( is.nan( allData[rowNum, colNum] ) ) {

        # Assign a random valid value to the row/column with a NaN value
        getValue = ValidColumnValues[ floor( runif( 1, min = 1, max = length( ValidColumnValues ) ) ) ]

        allData[rowNum, colNum] = getValue
      }
    }
  }
}

# Splitting the data: the first 90% for training and the remaining 10% for testing our model
trainingData = allData[ 1:floor( dim( allData )[1] * .9 ), ]
testingData = allData[ ( floor( dim( allData )[1] * .9 ) + 1 ):dim( allData )[1], ]

# Convert the training dataset to H2O format.

trainingData_hex = as.h2o( trainingData, destination_frame = "train_hex" )

# Set the input variables
featureNames = colnames(trainingData_hex)

# Creating the first model version.
trainingModel = h2o.deeplearning( x = featureNames
                                , training_frame = trainingData_hex
                                , model_id = "Station1DeepLearningModel"
                                , activation = "Tanh"
                                , autoencoder = TRUE
                                , reproducible = TRUE
                                , l1 = 1e-5
                                , ignore_const_cols = FALSE
                                , seed = 1234
                                , hidden = c( 400, 200, 400 ), epochs = 50 )

# Getting the anomalies from the training data, to set the minimum MSE ( Mean Squared Error )
# value above which a record is flagged as an anomaly
trainMSE = h2o.anomaly( trainingModel
                      , trainingData_hex
                      , per_feature = FALSE )

# Check the first 30 trainMSE records, sorted in descending order, to see our outliers
head( sort( trainMSE$Reconstruction.MSE, decreasing = TRUE ), 30 )
# 0.020288603 0.017976305 0.012772556 0.011556780 0.010143009 0.009524983 0.007363854
# 0.005889714 0.005604329 0.005189614 0.005185285 0.005118595 0.004639442 0.004497609
# 0.004438342 0.004419993 0.004298936 0.003961503 0.003651326 0.003426971 0.003367108
# 0.003169319 0.002901914 0.002852006 0.002772110 0.002765924 0.002754586 0.002748887
# 0.002619872 0.002474702

# Plotting the errors of reconstructing our training data, to get a graphical view
# of our data reconstruction errors

plot( sort( trainMSE$Reconstruction.MSE ), main = "Reconstruction Error", ylab = "MSE Value" )

# Looking at the chart and the first 30 MSE records sorted in decreasing order, we can choose .01
# as our minimum MSE before flagging a record as an anomaly, because we see just a few
# records with two decimals greater than zero, and we can treat those as outliers.
# This value is something you must decide for your own data.

# Updating the trainingData data set, keeping rows with reconstruction error < .01
trainingDataNew = trainingData[ trainMSE$Reconstruction.MSE < .01, ]

h2o.removeAll() ## Remove the data from the h2o cluster in preparation for our final model.

# Convert our new training data frame to H2O format.
trainingDataNew_hex = as.h2o( trainingDataNew, destination_frame = "train_hex" )

# Creating the final model.
trainingModelNew = h2o.deeplearning( x = featureNames
                                   , training_frame = trainingDataNew_hex
                                   , model_id = "Station1DeepLearningModel"
                                   , activation = "Tanh"
                                   , autoencoder = TRUE
                                   , reproducible = TRUE
                                   , l1 = 1e-5
                                   , ignore_const_cols = FALSE
                                   , seed = 1234
                                   , hidden = c( 400, 200, 400 ), epochs = 50 )

# Check our testing data for anomalies.

# Convert our testing data frame to H2O format.
testingDataH2O = as.h2o( testingData, destination_frame = "test_hex" )

# Getting anomalies found in the testing data.
testMSE = h2o.anomaly( trainingModelNew
                     , testingDataH2O
                     , per_feature = FALSE )

# Binding our data.
testingData = cbind( MSE = testMSE$Reconstruction.MSE, testingData )

anomalies = testingData[ testingData$MSE >= .01, ]

if( dim( anomalies )[1] > 0 ){
  cat( "Anomalies detected in the sample data, station needs maintenance." )
}
Here is the code on github:

Enjoy it!

Carlos Kassab

More information about R:



R Packages worth a look

Resampled Data Frames (strapgod)
Create data frames with virtual groups that can be used with ‘dplyr’ to efficiently compute resampled statistics, generate the data for hypothetical ou …

Binscatter Estimation and Inference (binsreg)
Provides tools for statistical analysis using the binscatter methods developed by Cattaneo, Crump, Farrell and Feng (2019a) <arXiv:1902.09608> an …

Bayesian Network Learning Improved Project (r.blip)
Allows the user to learn Bayesian networks from datasets containing thousands of variables. It focuses on score-based learning, mainly the ‘BIC’ and th …

Genetic Approach to Maximize Clustering Criterion (gama)
An evolutionary approach to performing hard partitional clustering. The algorithm uses genetic operators guided by information about the quality of ind …

Distance Object Manipulation Tools (disttools)
Provides convenient methods for accessing the data in ‘dist’ objects with minimal memory and computational overhead. ‘disttools’ can be used to extract …

Utilities to Extract and Process ‘YAML’ Fragments (yum)
Provides a number of functions to facilitate extracting information in ‘YAML’ fragments from one or multiple files, optionally structuring the informat …


Document worth reading: “A Survey of the Usages of Deep Learning in Natural Language Processing”

Over the last several years, the field of natural language processing has been propelled forward by an explosion in the use of deep learning models. This survey provides a brief introduction to the field and a quick overview of deep learning architectures and methods. It then sifts through the plethora of recent studies and summarizes a large assortment of relevant contributions. Analyzed research areas include several core linguistic processing issues in addition to a number of applications of computational linguistics. A discussion of the current state of the art is then provided along with recommendations for future research in the field. A Survey of the Usages of Deep Learning in Natural Language Processing


If you did not already know

WebSeg google
In this paper, we improve semantic segmentation by automatically learning from Flickr images associated with a particular keyword, without relying on any explicit user annotations, thus substantially alleviating the dependence on accurate annotations when compared to previous weakly supervised methods. To solve such a challenging problem, we leverage several low-level cues (such as saliency, edges, etc.) to help generate a proxy ground truth. Due to the diversity of web-crawled images, we anticipate a large amount of ‘label noise’ in which other objects might be present. We design an online noise filtering scheme which is able to deal with this label noise, especially in cluttered images. We use this filtering strategy as an auxiliary module to help assist the segmentation network in learning cleaner proxy annotations. Extensive experiments on the popular PASCAL VOC 2012 semantic segmentation benchmark show surprisingly good results in both our WebSeg (mIoU = 57.0%) and weakly supervised (mIoU = 63.3%) settings. …

Decreasing-Trend-Nature (DTN) google
We propose a novel diminishing learning rate scheme, coined Decreasing-Trend-Nature (DTN), which allows us to prove fast convergence of the Stochastic Gradient Descent (SGD) algorithm to a first-order stationary point for smooth general convex and some class of nonconvex including neural network applications for classification problems. We are the first to prove that SGD with diminishing learning rate achieves a convergence rate of $\mathcal{O}(1/t)$ for these problems. Our theory applies to neural network applications for classification problems in a straightforward way. …
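
To see why a diminishing step size matters, here is plain SGD with a decaying learning rate on a noisy quadratic; the schedule lr0/(1+t) is illustrative and is not the paper's DTN scheme:

```python
import random

def sgd_quadratic(target=3.0, steps=2000, lr0=0.5, seed=0):
    """Minimize E[(w - s)^2] over noisy samples s, with lr_t = lr0 / (1 + t)."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(steps):
        sample = target + rng.gauss(0, 0.1)   # noisy observation of the target
        grad = 2 * (w - sample)               # stochastic gradient of the loss
        lr = lr0 / (1 + t)                    # diminishing learning rate
        w -= lr * grad
    return w

w = sgd_quadratic()
print(w)  # close to 3.0
```

With this particular schedule the update reduces to a running average of the samples, so the iterate settles near the target instead of oscillating, which is the intuition behind requiring the learning rate to shrink over time.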

Super-Resolution Erlangen Database (SupER) google
Capturing ground truth data to benchmark super-resolution (SR) is challenging. Therefore, current quantitative studies are mainly evaluated on simulated data artificially sampled from ground truth images. We argue that such evaluations overestimate the actual performance of SR methods compared to their behavior on real images. To bridge this simulated-to-real gap, we introduce the Super-Resolution Erlangen (SupER) database, the first comprehensive laboratory SR database of all-real acquisitions with pixel-wise ground truth. It consists of more than 80k images of 14 scenes combining different facets: CMOS sensor noise, real sampling at four resolution levels, nine scene motion types, two photometric conditions, and lossy video coding at five levels. As such, the database exceeds existing benchmarks by an order of magnitude in quality and quantity. This paper also benchmarks 19 popular single-image and multi-frame algorithms on our data. The benchmark comprises a quantitative study by exploiting ground truth data and qualitative evaluations in a large-scale observer study. We also rigorously investigate agreements between both evaluations from a statistical perspective. One interesting result is that top-performing methods on simulated data may be surpassed by others on real data. Our insights can spur further algorithm development, and the publicly available dataset can foster future evaluations. …


March 23, 2019

Distilled News

Airflow and superQuery

‘What is the cost?’ is a question asked so frequently in the tech world that every person at a small start-up shudders slightly when it is asked, and the answer is invariably: ‘We’re not sure’. One of the best tools for scheduling workflows in the data engineering world is Apache Airflow. This tool has taken many a business out of the inflexible cron-scheduling doldrums into riding the big data waves on the high seas of Directed Acyclic Graphs (DAGs). Of course this means that large globs of data are being moved into and out of databases, and with this glorious movement often come unavoidable costs. One such database, a supercomputer if you will, is Google BigQuery. It is the flagship of the Google Cloud offering and allows data processing at the petabyte scale. It is very good at making you worry less about the power of the database infrastructure and more about the quality of your analysis and the data flow problems in need of solving. One key factor to consider with BigQuery is how easily any individual or organisation can drive up the cost of scanning data on the platform. Even the most savvy of data engineers will tell you in angst about their errors in scanning across data they didn’t really want to, pushing their business’ monthly analysis bill over the budget.

Top 5 Reasons to Move Enterprise Data Science Off the Laptop and to the Cloud

We live in a world that is inundated with data. Data science and machine learning (ML) techniques have come to the rescue in helping enterprises analyze and make sense of these large volumes of data. Enterprises have hired data scientists – people who apply scientific methods to data to build mathematical software models – to generate insights or predictions that enable data-driven business decisions. Typically, data scientists are experts in statistical analysis and mathematical modeling who are proficient in programming languages such as R or Python.

R 3.5.3 now available

The R Core Team announced yesterday the release of R 3.5.3, and updated binaries for Windows and Linux are now available (with Mac sure to follow soon). This update fixes three minor bugs (to the functions writeLines, setClassUnion, and stopifnot), but you might want to upgrade just to avoid the ‘package built under R 3.5.3’ warnings you might get for new CRAN packages in the future.

‘X affects Y’. What does that even mean?

On my last post I gave an intuitive demonstration of what’s causal inference and how it’s different than classic ML. After receiving some feedback I realize that while the post was easy to digest, some confusion remains. In this post I’ll delve a bit deeper into what the ‘causal’ in Causal Inference actually means.

What is the Difference Between AI and Machine Learning

Artificial Intelligence and Machine Learning have empowered our lives to a large extent. The number of advancements made in this space has revolutionized our society and continue making society a better place to live in. In terms of perception, both Artificial Intelligence and Machine Learning are often used in the same context which leads to confusion. AI is the concept in which machine makes smart decisions whereas Machine Learning is a sub-field of AI which makes decisions while learning patterns from the input data. In this blog, we would dissect each term and understand how Artificial Intelligence and Machine Learning are related to each other.

The importance of Graphing Your Data – Anscombe’s Clever Quartet!

Francis Anscombe’s seminal paper ‘Graphs in Statistical Analysis’ (American Statistician, 1973) effectively makes the case that looking at summary statistics of data is insufficient to identify the relationship between variables. He demonstrates this by generating four different data sets (Anscombe’s quartet) which have nearly identical summary statistics. His data have the same mean and variance for x and y, the same correlations between x and y, and the same regression coefficients on the linear projection of y on x. (There are certainly additional, less widely reported summary statistics, such as kurtosis or least absolute deviations/median regression, which would have indicated differences between the data.) Yet without graphing the data, any analysis would likely miss the mark.
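
The point is easy to verify numerically with the published quartet values; below, the first two data sets are checked in a short Python snippet (the numbers are the standard published values):

```python
# The first two of Anscombe's four data sets (standard published values)
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def mean(v):
    return sum(v) / len(v)

def corr(a, b):
    # Pearson correlation computed from raw sums, no external libraries
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

for y in (y1, y2):
    print(round(mean(y), 2), round(corr(x, y), 3))
```

Both data sets report essentially the same mean (7.50) and correlation (about 0.816), yet a scatter plot shows a noisy line in one case and a perfect curve in the other, which is exactly Anscombe's point.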

R and labelled data: Using quasiquotation to add variable and value labels

Labelling data is typically a task for end-users and is applied in own scripts or functions rather than in packages. However, sometimes it can be useful for both end-users and package developers to have a flexible way to add variable and value labels to their data. In such cases, quasiquotation is helpful. This vignette demonstrate how to use quasiquotation in sjlabelled to label your data.

Unsupervised Classification Project: Building a Movie Recommender with Clustering Analysis and K-Means

The goal of this project is to find out the similarities within groups of people in order to build a movie recommending system for users. We are going to analyze a dataset from Netflix database to explore the characteristics that people share in movies’ taste, based on how they rate them.

Dockerizing Python Flask app and Conda environment

Use Docker to package your Python Flask app and your Conda environment. This post will describe how to dockerize your Python Flask app and recreate your Conda Python environment. So you are developing a Python Flask app, and you have set up a Conda virtual environment on your local machine to run your app. Now you want to put your Python Flask app in a Docker image. Wouldn’t it be nice if you could export your current Conda environment as a .yml file, describing what Python version your application is using and which Python libraries are required to run your application, and then use the exported .yml file to build a similar environment in a Docker image and run your Flask app in that environment? The following will describe exactly how you can accomplish all of the above.
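
The workflow described boils down to two steps: export the Conda environment locally with `conda env export > environment.yml`, then recreate it inside the image. The Dockerfile below is a hedged sketch; the base image, the environment name `myenv`, and the entry file `app.py` are assumptions that must match your project:

```dockerfile
# Hypothetical Dockerfile; base image, env name, and app file are assumptions
FROM continuumio/miniconda3
WORKDIR /app

# Recreate the exported Conda environment inside the image
COPY environment.yml .
RUN conda env create -f environment.yml

# Copy the Flask app and run it inside the environment
# ("myenv" must match the name: field in environment.yml)
COPY . .
CMD ["conda", "run", "-n", "myenv", "python", "app.py"]
```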

Virtual, Headless, and Distributed (Oh My!)

This post empowers the Pythonista with a complete framework to explore the world of data on the internet, all behind randomized proxy servers in a fast parallelized sequence, while protecting your company’s immutable IP from curious eyes and other potential trolls. With this new outlet, the reader is requested to take all measures not to abuse the privilege of their acquired ghost-ninja skills, and not to tax any such services inappropriately or unethically. The user takes all responsibility for implementing the attached code (of course) and for all risks associated with running it.

Getting started with NLP using the PyTorch framework

PyTorch is one of the most popular Deep Learning frameworks that is based on Python and is supported by Facebook. In this article we will be looking into the classes that PyTorch provides for helping with Natural Language Processing (NLP).

It’s OK to use spreadsheets in data science

With all the great sophisticated data tools that exist out there these days, it’s easy to think that spreadsheets are too primitive for use in serious data science work. The fact that there’s literally 20+ years of literature cautioning people about the evils of spreadsheets makes it sound like a ‘real data professional’ should know better than to use such antiquated things. But it’s probably the greatest Swiss army chainsaw for data for the sorts of ugly work that no one ever wants to admit they have to do every day. In an ideal world they wouldn’t be necessary, but when there’s a combination of tech debt, time pressure, poor data quality, and stakeholders who don’t know anything but spreadsheets, they’re invaluable.

Image-to-Image Translation

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image. It can be applied to a wide range of applications, such as collection style transfer, object transfiguration,season transfer and photo enhancement.

EM Algorithm Explained in One Picture

The EM algorithm finds maximum-likelihood estimates for model parameters when you have incomplete data. The ‘E-Step’ finds probabilities for the assignment of data points, based on a set of hypothesized probability density functions; The ‘M-Step’ updates the original hypothesis with new data. The cycle repeats until the parameters stabilize.
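
The cycle can be sketched on a two-component 1-D Gaussian mixture (equal weights assumed for brevity; this is a toy illustration, not a production implementation):

```python
import math

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture with equal weights (toy sketch)."""
    mu1, mu2 = min(data), max(data)   # crude initialization of the two means
    var1 = var2 = 1.0
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        resp = []
        for x in data:
            p1 = math.exp(-(x - mu1) ** 2 / (2 * var1)) / math.sqrt(var1)
            p2 = math.exp(-(x - mu2) ** 2 / (2 * var2)) / math.sqrt(var2)
            resp.append(p1 / (p1 + p2))
        # M-step: re-estimate means and variances from responsibility-weighted points
        w1 = sum(resp); w2 = len(data) - w1
        mu1 = sum(r * x for r, x in zip(resp, data)) / w1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / w2
        var1 = max(sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / w1, 1e-6)
        var2 = max(sum((1 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / w2, 1e-6)
    return (mu1, var1), (mu2, var2)

# Two well-separated clusters around 0 and 10
data = [-0.5, 0.0, 0.3, 0.1, -0.2, 9.8, 10.1, 10.4, 9.9, 10.2]
(c1, c2) = em_gmm_1d(data)
print(c1[0], c2[0])  # estimated means, one near each cluster
```

The E-step assigns soft responsibilities, the M-step updates the parameters, and iterating the two stabilizes the means near the cluster centers, exactly the cycle described above.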

Top 10 Artificial Intelligence Trends in 2019

1. Automation of DevOps to achieve AIOps
2. The Emergence of More Machine Learning Platforms
3. Augmented Reality
4. Agent-Based Simulations
5. IoT
6. AI Optimized Hardware
7. Natural Language Generation
8. Streaming Data Platforms
9. Driverless Vehicles
10. Conversational BI and Analytics


Book Memo: “Reproducible Econometrics Using R”

This book is designed to facilitate reproducibility in Econometrics. It does so by using open source software (R) and recently developed tools (R Markdown and bookdown) that allow the reader to engage in reproducible research. Illustrative examples are provided throughout, and a range of topics are covered. Assignments, exams, slides, and a solution manual are available for instructors.


How cdata Control Table Data Transforms Work

With all of the excitement surrounding cdata style control table based data transforms (the cdata ideas being named as the “replacements” for tidyr‘s current methodology, by the tidyr authors themselves!) I thought I would take a moment to describe how they work.

cdata defines two primary data manipulation operators: rowrecs_to_blocks() and blocks_to_rowrecs(). These are the fundamental transforms that convert between data representations. The two representations it converts between are:

  • A world where all facts about an instance or record are in a single row (“rowrecs”).
  • A world where all facts about an instance or record are in groups of rows (“blocks”).

It turns out that once you develop the idea of specifying the data transformation as explicit data (an application of Eric S. Raymond’s admonition: “fold knowledge into data, so program logic can be stupid and robust.”), you also have a great tool for reasoning about and teaching data transforms.

For example:

rowrecs_to_blocks() does the following: for each row record, make a replicate of the control table with values filled in. In relational terms, rowrecs_to_blocks() is therefore a join of the data to the control table. Conversely, blocks_to_rowrecs() combines groups of rows into single rows, so in relational terms it is an aggregation or projection. If each of these operations is faithful (keeps enough information around), they are inverses of each other.
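
cdata itself is an R package, but the join/aggregation duality described above can be mimicked with pandas' melt and pivot (an illustrative analogue, not cdata's API):

```python
import pandas as pd

# One row per record ("rowrecs"): each row holds all facts about one id
rowrecs = pd.DataFrame({
    "id": [1, 2],
    "height": [170, 180],
    "weight": [65, 80],
})

# rowrecs_to_blocks analogue: melt into groups of rows per record
blocks = rowrecs.melt(id_vars="id", var_name="measure", value_name="value")

# blocks_to_rowrecs analogue: pivot/aggregate back into one row per record
roundtrip = (blocks.pivot(index="id", columns="measure", values="value")
                   .reset_index()
                   .rename_axis(None, axis=1))

print(roundtrip[["id", "height", "weight"]])
```

Because both directions here are faithful (no columns or keys are dropped), the round trip recovers the original table, mirroring the inverse relationship between the two cdata operators.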

We share some nifty tutorials on the ideas here:

One can build fairly clever illustrations and animations to teach the above.

The most common special cases of the above have been popularized in R as unpivot/pivot (pivot invented by Pito Salas), stack/unstack, melt/cast, or gather/spread. These special cases are handled in cdata by convenience functions unpivot_to_blocks() and pivot_to_rowrecs(). A great example of a “higher order” transform that isn’t one of the common ones is given here.

Note: the above theory and implementation is joint work of Nina Zumel and John Mount and can be found here. We would really appreciate any citations or credit you can send our way (or even politely correcting those who don’t attribute the work or attribute the work to others, as there are already a lot of mentions without credit or citation).


To cite package ‘cdata’ in publications use:

  John Mount and Nina Zumel (2019). cdata: Fluid Data Transformations.,

A BibTeX entry for LaTeX users is

    @Manual{,
      title = {cdata: Fluid Data Transformations},
      author = {John Mount and Nina Zumel},
      year = {2019},
      note = {,},
    }

Continue Reading…


Read More

Science and Technology links (March 23rd 2019)

  1. Half of American households subscribe to “Amazon Prime”, a “club membership” for Amazon customers with monthly fees. And about half of these subscribers buy something from Amazon every week. If you are counting, this seems to imply that at least a quarter of all American households order something from Amazon every week.
  2. How do the preprints that researchers post online freely differ from genuine published articles that underwent peer review? Maybe less than you’d expect:

    our results show that quality of reporting in preprints in the life sciences is within a similar range as that of peer-reviewed articles

  3. Very low meat consumption might increase the long-term risk of dementia and Alzheimer’s.
  4. We appear to be no closer to finding a cure for Alzheimer’s despite billions being spent each year on research and clinical trials. Lowe writes:

    Something is wrong with the way we’re thinking about Alzheimer’s (…) It’s been wrong for a long time and that’s been clear for a long time. Do something else.

  5. Many researchers use “p values” (a statistical measure) to prove that their results are “significant”. Ioannidis argues that most research should not rely on p values.
  6. Eating nuts improves cognition (nuts make you smart).
  7. As we age, we become more prone to diabetes. According to an article in Nature, senescent cells in the immune system may lead to diabetes. Senescent cells are cells that should be dead, due to damage or too many divisions, but refuse to die.
  8. Hospitalizations for heart attacks have declined by 38% in the last 20 years and mortality is at an all-time low. Though clinicians and health professionals take the credit, I am not convinced we understand the source of this progress.
  9. In stories, females identify more strongly with their own gender whereas males identify equally with either gender.
  10. Theranos was a large company that pretended to be able to do better blood tests. The company was backed by several granted patents. Yet we know that Theranos technology did not work. The problem we are facing now is that Theranos patents, granted on false pretenses and vague claims, remain valid and will hurt genuine inventors in the future. If we are to have patents at all, they should only be granted for inventions that work. Nazer argues that the patent system is broken.
  11. Smaller groups tend to create more innovative work, and larger groups less so.
  12. The bones of older people become fragile. A leading cause of this problem is that stem cells in our bones become less active. It appears that this is caused by excessive inflammation. We can induce the condition in young mice by exposing them to the blood serum of old mice. We can also reverse it in old mice by using an anti-inflammatory drug (akin to aspirin).
  13. Gene therapy helped mice regain sight lost due to retinal degeneration. It could work in human beings too.
  14. Based on ecological models, scientists predicted over ten years ago that polar bear populations would soon collapse. That has not happened: there may be several times more polar bears than decades ago. It is true that ice coverage is lower than it has been historically due to climate change, but it is apparently incorrect to assume that polar bears need thick ice; they may in fact thrive when the ice is thin and the summers are long. Crockford, a zoologist and professor at the University of Victoria, tells the tale in her book The Polar Bear Catastrophe That Never Happened.

Continue Reading…


Read More

Yes, I really really really like fake-data simulation, and I can’t stop talking about it.

Rajesh Venkatachalapathy writes:

Recently, I had a conversation with a colleague of mine about the virtues of synthetic data and their role in data analysis. I think I’ve heard a sermon/talk or two where you mention this and also in your blog entries. But having convinced my colleague of this point, I am struggling to find good references on this topic.

I was hoping to get some leads from you.

My reply:

Hi, here are some refs: from 2009, 2011, 2013, also this and this and this from 2017, and this from 2018. I think I’ve missed a few, too.

If you want something in dead-tree style, see Section 8.1 of my book with Jennifer Hill, which came out in 2007.

Or, for some classic examples, there’s Bush and Mosteller with the “stat-dogs” in 1954, and Ripley with his simulated spatial processes from, ummmm, 1987 I think it was? Good stuff, all. We should be doing more of it.
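The basic recipe behind all of those references fits in a few lines. Here is a minimal sketch in Python with numpy (my own illustration, not code from any of the cited works): simulate data from a model with known parameters, fit that model to the fake data, and check that the fit recovers the truth.

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: choose "true" parameters and simulate fake data from the model.
a_true, b_true, sigma = 2.0, 3.0, 0.5
x = rng.uniform(0, 10, size=1000)
y = a_true + b_true * x + rng.normal(0, sigma, size=1000)

# Step 2: fit the model to the fake data (ordinary least squares).
X = np.column_stack([np.ones_like(x), x])
a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Step 3: check that the estimates recover the known truth.
# If they don't, something is wrong with the fitting code or the model.
print(a_hat, b_hat)
```

If the recovered estimates are far from the known truth, you have found a bug or a misunderstanding before touching real data, which is the whole point.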

Continue Reading…


Read More

Strength of a Lennon song exposed with R function glue::glue

(This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers)

love_verse <- function(w1, w2, w3){
  glue::glue(
    "Love is {b}, {b} is love
     Love is {y}, {y} love
     Love is {u} to be loved",
    b = w1, y = w2, u = w3)
}

In return, the parameters sometimes give back echoes of poetry.

love_verse('real', 'feeling', 'wanting')
Love is real, real is love
Love is feeling, feeling love
Love is wanting to be loved
love_verse('touch', 'reaching', 'asking')
Love is touch, touch is love
Love is reaching, reaching love
Love is asking to be loved
## refrain
Love is you
You and me
Love is knowing
We can be
love_verse('free', 'living', 'needing')
Love is free, free is love
Love is living, living love
Love is needing to be loved

list(list(w1 = 'real',  w2 = 'feeling',  w3 = 'wanting'),
     list(w1 = 'touch', w2 = 'reaching', w3 = 'asking' ),
     list(w1 = 'free',  w2 = 'living',   w3 = 'needing')) %>% 
  purrr::map(function(x) do.call(love_verse, x))
Love is real, real is love
Love is feeling, feeling love
Love is wanting to be loved

Love is touch, touch is love
Love is reaching, reaching love
Love is asking to be loved

Love is free, free is love
Love is living, living love
Love is needing to be loved

We could also read the title of this article as “strength of an R function exposed with a Lennon song”…
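For readers outside R, the same trick translates directly to Python's str.format, which also uses named placeholders in braces (a hypothetical analogue of glue::glue, my illustration rather than anything from the original post):

```python
def love_verse(w1, w2, w3):
    # Named placeholders {b}, {y}, {u} fill in the same way
    # glue::glue's placeholders do in the R version.
    return ("Love is {b}, {b} is love\n"
            "Love is {y}, {y} love\n"
            "Love is {u} to be loved").format(b=w1, y=w2, u=w3)

print(love_verse("real", "feeling", "wanting"))
```

The main difference is that glue::glue interpolates variables from the calling environment by default, while str.format only sees the arguments you pass it.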

To leave a comment for the author, please follow the link and comment on their blog: Guillaume Pressiat.

Continue Reading…


Read More

A decade of the Datablog: 'There's a human story behind every data point'

The Guardian’s data editors in the UK, US and Australia explain how their work has influenced our journalism

The Datablog was launched in March 2009, starting in a corner of the Guardian website dedicated to the publication and analysis of data. In the last decade it has published thousands of stories and datasets on every topic imaginable, from Reading the Riots to how the UK fared in every Eurovision song contest, and its influence lives on throughout our data journalism.

How did it all begin? This is what its founder, Simon Rogers, remembers:

Continue reading...

Continue Reading…


Read More

Document worth reading: “Some techniques in density estimation”

Density estimation is an interdisciplinary topic at the intersection of statistics, theoretical computer science and machine learning. We review some old and new techniques for bounding sample complexity of estimating densities of continuous distributions, focusing on the class of mixtures of Gaussians and its subclasses.

Continue Reading…


Read More

Thanks for reading!