# My Data Science Blogs

## March 26, 2019

### Pedestrian Detection in Aerial Images Using RetinaNet

Object detection in aerial images is a challenging and interesting problem. This post shows how to use Keras to train a RetinaNet model for pedestrian detection in aerial images, which can then be used to extract valuable information.

### Most Americans like big businesses.

Why is there so much suspicion of big business?

Perhaps in part because we cannot do without business, so many people hate or resent business, and they love to criticize it, mock it, and lower its status. Business just bugs them. . . .

The short answer is, No, I don’t think there is so much suspicion of big business in this country. No, I don’t think people love to criticize, mock and lower the status of big business.

This came up a few years ago, and at the time I pulled out data from a 2007 survey showing that just about every big business you could think of was popular, with the only exception being oil companies. Microsoft, Walmart, Citibank, GM, Pfizer: you name it, the survey respondents were overwhelmingly positive.

Nearly two-thirds of respondents say corporate profits are too high, but, “more than seven in ten agree that ‘the strength of this country today is mostly based on the success of American business’ – an opinion that has changed very little over the past 20 years.”

Corporations are more popular with Republicans than with Democrats, but most of the corporations in the survey were popular with a clear majority in either party.

Big business does lots of things for us, and the United States is a proudly capitalist country, so it’s no shocker that most businesses in the survey were very popular.

So maybe the question is, Why did an economist such as Cowen think that people view big business so negatively?

My quick guess is that we notice negative statements more than positive statements. Cowen himself roots for big business; he's generally on the side of big business, so when he sees any criticism of it, he bristles. He notices the criticism and is bothered by it. When he sees positive statements about big business, that all seems so sensible that perhaps he hardly notices. The negative attitudes are jarring to him, so they're more noticeable. Perhaps in the same way that I notice bad presentations of data: an ugly table or graph is, to me, like fingernails on a blackboard.

Anyway, it’s perfectly reasonable for Cowen to be interested in those people who “hate or resent business, and they love to criticize it, mock it, and lower its status.” We should just remember that, at least from these survey data, it seems that this is a small minority of people.

Why did I write this post?

The bigger point here is that this is an example of something I see a lot, which is a social scientist or pundit coming up with theories to explain some empirical pattern in the world, but it turns out the pattern isn’t actually real. This came up years ago with Red State Blue State, when I noticed journalists coming up with explanations for voting patterns that were not happening (see for example here) and of course it comes up a lot with noise-mining research, whether it be a psychologist coming up with theories to explain ESP, or a sociologist coming up with theories to explain spurious patterns in sex ratios.

It’s fine to explain data; it’s just important to be aware of what’s being explained. In the context of the above-linked Cowen post, it’s fine to answer the question, “If business is so good, why is it so disliked?”—as long as this sentence is completed as follows: “If business is so good, why is it so disliked by a minority of Americans?” Explaining minority positions is important; we should just be clear it’s a minority.

Or of course it’s possible that Cowen has access to other data I haven’t looked at, perhaps more recent surveys that would modify my empirical understanding. That would be fine too.

P.S. The title of this post was originally “Most Americans like big business.” I changed the last word to “businesses” in response to commenters who pointed out that most Americans express negative views about “big business” in general, but they like most individual big businesses that they’re asked about.

### Data Science for Decision Makers: A Discussion with Dr Stelios Kampakis

This article contains an interview with veteran data scientist Dr Stylianos (Stelios) Kampakis, in which he discusses his career and how he helps decision makers across a range of businesses understand how data science can benefit them.

### The Stages of Relationships, Distributed

Everyone's relationship timeline is a little different. This animation plays out real-life paths to marriage.

### Four short links: 26 March 2019

Software Stack, Gig Economy, Simple Over Flexible, and Packet Radio

1. Thoughts on Conway's Law and the Software Stack (Jessie Frazelle) -- All these problems are not small by any means. They are miscommunications at various layers of the stack. They are people thinking an interface or feature is secure when it is merely a window dressing that can be bypassed with just a bit more knowledge about the stack. I really like the advice Lea Kissner gave: “take the long view, not just the broad view.” We should do this more often when building systems.
2. Troubles with the Open Source Gig Economy and Sustainability Tip Jar (Chris Aniszczyk) -- thoughtful long essay with a lot of links for background reading, on the challenges of sustainability via Patreon, etc., through to some signs of possibly-working models.
3. Choose Simple Solutions Over Flexible Ones -- flexibility does not come for free.
4. New Packet Radio (Hackaday) -- a custom radio protocol, designed to transport bidirectional IP traffic over 430MHz radio links (ham radio). This protocol is optimized for "point to multipoint" topology, with the help of managed-TDMA. Note that Hacker News commenters indicate some possible FCC violations; though, as the project comes from France, that's probably not a problem for the creators of the software.

### What's new on arXiv

These days the usage of mobile applications has become quite a normal part of our lives, since every day we use our smartphones for communication, entertainment, business and even education. High demand for various apps has led to significant growth in supply. The large number of apps offered, in turn, has made it complicated for users to find the one perfectly suitable application. In this paper the authors have made an attempt to solve the problem of facilitating search in app stores. With the help of web-crawling software, a sample of data was retrieved from one of the well-known mobile app stores and divided into 11 groups by type. Afterwards these groups of data were used to construct a Knowledge Schema – a graphic model of the interconnections of data that characterize any mobile app in the selected store. This Schema creation is the first step in the process of developing a Knowledge Graph that will group applications to facilitate users’ searches in app stores.
While deep neural networks have achieved state-of-the-art performance across a large number of complex tasks, it remains a big challenge to deploy such networks for practical, on-device edge scenarios such as on mobile devices, consumer devices, drones, and vehicles. In this study, we take a deeper exploration into a human-machine collaborative design approach for creating highly efficient deep neural networks through a synergy between principled network design prototyping and machine-driven design exploration. The efficacy of human-machine collaborative design is demonstrated through the creation of AttoNets, a family of highly efficient deep neural networks for on-device edge deep learning. Each AttoNet possesses a human-specified network-level macro-architecture comprising custom modules with unique machine-designed module-level macro-architecture and micro-architecture designs, all driven by human-specified design requirements. Experimental results for the task of object recognition showed that the AttoNets created via human-machine collaborative design have significantly fewer parameters and lower computational costs than state-of-the-art networks designed for efficiency, while achieving noticeably higher accuracy (with the smallest AttoNet achieving ~1.8% higher accuracy while requiring ~10x fewer multiply-add operations and parameters than MobileNet-V1). Furthermore, the efficacy of the AttoNets is demonstrated for the tasks of instance-level object segmentation and object detection, where an AttoNet-based Mask R-CNN network was constructed with significantly fewer parameters and computational costs (~5x fewer multiply-add operations and ~2x fewer parameters) than a ResNet-50 based Mask R-CNN network.
This work explores a novel approach for adaptive, differentiable parametrization of large-scale non-stationary random fields. Coupled with any gradient-based algorithm, the method can be applied to a variety of optimization problems, including history matching. The developed technique is based on principal component analysis (PCA) but, in contrast to other PCA-based methods, allows the parametrization process to be amended according to the behaviour of the objective function.
Finding a template in a search image is one of the core problems in many computer vision applications, such as semantic image alignment and image-to-GPS verification. We propose a novel quality-aware template matching method, QATM, which can be used not only as a standalone template matching algorithm, but also as a trainable layer that can be easily embedded into any deep neural network. Specifically, we assess the quality of a matching pair using soft-ranking among all matching pairs, so that different matching scenarios such as 1-to-1, 1-to-many, and many-to-many are all reflected in different values. Our extensive evaluation on classic template matching benchmarks and deep learning tasks demonstrates the effectiveness of QATM. It not only outperforms state-of-the-art template matching methods when used alone, but also largely improves existing deep network solutions.
Video anomaly detection under weak labels is formulated as a typical multiple-instance learning problem in previous works. In this paper, we provide a new perspective, i.e., a supervised learning task under noisy labels. From this viewpoint, as long as the label noise is cleaned away, we can directly apply fully supervised action classifiers to weakly supervised anomaly detection, and take maximum advantage of these well-developed classifiers. For this purpose, we devise a graph convolutional network to correct noisy labels. Based upon feature similarity and temporal consistency, our network propagates supervisory signals from high-confidence snippets to low-confidence ones. In this manner, the network is capable of providing cleaned supervision for action classifiers. During the test phase, we only need to obtain snippet-wise predictions from the action classifier without any extra post-processing. Extensive experiments on 3 datasets at different scales with 2 types of action classifiers demonstrate the efficacy of our method. Remarkably, we obtain a frame-level AUC score of 82.12% on UCF-Crime.
In this paper we present a novel quantum algorithm, namely the quantum grid search algorithm, to solve a special search problem. Suppose $k$ non-empty buckets are given, such that each bucket contains some marked and some unmarked items. In one trial an item is selected from each of the $k$ buckets. If every selected item is a marked item, then the search is considered successful. This search problem can also be formulated as the problem of finding a ‘marked path’ associated with specified bounds on a discrete grid. Our algorithm essentially uses several Grover search operators in parallel to efficiently solve such problems. We also present an extension of our algorithm combined with a binary search algorithm in order to efficiently solve global trajectory optimization problems. Estimates of the expected run times of the algorithms are also presented, and it is proved that our proposed algorithms offer exponential improvement over pure classical search algorithms, while a traditional Grover’s search algorithm offers only a quadratic speedup. We note that this gain comes at the cost of increased complexity of the quantum circuitry. The implication of such exponential gains in performance is that many high dimensional optimization problems, which are intractable for classical computers, can be efficiently solved by our proposed quantum grid search algorithm.
Capsule Networks (CN) offer a new architecture to the Deep Learning (DL) community. Though they have demonstrated their effectiveness on the MNIST and smallNORB datasets, the networks still face a lot of challenges on other datasets with images that have different levels of background. In this research, we improve the design of CN (vector version) and perform experiments to compare the accuracy and speed of CN versus DL models. In CN, we resort to more pooling layers to filter input images and extend the reconstruction layers to achieve better image restoration. In the DL models, we utilize Inception V3 and DenseNet V201 for demanding computers, besides NASNet, MobileNet V1 and MobileNet V2 for small and embedded devices. We evaluate our models on a fingerspelling alphabet dataset from American Sign Language (ASL). The results show that CNs perform comparably to DL models while dramatically reducing training time. We also provide a demonstration for the purpose of illustration.
Because large, human-annotated datasets suffer from labeling errors, it is crucial to be able to train deep neural networks in the presence of label noise. While training image classification models with label noise has received much attention, training text classification models has not. In this paper, we propose an approach to training deep networks that is robust to label noise. This approach introduces a non-linear processing layer (noise model) that models the statistics of the label noise into a convolutional neural network (CNN) architecture. The noise model and the CNN weights are learned jointly from noisy training data, which prevents the model from overfitting to erroneous labels. Through extensive experiments on several text classification datasets, we show that this approach enables the CNN to learn better sentence representations and is robust even to extreme label noise. We find that proper initialization and regularization of this noise model is critical. Further, in contrast to results focusing on large batch sizes for mitigating label noise in image classification, we find that altering the batch size does not have much effect on classification performance.
This report first provides a brief overview of a number of supervised learning algorithms for regression tasks. Among those are neural networks, regression trees, and the recently introduced Nexting. Nexting has been presented in the context of reinforcement learning, where it was used to predict a large number of signals at different timescales. In the second half of this report, we apply the algorithms to historical weather data in order to evaluate their suitability to forecast a local weather trend. Our experiments did not identify one clearly preferable method, but rather show that choosing an appropriate algorithm depends on the available side information. For slowly varying signals and a sufficient number of training samples, Nexting achieved good results in the studied cases.
We consider the problem of path inference: given a path prefix, i.e., a partially observed sequence of nodes in a graph, we want to predict which nodes are in the missing suffix. In particular, we focus on natural paths occurring as a by-product of the interaction of an agent with a network—a driver on the transportation network, an information seeker in Wikipedia, or a client in an online shop. Our interest is sparked by the realization that, in contrast to shortest-path problems, natural paths are usually not optimal in any graph-theoretic sense, but might still follow predictable patterns. Our main contribution is a graph neural network called Gretel. Conditioned on a path prefix, this network can efficiently extrapolate path suffixes, evaluate path likelihood, and sample from the future path distribution. Our experiments with GPS traces on a road network and user-navigation paths in Wikipedia confirm that Gretel is able to adapt to graphs with very different properties, while also comparing favorably to previous solutions.
In spite of the amazing results obtained by deep learning in many applications, a real intelligent behavior of an agent acting in a complex environment is likely to require some kind of higher-level symbolic inference. Therefore, there is a clear need for the definition of a general and tight integration between low-level tasks, processing sensorial data that can be effectively elaborated using deep learning techniques, and the logic reasoning that allows humans to take decisions in complex environments. This paper presents LYRICS, a generic interface layer for AI, which is implemented in TensorFlow (TF). LYRICS provides an input language that allows users to define arbitrary First Order Logic (FOL) background knowledge. The predicates and functions of the FOL knowledge can be bound to any TF computational graph, and the formulas are converted into a set of real-valued constraints, which participate in the overall optimization problem. This makes it possible to learn the weights of the learners under the constraints imposed by the prior knowledge. The framework is extremely general as it imposes no restrictions in terms of which models or knowledge can be integrated. In this paper, we show the generality of the approach by presenting some use cases of the language, including generative models, logic reasoning, model checking and supervised learning.
Understanding the major fraud problems in the world and interpreting the data available for analysis is a current challenge that requires interdisciplinary knowledge to complement the knowledge of computer professionals. Collaborative events (called Hackathons, Datathons, Codefests, Hack Days, etc.) have become relevant in several fields. Examples of fields which are explored in these events include startup development, open civic innovation, corporate innovation, and social issues. These events have features that favor knowledge exchange to solve challenges. In this paper, we present an event format called Short Datathon, a Hackathon for the development of exploratory data analysis and visualization skills. Our goal is to evaluate if participating in a Short Datathon can help participants learn basic data analysis and visualization concepts. We evaluated the Short Datathon in two case studies, with a total of 20 participants, carried out at the Federal University of Technology – Paraná. In both case studies we addressed the issue of tax evasion using real world data. We describe, as a result of this work, the qualitative aspects of the case studies and the perception of the participants obtained through questionnaires. Participants stated that the event helped them understand more about data analysis and visualization and that the experience with people from other areas during the event made data analysis more efficient. Further studies are necessary to evolve the format of the event and to evaluate its effectiveness.
In the recent political climate, the topic of news quality has drawn attention both from the public and the academic communities. The growing distrust of traditional news media makes it harder to find a common base of accepted truth. In this work, we design and build MediaRank (www.media-rank.com), a fully automated system to rank over 50,000 online news sources around the world. MediaRank collects and analyzes one million news webpages and two million related tweets every day. We base our algorithmic analysis on four properties journalists have established to be associated with reporting quality: peer reputation, reporting bias / breadth, bottomline financial pressure, and popularity. The major contributions of this paper include: (i) Open, interpretable quality rankings for over 50,000 of the world’s major news sources. Our rankings are validated against 35 published news rankings, including French, German, Russian, and Spanish language sources. MediaRank scores correlate positively with 34 of 35 of these expert rankings. (ii) New computational methods for measuring influence and bottomline pressure. To the best of our knowledge, we are the first to study the large-scale news reporting citation graph in-depth. We also propose new ways to measure the aggressiveness of advertisements and identify social bots, establishing a connection between both types of bad behavior. (iii) Analyzing the effect of media source bias and significance. We prove that news sources cite others despite different political views, in accord with quality measures. However, in four English-speaking countries (US, UK, Canada, and Australia), the highest ranking sources all disproportionately favor left-wing parties, even when the majority of news sources exhibited conservative slants.
Bayesian neural networks (BNNs) have recently regained a significant amount of attention in the deep learning community due to the development of scalable approximate Bayesian inference techniques. There are several advantages of using the Bayesian approach: parameter and prediction uncertainty become easily available, facilitating rigorous statistical analysis. Furthermore, prior knowledge can be incorporated. However, so far there have been no scalable techniques capable of combining both model (structural) and parameter uncertainty. In this paper we introduce the concept of model uncertainty in BNNs and hence make inference in the joint space of models and parameters. Moreover, we suggest an adaptation of a scalable variational inference approach with reparametrization of marginal inclusion probabilities to incorporate the model space constraints. Finally, we show that incorporating model uncertainty via Bayesian model averaging and Bayesian model selection makes it possible to drastically sparsify the structure of BNNs without significant loss of predictive power.
Machine learning algorithms are increasingly involved in sensitive decision-making processes with adversarial implications for individuals. This paper presents mdfa, an approach that identifies the characteristics of the victims of a classifier’s discrimination. We measure discrimination as a violation of multi-differential fairness. Multi-differential fairness is a guarantee that a black box classifier’s outcomes do not leak information on the sensitive attributes of a small group of individuals. We reduce the problem of identifying worst-case violations to matching distributions and predicting where sensitive attributes and classifier’s outcomes coincide. We apply mdfa to a recidivism risk assessment classifier and demonstrate that individuals identified as African-American with little criminal history are three times more likely to be considered at high risk of violent recidivism than similar individuals who are not African-American.

### Visualizing the 80/20 rule, with the bar-density plot

Through Twitter, Danny H. submitted the following chart, which shows that a tiny 0.3 percent of Youtube creators generate almost 40 percent of all viewing on the platform. He asks for ideas about how to present lop-sided data that follow the "80/20" rule.

In the classic 80/20 rule, 20 percent of the units account for 80 percent of the data. The percentages vary, so long as the first number is small relative to the second. In the Youtube example, 0.3 percent is compared to 40 percent. The underlying reason for such lop-sidedness is the differential importance of the units. The top units are much more important than the bottom units, as measured by their contribution to the data.
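To make the lop-sidedness concrete, here is a small R sketch (with made-up numbers, not the actual Youtube data) that finds the smallest top group of units accounting for 80 percent of the data:

```r
# 100 hypothetical creators: a few giants and a long tail (made-up numbers)
views <- sort(c(50000, 20000, 3000, 500, 200, rep(10, 95)), decreasing = TRUE)

# cumulative share of total views, starting from the top creator
cum_share <- cumsum(views) / sum(views)

# smallest number of top creators accounting for 80% of all views
top_n <- which(cum_share >= 0.8)[1]
top_n / length(views)   # 0.02: here 2% of creators drive 80% of views
```

The same calculation, applied to real data, yields the pair of percentages (share of units, share of data) that a chart like Danny's is trying to show.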

I sense a bit of "loss aversion" on this chart (explained here). The designer color-coded the views data into blue, brown and gray but didn't have it in him/her to throw out the sub-categories, which slow down cognition and add hardly anything to our understanding.

I like that the chart title explains what the chart is about.

Turning to the D corner of the Trifecta Checkup for a moment, I suspect that this chart only counts videos that have at least one play. (Zero-play videos do not show up in a play log.) For a site like Youtube, a large proportion of uploaded videos have no views and thus, many creators also have no views.

***

My initial reaction on Twitter was to use a mirrored bar chart, like this:

I ended up spending quite a bit of time exploring other concepts. In particular, I wanted to find an integrated way to present this information. Most charts, such as the mirrored bar chart, the Bumps chart (slopegraph), and the Lorenz chart, keep the two series of percentages separate.

Also, the biggest bar (the gray bar showing 97% of all creators) highlights the least important Youtubers, while the top creators ("super-creators") are cramped inside a sliver of a bar that is invisible in the original chart.

What I came up with is a bar-density plot, where I use density to encode the importance of creators, and bar lengths to encode the distribution of views.

Each bar is divided into pieces, with the number of pieces proportional to the number of creators in each segment. This has the happy result that the super-creators are represented by large (red) pieces while the least important creators by little (gray) pieces.

The embedded tessellation shows the structure of the data: the bottom third of the views are generated by a huge number of creators, producing a few views each - resulting in a high density. The top 38% of the views correspond to a small number of super-creators - appropriately shown by a bar of low density.

For those interested in technicalities, I embed a Voronoi diagram inside each bar, with randomly placed points. (There will be a companion post later this week with some more details, and R code.)

Here is what the bar-density plot looks like when the distribution is essentially uniform:

The density inside each bar is roughly the same, indicating that the creators are roughly equally important.

P.S. The next post on the bar-density plot, with some experimental R code, will be available here.

### Bar-density and pie-density plots for showing relative proportions

In my last post, I described a bar-density chart to show paired data of proportions with an 80/20-type rule. The following example illustrates that a small proportion of Youtubers generate a large proportion of views.

Other examples of this type of data include:

• the top 10% of families own 75% of U.S. household wealth (link)
• the top 1% of artists earn 77% of recorded music income (link)
• Five percent of AT&T customers consume 46% of the bandwidth (link)

In all these examples, the message of the data is the importance of a small number of people (top earners, superstars, bandwidth hogs). A good visual should call out this message.

The bar-density plot consists of two components:

• the bar chart which shows the distribution of the data (views, wealth, income, bandwidth) among segments of people;
• the embedded Voronoi diagram within each bar, which encodes the relative importance of each people segment, as measured by the (inverse) density of people within the segment - a segment is more important if each individual accounts for more of the data, or in other words, if the density of people within the group is lower.
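As a quick numerical sketch of that inverse-density idea, using the scaled creator counts and bar widths from the prototype code later in this post:

```r
# shares of creators and of views per segment (red, yellow, gray),
# taken from the scaled counts used in the prototype code in this post
creators_share <- c(3, 33, 965) / 1001
views_share    <- c(378, 276, 346) / 1000

# importance = views per unit share of creators,
# i.e. the inverse of the point density inside each bar
importance <- views_share / creators_share
round(importance, 1)   # red creators matter by far the most per head
```

The red segment comes out two orders of magnitude more "important" per creator than the gray one, which is exactly the contrast the low-density red bar is meant to convey.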

The bar chart can adopt a more conventional horizontal layout.

Voronoi tessellation

To understand the Voronoi diagram, think of a fixed number (say, 100) of randomly placed points inside a bar. Any location inside the bar area then has a nearest neighbor among those 100 fixed points. Assign every location to its nearest fixed point; one can then draw a boundary around each of the 100 points enclosing all the locations assigned to it. The resulting tessellation is the Voronoi diagram. (The following illustration comes from this AMS column.)
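That construction can be sketched in a few lines of base R (no spatstat needed): scatter some seed points, then color every location on a fine grid by its nearest seed; the colored patches are the Voronoi cells.

```r
set.seed(42)
seeds <- data.frame(x = runif(10), y = runif(10))       # 10 fixed points
grid  <- expand.grid(x = seq(0, 1, length.out = 120),
                     y = seq(0, 1, length.out = 120))   # locations to assign

# index of the nearest seed for every grid location
nearest <- apply(grid, 1, function(p)
  which.min((seeds$x - p[1])^2 + (seeds$y - p[2])^2))

# color each location by its nearest seed: the patches are the Voronoi cells
plot(grid$x, grid$y, col = rainbow(10)[nearest], pch = 15, cex = 0.5, asp = 1)
points(seeds$x, seeds$y, pch = 19)
```

This brute-force pixel assignment is only for intuition; the spatstat code later in the post computes the exact cell boundaries instead.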

The density of points in the respective bars encodes the relative proportions of people within those groups. For my example, I placed 6 points in the red bar, 66 points in the yellow bar, and ~2000 points in the gray bar, which precisely represents the relative proportions of creators in the three segments.

Density is represented statistically

Notice that the density is represented statistically, not empirically. According to the annotation on the original chart, the red bar represents 14,000 super-creators. Correspondingly, there are 4.5 million creators in the gray bar. Any attempt to plot those as individual pieces will result in a much less impactful graphic. If the representation is interpreted statistically, as relative densities within each people segment, the message of relative importance of the units within each group is appropriately conveyed.

A more sophisticated way of deciding how many points to place in the red bar remains to be developed; here, I just used the convenient number of 6.

The color shades are randomly applied to the tessellation pieces, and used to facilitate reading of densities.

***

In this section, I provide R code for those who want to explore this some more. This is prototyping code, and you're welcome to improve it. The general strategy is as follows:

• Set the rectangular area (bar) in which the Voronoi diagram is to be embedded. The length of the bar is set to the proportion of views, appropriately scaled. The code utilizes the dirichlet function from the spatstat package to generate the tessellation; this requires setting up an owin object to represent the rectangle.
• Set the number of points (n) to be embedded in the bar, determined by the relative proportion of creators, appropriately scaled. Generate a data frame containing the x-y coordinates of n randomly placed points, within the rectangle defined above.
• Use the as.ppp function to convert the points into a point pattern, then dirichlet to generate the Voronoi tessellation
• Set up a colormap for plotting the Voronoi diagram
• Plot the Voronoi diagram; assign shades at random to the pieces (in production code, these random numbers should be set as marks in the ppp object, but it's easier to play around with the shades if placed here)

The code generates separate charts for each bar segment. A post-processing step is currently required to align the bars to attain equal height. I haven't figured out whether the multiplot option helps here.

library(spatstat)

# enter the scaled proportions of creators and views
# the Youtube example has three creator segments

# number of randomly generated points should be proportional to proportion of creators. Multiply nc by a scaling factor if desired

nc = c(3, 33, 965)*2

# bar widths should be proportional to proportion of views
# total width should be set based on the width of your page

wide = c(378, 276, 346)/2

# set bar height, to attain a particular aspect ratio
bar_h = 50

# define function to generate points
# defines rectangular window

makepoints = function (n, wide, height) {
  df <- data.frame(x = runif(n, 0, wide), y = runif(n, 0, height))
  W <- owin(c(0, wide), c(0, height)) # rectangular window
  pp1 <- as.ppp(df, W)
  y <- dirichlet(pp1) # Voronoi tessellation of the n random points
  return(y)
}

# one tessellated bar per creator segment (align the bars in post-processing)
for (i in 1:length(nc)) plot(makepoints(nc[i], wide[i], bar_h), main = "")

### Document worth reading: “Taking Human out of Learning Applications: A Survey on Automated Machine Learning”

Machine learning techniques have become deeply rooted in our everyday life. However, since it is knowledge- and labor-intensive to pursue good learning performance, human experts are heavily engaged in every aspect of machine learning. In order to make machine learning techniques easier to apply and to reduce the demand for experienced human experts, automated machine learning (AutoML) has emerged as a hot topic in both industry and academia. In this paper, we provide a survey of existing AutoML works. First, we introduce and define the AutoML problem, with inspiration from both the realms of automation and machine learning. Then, we propose a general AutoML framework that not only covers almost all existing approaches but also guides the design of new methods. Afterward, we categorize and review the existing works from two aspects, i.e., the problem setup and the employed techniques. Finally, we provide a detailed analysis of AutoML approaches and explain the reasons underlying their successful applications. We hope this survey can serve not only as an insightful guideline for AutoML beginners but also as an inspiration for future research. Taking Human out of Learning Applications: A Survey on Automated Machine Learning

### What is the interpretation of the diagonal for a ROC curve

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

Last Friday, we discussed the use of ROC curves to describe the goodness of a classifier. I did say that I would post a brief paragraph on the interpretation of the diagonal. If you look around, some say that it describes the “strategy of randomly guessing a class“, that it is obtained with “a diagnostic test that is no better than chance level“, or even obtained by “making a prediction by tossing of an unbiased coin“.

Let us get back to ROC curves to illustrate those points. Consider a very simple dataset with 10 observations (that is not linearly separable)

x1 = c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
x2 = c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
y = c(1,1,1,1,1,0,0,1,0,0)
df = data.frame(x1=x1,x2=x2,y=as.factor(y))

here we can check that, indeed, it is not separable

plot(x1,x2,col=c("red","blue")[1+y],pch=19)

Consider a logistic regression (the course is on linear models)

reg = glm(y~x1+x2,data=df,family=binomial(link = "logit"))

but any model here can be used… We can use our own function

Y=df$y
S=predict(reg)
roc.curve=function(s,print=FALSE){
  Ps=(S>=s)*1
  FP=sum((Ps==1)*(Y==0))/sum(Y==0)
  TP=sum((Ps==1)*(Y==1))/sum(Y==1)
  if(print==TRUE){
    print(table(Observed=Y,Predicted=Ps))
  }
  vect=c(FP,TP)
  names(vect)=c("FPR","TPR")
  return(vect)
}

or any R package actually

library(ROCR)
perf=performance(prediction(S,Y),"tpr","fpr")

We can plot the two simultaneously here

plot(performance(prediction(S,Y),"tpr","fpr"))
V=Vectorize(roc.curve)(seq(-5,5,length=251))
points(V[1,],V[2,])
segments(0,0,1,1,col="light blue")

So our code works just fine here. Let us consider various strategies that should lead us to the diagonal. The first one is: everyone has the same probability (say 50%)

S=rep(.5,10)
plot(performance(prediction(S,Y),"tpr","fpr"))
V=Vectorize(roc.curve)(seq(0,1,length=251))
points(V[1,],V[2,])

Indeed, we have the diagonal. But to be honest, we have only two points here: $(0,0)$ and $(1,1)$. Claiming that we have a straight line is not very satisfying… Actually, note that we have this situation whatever the probability we choose

S=rep(.2,10)
plot(performance(prediction(S,Y),"tpr","fpr"))
V=Vectorize(roc.curve)(seq(0,1,length=251))
points(V[1,],V[2,])

We can try another strategy, like “making a prediction by tossing of an unbiased coin“. This is what we obtain

set.seed(1)
S=sample(0:1,size=10,replace=TRUE)
plot(performance(prediction(S,Y),"tpr","fpr"))
V=Vectorize(roc.curve)(seq(0,1,length=251))
points(V[1,],V[2,])
segments(0,0,1,1,col="light blue")

We can also try some sort of “random classifier”, where we choose the score randomly, say uniformly on the unit interval

set.seed(1)
S=runif(10)
plot(performance(prediction(S,Y),"tpr","fpr"))
V=Vectorize(roc.curve)(seq(0,1,length=251))
points(V[1,],V[2,])
segments(0,0,1,1,col="light blue")

Let us try to go further on that one.
For convenience, let us consider another function to plot the ROC curve

V=Vectorize(roc.curve)(seq(0,1,length=251))
roc_curve=Vectorize(function(x) max(V[2,which(V[1,]<=x)]))

We have the same line as previously

x=seq(0,1,by=.025)
y=roc_curve(x)
lines(x,y,type="s",col="red")

But now, consider many scoring strategies, all randomly chosen

MY=matrix(NA,500,length(y))
for(i in 1:500){
  S=runif(10)
  V=Vectorize(roc.curve)(seq(0,1,length=251))
  MY[i,]=roc_curve(x)
}
plot(performance(prediction(S,df$y),"tpr","fpr"),col="white")
for(i in 1:500){
  lines(x,MY[i,],col=rgb(0,0,1,.3),type="s")
}
lines(c(0,x),c(0,apply(MY,2,mean)),col="red",type="s",lwd=3)
segments(0,0,1,1,col="light blue")

The red line is the average of all those random classifiers. It is not a straight line, but we observe oscillations around the diagonal.
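To see why the average sits near the diagonal, the experiment can be re-implemented in a few lines of Python (a sketch, not the post's code; `roc_point` is a hypothetical helper mirroring `roc.curve` above):

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.array([1, 1, 1, 1, 1, 0, 0, 1, 0, 0])  # the toy labels from the post

def roc_point(scores, y, s):
    """FPR and TPR when predicting 1 whenever score >= s."""
    pred = (scores >= s).astype(int)
    fpr = np.sum((pred == 1) & (y == 0)) / np.sum(y == 0)
    tpr = np.sum((pred == 1) & (y == 1)) / np.sum(y == 1)
    return fpr, tpr

# average many random-score classifiers on a grid of thresholds
grid = np.linspace(0, 1, 251)
curves = []
for _ in range(500):
    s = rng.uniform(size=len(y))
    curves.append([roc_point(s, y, t) for t in grid])
mean_fpr, mean_tpr = np.mean(curves, axis=0).T
# on average, mean TPR ~ mean FPR at every threshold: the diagonal
```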

Consider a dataset with more observations

myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",head=TRUE, sep=";")
myocarde$PRONO = (myocarde$PRONO=="SURVIE")*1
reg = glm(PRONO~.,data=myocarde,family=binomial(link = "logit"))
Y=myocarde$PRONO
S=predict(reg)
plot(performance(prediction(S,Y),"tpr","fpr"))
V=Vectorize(roc.curve)(seq(-5,5,length=251))
points(V[1,],V[2,])
segments(0,0,1,1,col="light blue")

Here is a “random classifier” where we draw scores randomly on the unit interval

S=runif(nrow(myocarde))
plot(performance(prediction(S,Y),"tpr","fpr"))
V=Vectorize(roc.curve)(seq(-5,5,length=251))
points(V[1,],V[2,])
segments(0,0,1,1,col="light blue")

And if we do that 500 times, we obtain, on average

MY=matrix(NA,500,length(y))
for(i in 1:500){
  S=runif(length(Y))
  V=Vectorize(roc.curve)(seq(0,1,length=251))
  MY[i,]=roc_curve(x)
}
plot(performance(prediction(S,Y),"tpr","fpr"),col="white")
for(i in 1:500){
  lines(x,MY[i,],col=rgb(0,0,1,.3),type="s")
}
lines(c(0,x),c(0,apply(MY,2,mean)),col="red",type="s",lwd=3)
segments(0,0,1,1,col="light blue")

So it looks like we might say that the diagonal is what we have, on average, when drawing scores randomly on the unit interval… I did mention that an interesting visual tool could be related to the use of the Kolmogorov–Smirnov statistic on classifiers.
We can plot the two empirical cumulative distribution functions of the scores, given the response $Y$

score=data.frame(yobs=Y,ypred=predict(reg,type="response"))
f0=c(0,sort(score$ypred[score$yobs==0]),1)
f1=c(0,sort(score$ypred[score$yobs==1]),1)
plot(f0,(0:(length(f0)-1))/(length(f0)-1),col="red",type="s",lwd=2,xlim=0:1)
lines(f1,(0:(length(f1)-1))/(length(f1)-1),col="blue",type="s",lwd=2)

We can also look at the distribution of the scores, with histograms (or density estimates)

S=score$ypred
hist(S[Y==0],col=rgb(1,0,0,.2),probability=TRUE,breaks=(0:10)/10,border="white")
hist(S[Y==1],col=rgb(0,0,1,.2),probability=TRUE,breaks=(0:10)/10,border="white",add=TRUE)
lines(density(S[Y==0]),col="red",lwd=2,xlim=c(0,1))
lines(density(S[Y==1]),col="blue",lwd=2)
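The largest vertical gap between those two ECDFs is exactly the two-sample Kolmogorov–Smirnov statistic, which can be computed directly. A minimal Python sketch, with made-up Beta-distributed scores rather than the myocarde data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# hypothetical scores: the two class-conditional distributions overlap a little
scores_0 = rng.beta(2, 5, size=200)  # scores of the Y=0 group, mostly low
scores_1 = rng.beta(5, 2, size=200)  # scores of the Y=1 group, mostly high

# KS statistic = the largest vertical distance between the two ECDFs,
# i.e. the best achievable TPR - FPR over all thresholds
stat, pvalue = ks_2samp(scores_0, scores_1)
```

A large statistic (close to 1) means well-separated score distributions, hence a ROC curve far above the diagonal.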

The underlying idea is the following: we have a “perfect classifier” (top left corner)

if the supports of the scores do not overlap;

otherwise, we make errors. That is the case below,

where in 10% of the cases we might have misclassification,

or even more misclassification, with overlapping supports.

Now, we have the diagonal

when the two conditional distributions of the scores are identical.

Of course, that is only valid when $n$ is very large; otherwise, it is only what we observe on average…
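That last claim is easy to check numerically: when both classes draw their scores from the same distribution, the AUC converges to 1/2 as the sample grows. A quick Python sketch with hypothetical data, using the Mann–Whitney pairwise form of the AUC:

```python
import numpy as np

rng = np.random.default_rng(42)

def auc(scores_0, scores_1):
    """P(score of a random positive > score of a random negative),
    estimated over all pairs (the Mann-Whitney / rank form of AUC)."""
    s0 = np.asarray(scores_0)[:, None]
    s1 = np.asarray(scores_1)[None, :]
    return float(np.mean((s1 > s0) + 0.5 * (s1 == s0)))

# identical conditional score distributions: the ROC curve is the
# diagonal (AUC = 1/2), but only in the large-sample limit
aucs = [auc(rng.normal(size=2000), rng.normal(size=2000)) for _ in range(20)]
```

Each individual draw still fluctuates around 1/2, which is exactly the oscillation around the diagonal seen in the simulations above.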


### Pear Therapeutics: Data Scientist (Analytics) [San Francisco, CA, or Boston, MA]

As a Data Scientist, you will be responsible for shaping and delivering data-driven insights. We are looking for data scientists with a deep product sense, who have an innate curiosity, and are eager to dive into large, complex datasets and create actionable insights.

### how YOU visualized it

A sampling of the many visuals created this month

The March #SWDchallenge took on a slightly different flavor. Rather than summoning you to try a certain graph type or approach, this month’s goal was effectiveness. We sought to see how people varied in answering specific questions about the same dataset. Data was sourced from AidData in partnership with Enrico Bertini, Associate Professor at NYU, who will be undertaking some data visualization research based on this challenge.

Sixty-nine people submitted a response to answer the primary question posed (“Who donates?”) and related sub-questions on the interesting patterns in the distribution across countries and recipients. With so many ways to visualize the same dataset, you’ll see evidence that there isn’t a single “right” answer when it comes to how we show and communicate with data. Data can be visualized in countless different ways, and by varying views of the same data, we enable our audience to see different things.

To everyone who submitted examples: THANK YOU for taking the time to create and share your work! We aren’t going to call out specific entries this month, so as not to introduce bias. By participating in this month’s challenge, you’ve helped Enrico push forward some important research. We’ll be sure to share more on that front once it’s available. The submissions are posted below in alphabetical order and include the link to the original Tweet or interactive visual.

We encourage you to scroll through the entire post and be inspired by your peers’ approaches to this challenge! Spoiler alert: inspiration will be a central theme in the next challenge, which will be announced on April 1. Until then, check out the #SWDchallenge page for the archives of previous months' challenges and submissions. Happy browsing!

## Fiona

My approach to this challenge was to answer the question based on 2013 data, looking at who the top 10 donors are, who they have made donations to and for what causes. One thing I did differently was to group the purposes into broad categories to give us a rough idea of what efforts the top 10 donors are focused on contributing to. (I am looking to write a Medium post about my approach soon.)

## Joost

For this month’s challenge I decided to use Power BI as an opportunity to get more familiar with DAX. I focused on all donations by The Netherlands to other countries. Because there are a lot of recipients and a lot of purposes in this dataset, I decided to show only the top 5 recipients and purposes. But because this shows only a part of the picture, I also wanted to visualize the part of this top 5 in the total amount. I did this with the stacked bar chart and the accompanying text. The interactive visual can be seen here.

## Pallavi

I have visualized a simple line chart with a cumulative sum of the $ metric, for the top 5 donor countries, to answer the ‘Who donates?’ question. The plot shows that since 1973, the United States has been donating the highest $ amount. However, it also shows that Japan is closing in on the top spot, and likely already has.

## Rodrigo

Thanks for the challenge. I developed a little mobile-first visualization that allows people to select their country and check the top donors, the cumulative total over time, and also how donations are distributed geographically.

## Xan

Click ♥ if you've made it to the bottom—this helps us know that the time it takes to pull this together is worthwhile! Check out the #SWDchallenge page for more. Thanks for reading!

### Operator Notation for Data Transforms

As of version 1.0.8, cdata implements an operator notation for data transforms.

The idea is simple, yet powerful.

d <- wrapr::build_frame(
"id", "measure", "value" |
1   , "AUC"    , 0.7     |
1   , "R2"     , 0.4     |
2   , "AUC"    , 0.8     |
2   , "R2"     , 0.5     )

knitr::kable(d)
id measure value
1 AUC 0.7
1 R2 0.4
2 AUC 0.8
2 R2 0.5

In the above data we have two measurements each for two individuals (individuals identified by the "id" column). Using cdata‘s new_record_spec() method we can capture a description of this record structure.

library("cdata")

record_spec <- new_record_spec(
wrapr::build_frame(
"measure", "value" |
"AUC"    , "AUC" |
"R2"     , "R2"  ),
recordKeys = "id")

print(record_spec)
## $controlTable
##   measure value
## 1     AUC   AUC
## 2      R2    R2
##
## $recordKeys
## [1] "id"
##
## $controlTableKeys
## [1] "measure"
##
## attr(,"class")
## [1] "cdata_record_spec"

Once we have this specification we can transform the data using operator notation. We can collect the record blocks into rows by a "factoring"/"division" (or aggregation/projection) step.

knitr::kable(d)

id measure value
1 AUC 0.7
1 R2 0.4
2 AUC 0.8
2 R2 0.5

d2 <- d %//% record_spec

knitr::kable(d2)

id AUC R2
1 0.7 0.4
2 0.8 0.5

We can expand record rows into blocks by a "multiplication" (or join) step.

knitr::kable(d2)

id AUC R2
1 0.7 0.4
2 0.8 0.5

d3 <- d2 %**% record_spec

knitr::kable(d3)

id measure value
1 AUC 0.7
1 R2 0.4
2 AUC 0.8
2 R2 0.5

And that is truly fluid data manipulation. This article can be found in a vignette here.

### Machine Learning Boosts Startups and Industry

BigML, the leading Machine Learning platform, and GoHub from Global Omnium join forces in a strategic partnership to boost Machine Learning adoption throughout the startup and industry sectors. This partnership helps the tech and business sectors apply Machine Learning in their companies, provides them with Machine Learning education and helps them remain competitive in the […]

### Critical Thinking in Data Science

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Debbie Berebichez, a physicist, TV host and data scientist who is currently the Chief Data Scientist at Metis in NY.

# Introducing Debbie Berebichez

Hugo: Hi there, Debbie, and welcome to DataFramed.

Debbie: Hi, Hugo. It’s a pleasure of mine to be here.
Hugo: It is such a pleasure to have you on this show, and I’m really excited to be here today to talk in particular about critical thinking in data science, and what that actually means, and as we know, to get critical about critical thinking and to see what aspects of data science in the space, what ways we are being critical, where we can actually improve aspects of critical thinking, particularly with respect to data thinking in general. But before we get into that, I’d love to know a bit about you. So, could you start off by telling us what you’re known for in the data community? Debbie: Sure. Thank you. Well, I’m not sure I’m that well known in the data science community, but if I am, I would say it’s because I’m a big promoter of both critical thinking and of getting minorities, such as women, and especially Hispanic women, to enter the fields of STEM, including data science, and I’ve promoted and have started a bunch of initiatives geared towards getting more women to get into science, technology, and engineering. The second reason could be because I cohost a TV show for the Discovery Channel called Outrageous Acts of Science, so I know that a lot of people know me from there. Hugo: Right, and you also are at Metis, aren’t you, the data science bootcamp? Debbie: Absolutely. I was gonna say that next. So, I’m the chief data scientist at Metis. Metis is a data science training company that is part of Kaplan, the large education company, and we basically have two modes of teaching data science. One is through bootcamps, which we host in person at four locations, in New York, Chicago, San Francisco, and Seattle, and the second mode of teaching is through corporate training and other products. So, we teach live online intro to data science as a pre-bootcamp course, but we also customize various courses for corporations that need either visualization courses or Python programming or big data techniques and whatnot, and we’ve had quite a bit of success with that. 
Hugo: That’s great, and I look forward later on to talking about kind of the relationship to your work at Metis, bootcamps in general, how they can prepare people for a job market where … In the job market, in some respects, coding skills are at the forefront and not critical thinking skills, and how to deal with that trade-off in the education space, which is something we think a lot about. Debbie: Absolutely. Hugo: On top of that, though, you mentioned you’re a big promoter of women, in particular, Hispanic women in the space, and correct me if I’m wrong, I may have messed this up completely, but you were the first Mexican woman to get a PhD from Stanford in physics? Debbie: Wow. You didn’t get it wrong. Hugo: I got that right? Debbie: Yes. Hugo: Fantastic. Debbie: That’s right, and I think it’s an important statistic not so much to brag about it, but to show that examples like mine, of persevering and working really hard and making your dream come true exist out there, and they’re so important to talk about because they really serve as inspiration for people who sometimes think that their particular minority group or so is not suited for a career in data science or STEM. Hugo: So, is this how you got interested in data science and computation initially, through a physics PhD? Debbie: Yeah, yeah. I have kind of a, I guess, not so atypical background for data science. I did my PhD in physics at Stanford, like you said, and I did theoretical physics. I did a lot of computational work the last two years, and so I learned about models and programming and working with data. Then I moved to New York to do two postdocs at Columbia University and at the Courant Institute, part of NYU, after which I decided, like a lot of physicists, to work in Wall Street for a few years as what is sometimes derogatorily called a quant.
I was involved in creating risk models, and I did a lot of data analysis, and that’s when I realized that my skills in math and programming had other alternative ways of being applied, not just in physics. Debbie: So then, after Wall Street, I thought that that was not the field for me because I didn’t really care about just making money, even though making money is nice, but I had bigger aspirations, and I wanted to do data and ethics and help the world and change the world in many ways, and so I’d heard about this new field, sort of new field for me at least, called data science about 10 years ago, and I took a course. It was kind of like a bootcamp. I had the skills, but I didn’t know how to translate them into the different techniques and algorithms that are typical of data science. So, after taking that course, I jumped ship and I started my career in data science. Hugo: Awesome. That’s a really interesting trajectory, and I just want to step back a bit, and if you don’t want to talk about this, we don’t have to, but I’m just wondering, coming from where you were in Mexico, did you have kind of a social, cultural, and even familial or parental support to go down this path? Debbie: No, I didn’t, and that is precisely why I care so much about inspiring and helping other young women who, like myself, feel attracted to a career in science or engineering, but who for some reason, whether it be financial or social, feel that they cannot achieve their dreams. From a very young age growing up in Mexico City, I was discouraged from pursuing a career in physics and math because I was a girl, and I was told by friends and parents and teachers in school that I better pick something more feminine, and that to do physics I had to practically be a genius, which I knew I wasn’t, and so they really discouraged me so much that I became insecure about my math skills and about my ability to conquer and study the field. 
Debbie: So, years later when it came to go to university, I picked philosophy as an undergrad because I thought that that was something similar to physics. It had a lot of questions, and you could use your imagination to ask yourself why are we here, and all kinds of things that had to do with objects that surround us and their meaning and whatnot, but I realized, Hugo, that the more I tried to hide my love for physics and math, the more that this inner voice telling me to go for it and to study it was screaming at me, until two years into the bachelor’s program in Mexico, I decided behind everyone’s back to apply to schools in the US as a transfer student, and it was difficult because in Mexico we were paying an eighth of what universities cost in the US, and especially as a foreign student, it’s very hard to find scholarships and financial help, but I was extremely lucky that I got full scholarship offered to me by Brandeis University in Massachusetts, and so in the middle of my BA in philosophy, I transferred to Brandeis in the winter. I hadn’t seen the snow before, and I picked up philosophy courses, but right in my first semester I had the courage to take my first intro to physics class. It was a very large classroom with a hundred students, and the class was astronomy 101. Debbie: In that class, I realized that my passion and my love for physics was not gonna go away, and I befriended the teaching assistant in the classroom who was a graduate student by the name of Rupesh, who came from India. 
He came from Darjeeling, a town in the Himalayas, and Rupesh and I became friends, and we would meet all the time, and he was the first person who truly believed in me, and he told me that I wasn’t the typical student that just wanted to get an A in the homework, that my questions were just so curious, and I was so inquisitive, and that I really, really cared about knowing about the planets and quantum mechanics and statistical mechanics, and all kinds of things, and so he really encouraged me to try to do physics, until one day, we were walking in Harvard Square in Cambridge, and we sat under a tree, and I looked at Rupesh with tears in my eyes, and I said, “Rupesh, I just don’t want to die without trying. I don’t want to die without trying to do physics.” Debbie: He got up, and we didn’t have cellphones at the time, but he called his advisor who was the head of the physics department at Brandeis, Dr. Wardle, who was the professor in my astronomy class, and he said, “I have a student here who has a scholarship for only two years because she’s a transfer student, and I know that a BA in physics takes normally four years to complete, but she’s really, really passionate. What can we do about it?” So, Dr. Wardle called me into his office, and we had a conversation, and he basically told me, me and Rupesh, who was there with me, he said, “Believe it or not, there’s somebody else who’s done this in the past at Brandeis. His name is Ed Witten. He is-” Hugo: Wow … Debbie: I know. For those people who know physics and know who he is, he’s basically the father of string theory, so he definitely qualifies as a genius, and so I thought he was pulling my leg, like okay, Ed Witten, there’s no way I could achieve this. But he said, "Ed switched at Brandeis from history to physics, and he did it in only two years," because I couldn’t ask my family to pay for another extra two years to stay there, and so what Dr.
Wardle offered is he gave me a book called Div, Grad and Curl, which is vector calculus in three dimensions, and basically, he said to me, "If by the end of the summer you’re able to master this material," and Hugo, I didn’t even remember algebra at this point- Hugo: And of course, there’s a whole bunch of linear algebra, which goes into this vector calculus. Right? Debbie: Of course. There’s so much background you have to know to even get into studying this book. So, he said, "If in two months," because this was in the month of May, "you’re able to master this material, we’ll give you a test, and we’ll let you skip through the first two years of the physics major, so you can basically finish the whole BA in only two years." So, Rupesh looked at me, and he said, "We’re gonna do this," and he decided, incredibly, to devote his entire summer from mid June to end of August to teaching me and mentoring me, and basically covering all the subjects that I needed to master in order to enter the third year of physics in September. Debbie: It was amazing because I was so incredibly hardworking and passionate that I didn’t move from my desk. Every day, Rupesh taught me from 9:00 in the morning till 9:00 p.m. We didn’t have much time, so it was just practical, knowing how to solve derivatives on Saturday. Sunday, we’ll do integrals. Monday, first three chapters of classical mechanics, and you get the idea. So, at the end of the summer I presented the test, and I passed. I tried to not burn too many capacitors in my first electronics lab at the time, and I remember how incredibly grateful I was to Rupesh, this person that absolutely changed the course of my life. Debbie: I tell this story every time I have an opportunity because it’s incredible to me what Rupesh told me. 
I basically always wanted to pay him for all that he dedicated to me and all the effort he put into tutoring me, and he said to me that when he was growing up in India, in Darjeeling, there was an old man who used to climb up to his little town in this mountainous terrain, and used to teach him and his sisters the tabla, the musical instrument, math and English, and every time the family wanted to compensate this old man, he said, "No. The only way you could ever pay me back is if you do this with someone else in the world." Debbie: That beautiful story is how my mission in life began, and Rupesh passed the torch of knowledge to me to inspire, help, and encourage other minorities who, like myself, dream of becoming scientists or engineers, but who for some reason lack the confidence or the skills at the time, and that has really informed my career. It has been the passion that connects everything that I’ve done, and I’m incredibly grateful to that pay-it-forward story. So, after graduating with highest honors from Brandeis is when I went to Stanford, and I reconnected with Rupesh only about seven years after that, because he had gone to the South Pole to be a submillimeter astronomer, and we connected, and he was incredibly proud that I managed to graduate and do my research with a Nobel Prize winner at Stanford, and it was a great story.

# Critical Thinking and Data Science

Hugo: Firstly, Debbie, thank you so much for sharing that beautiful story.
Secondly, I wish I had a box of tissues with me right now, and thirdly, I feel like I was sitting there under that tree with you and Rupesh solving all the vector calculus challenges, and I want to give Rupesh a big hug and a bunch of cash right now as well, but of course, I’ll do exactly what I’m trying to do and what we need to be doing, which is paying it forward, and I think that actually provides a great segue into talking about critical thinking and data science, how we think about critical thinking as educators, being critical of critical thinking, and maybe I want to frame this conversation by saying there’s just a lot of talk around the skills aspiring data scientists, data analysts, data-fluent, data-literate people need to know, and sometimes to me, anyway, the conversation around this seems to be a little bit superficial, and I was wondering, firstly, if that’s the case for you, and secondly, if it is, what seems superficial about it? Debbie: Yes. I’m so glad you’re asking this question, Hugo. I can’t tell you how many times I have visited programs where I’ve been a mentor for high school students, and I’ll give you one example. One of these afternoon programs was receiving quite a bit of funding, and there were three groups of young girls from high school working in data science, and they had been taught SQL, so they were masters at it, much more than I was ever proficient at their age, so I was like, “Wow. These girls are really impressive.” There were three groups. They were working at a museum, and so one of them was working with a data set that was about birds in the museum, and they were trying to find patterns by looking at their demographics of the birds and their flying patterns and all this kind of information. Debbie: Another group was looking at astronomical objects, and a third group was working with turtles because the museum had a whole bunch of turtles in an exhibit. 
So, I went to the third group that was working with turtles, and I looked at the data that they were working with, and one of the columns said weight, so the weight of the turtles, and so I said, “Oh, wow. So, just out of curiosity, how big are the turtles that you’re working with? Have you ever seen them?” They said, “Oh, yeah, we have. They’re about the size of the palm of my hand.” I said, “Oh, cute. I’d love to see those turtles.” I said, “Okay. So, is the weight here that you have in the column … You don’t have any units for it because you just have the number, and the numbers are around 150 and 200 and 300. So, is this weight in pounds? Is it in kilograms? What is this weight in? What are the units?” Debbie: All of a sudden, these six girls in the group got all quiet, and none of them ventured to answer until one of them raised her hand and said, “Oh, I think it’s in pounds,” and I said, “Oh, wow. Let’s see. I’m about five-foot-three, and I weigh probably about 120 pounds, so this is interesting because a turtle that’s the size of my hand, basically, you’re telling me it weighs double the amount of pounds that I do. Does that make sense?” Then they all laughed and said, “Oh, yeah. You’re right. It doesn’t make sense,” and we had this very nice conversation, and we went back and forth. It turns out, after an hour, we finally found a teacher who knew, and for certain, gave us the information that the weight was actually in grams. Hugo: Wow.
Debbie: So, the girls were surprised, and that story really caught my attention because I had been visiting a lot of schools and programs that are trying to teach coding in a very kind of fast and superficial way, just to be able to say, "Our students know how to code," and I realized that in an effort to get more and more people to know the skills for data science and for data analysis in a world that’s going way too fast where we need to prepare our students for jobs in AI and machine learning and whatnot, we are forgetting what all of this is for. Coding and analyzing data has a purpose. It’s not an end in itself. The purpose is to be able to solve problems and to have insights about what the data is telling us. Debbie: If we’re not taught to ask the right questions and to think critically about where the data comes from, why is it being used or collected in a certain way, what other data could help or hurt my dataset, what biases are being introduced by this dataset, if we’re not teaching our kids to think what’s behind these techniques, then we’re basically failing, because we’re just making them like robots who can only perform a simple task if, and only if, the next dataset they see is similar in scope and structure to the one that they’re learning to work with. Debbie: It was a very moving, and in a way, also painful experience to see, because I realized how needed are those critical skills, and not only in the education at the high school level, but how many projects haven’t we seen at companies, at very large companies and advanced data science groups where there’s a significant bias being introduced because no one bothered to include a certain minority but important group in the statistical sample, or bias was introduced because people didn’t bother to check what some outliers in the dataset were describing et cetera. So, I’m very, very passionate about teaching the critical thinking skills that are behind our why for why we do data science. 
### Collecting Data

Hugo: You’ve spoken to so many essential points there. The overarching one is critical thinking, and what I like to think of, data thinking or data understanding before even … There’s a movement to put data into models and throw models at data before even looking at, as you say, units or important features, or really getting to know your data, getting to understand it, and performing that type of exploratory data analysis, and a related point that underlay a lot of what you were discussing there is thinking about the data collection process as well, and if you’re collecting data in a certain way, what are you leaving out? What are your instruments not picking up? Is your data censored for any of these reasons? Are you leaving out certain demographics because they don’t use a particular part of your service? Debbie: Mm-hmm (affirmative). Exactly. Exactly, and I think I see a lot of companies that don’t really know what data science is about, because it has become this buzzword, and everyone wants to be in it, but nobody really knows exactly what you can get out of it, and what’s happening is a lot of companies are investing significant dollar amounts in big data and solving big problems because they have collected so much data, they just build a huge infrastructure and try to find insights, but without really knowing if, first of all, those insights are important for the company, second of all, if they find them, would they be able to use them for something and enact policies or something that’s actually gonna be helpful for the goals of the company? I always remind them with this kind of simple example. One of my heroes in physics is Tycho Brahe, who was a very famous Danish astronomer. Basically, he was locked up in a tower in an island in Denmark, which I actually had the opportunity to visit last summer. Hugo: Oh, really? Debbie: Yes. Hugo: Wow. Debbie: He lived in the 1500s, an amazing man, but he also had a … Apparently, he was a nobleman.
He had an awful personality, and he lost his nose in a duel. Hugo: They say he replaced it with a golden bridge, I think. Debbie: With a bronze- Hugo: Bronze. Yeah. Okay, great. Debbie: I think that has been discredited a bit. That’s what they told me in the museum. But anyways, yeah, this very interesting character, but the amazing thing about him is that he looked at the sky without any telescope. He basically had created these sophisticated instruments, but in the 1500s, it took him years, and he created a catalog of only about a thousand stars. That’s it. So, that’s a very, very small dataset by today’s standards, but from only those thousand data points (I think it was like 1,800 or so, to be more accurate), he enabled the theories that were later created by Kepler and Copernicus, from which the laws of planetary motion were derived. Debbie: Basically, Kepler used that, and then Isaac Newton used it as the basis for the law of gravity. So, from those thousand data points came universal theories that we’re still using today, that are incredibly powerful and deep, and that is a good example to say that sometimes we can put a lot of investment into huge datasets, but when we’re talking about data literacy, large datasets also have a lot of noise, and you have to start by teaching that the most important thing is the insight that you’re going to derive from that dataset and not its size. ### Big Data Hugo: I’d like to speak to this idea of the focus on big data and the fact that a lot of us are collecting as much data as possible, thinking that all the information we need will be contained in there, even before asking critical questions, which is very dangerous, but before that, I just want to say tangentially, Tycho Brahe and Kepler’s story is so wild. I haven’t looked into it in a while, but if I recall correctly, Kepler wanted to unlock the secrets of planetary motion and figure out what was happening, and he realized that Tycho had the data. 
So, this is a story of someone realizing someone else has this data, and he went to work with him in Tycho Brahe’s, I think, final years, and Tycho didn’t even give him all the data at that point. He was actually very secretive about the data he had, and even when Brahe died, Kepler had to struggle with Brahe’s family in order to get the data. So, there were all types of data secrecy and data privacy issues at that point as well. Debbie: Also, data ownership, because what- Hugo: Exactly. That’s what I meant. Yeah. Debbie: Most people know who Kepler was, but if you ask people about Tycho Brahe, very few non-science people know, and that’s because a lot of the credit went to Kepler, and some people argue that the one that did all the meticulous observations and had theories about it was Tycho, and so he deserved more credit. So, it was kind of a crazy time, and lots of fights about data were happening. Hugo: Of course, we’re talking about a decoupling or a separation of, let’s say, humans into the people who are fantastic at collecting data and the people who are fantastic at analyzing it as well. This is a division in a lot of ways. Debbie: Yeah, absolutely. Hugo: But this focus on big data, the fact that even a lot of companies’ valuations are based around the fact that they have so much data, and it must be useful in the future, right? This is incredibly dangerous for practitioners, but also for society. Debbie: Absolutely. I mean, we did have a tipping point in that we had the hope in the ’70s of AI and changing the landscape of our society, and it didn’t quite deliver in its promise because we didn’t have the capacity to analyze very, very large datasets like we do now, and there was a tipping point where now we are able to analyze these much, much larger datasets. I mean, I think every day in the world, we produce 10 to the 18 bytes of data, like 3 exabytes of data, something like that, that we generate. 
So, obviously these are enormous scales, but what’s important is not that we now have this capacity to analyze it, but are we really getting a significant marginal insight, or are the insights that we’re getting commensurate with the ones that we were getting when we didn’t have such large datasets? Debbie: I think that question’s still out there. We haven’t been able to answer it because, as you know, the real important applications of AI are still being created and worked on. A lot of the AI things that we see out there are still simplistic in that they don’t use all of the incredible and deep capacities that AI has to solve problems. So, dimensionality of the data matters. It matters a lot, and probably for certain problems, it’s going to be hugely important. But my point is more about when you’re educating people or when you’re a company investing in certain technology, you have to be able to walk before you run, so start analyzing the smaller datasets, come up with strategies that are based more on critical thinking, and the questions that you’re trying to solve rather than the size of your dataset, and the size of the infrastructure that you’ve built. # Top 3 Critical Thinking Skills to Learn Hugo: Great. So, I’ve got a thought experiment for you, which may happen all the time. I have no idea. But a student, an aspiring data scientist data analyst comes to you and says, "I need to learn some data thinking skills, some critical thinking skills to work with data. What are the top three critical thinking skills that you think I should learn, Debbie?" Debbie: Thanks for that question, Hugo. I think the first one is you have to be a skeptic about data. You have to always … Just like when you read a scientific paper, you have to know who paid for this research. Was it the drug company that is sponsoring a paper that says their drug is the only and best drug in the world? Clearly, I’m not gonna trust that paper. 
So, a healthy skepticism about the team that collected the data, what biases could have been introduced, where was this data taken, how was it collected, what things were left out, what variables would be important in the future, et cetera. All those questions I think are super important. So, if you don’t ask them before even doing exploratory data analysis, it means you’re thinking about the data, and your relationship with the data is gonna be limited. Debbie: The second one, and this one, I came up with it from another famous physicist, Richard Feynman, who said, "The ability to not fool oneself is one of the hardest and most important skills one can acquire in life," because it’s very easy … Sometimes we think, oh, I wouldn’t be fooled by anyone, not any marketing campaign, not any government is gonna fool me, but we fool ourselves much more often than the people interpreting the data out there. So, the ability to not fall in love with what we think our data should be telling us, that is what I call fooling yourself, that is super important. Debbie: The third skill is connecting the code and the algorithms to the real world, like my example with high school girls that were working with the data. To be working with a database for three months and forgetting that behind the data are actual turtles, in this example, that’s a big mistake, the same way when Facebook is incredible at doing face recognition and analyzing relationships between groups and people, but if they’re forgetting that behind those connections are real people with real lives and real consequences, then we’re failing. We need to really connect our analysis to the world out there. Hugo: I agree, and I just want to go through those again, because I’m sure our listeners are scribbling away trying to remember all of this. 
So, the first one was a healthy skepticism about data, the second, the ability to not fool yourself, and the third, connecting the code and the real world and all the stakeholders that actually exist on the ground. Debbie: Correct. Thank you, Hugo. # Bias Hugo: So, I just want to build slightly on the ability not to fool yourself. I mean, all of these are incredibly important, but there’s a paper called, I hope I get this right, Many Analysts, One Dataset, that we’ve discussed once or twice on the podcast before, and it gives a whole bunch of statisticians and domain experts a dataset, separates them into teams, and gives them the same dataset and asks … It’s a dataset of, I think, either yellow or red cards given to football players in football or soccer matches, and the question is, are these decisions to give cards, is there some sort of ethnic bias or a racial bias in these decisions? Hugo: The fact is, what happened was 70% of the teams said one thing, 30% said the other thing, either yes or no, and then when they got to see everyone else’s results, nearly all the teams were even more sure of their own techniques and their own results. There are a lot of reasons for this, but one of the points is that people go in with a certain bias already, and if you have a bias going into a dataset, you make all these micro-decisions as an analyst, which helps you get to the place that you already thought you were going, right? Debbie: Yeah. You reminded me, funnily, of a paper that I discussed. 
I don’t even think you could consider it a scientific, sophisticated paper, but it was a paper done for the astrology, not astronomy, but Astrology Association in India years ago, and I talked about it at a conference because they first decided the hypothesis is that through some astrological charts that tell you certain characteristics about some kids, if these people that were the gurus and the chart readers and predictors were able to guess, I think that they gave themself a pretty low score. They said, "If we are able to guess 60% of the outcomes," and I think the question was whether these students were intellectually gifted or just going to be average students in school, based just on their astrological chart, "and if we’re able to get 60% of them right, then that means we are gurus, and astrology is true, and we are able to predict this with very high confidence." That was their confidence level. Debbie: The funny thing is even though they did slightly worse than a coin toss, that is they got 49% of them right, and anybody in their right mind would be able to say, "Well, clearly they did even worse than chance, a toss of a coin would’ve done better," but they themselves patted themselves on the back saying, "You see? We got 49% right. We can do this." So, it’s a very funny paper, and I encourage people to read it because it’s so easy to fool ourselves. Hugo: Absolutely, and the best thing about doing worse than a coin toss is you could actually just switch all your decisions and do better. So, we’ve been talking about critical thinking at an individual and societal level. I’m wondering how you think about the needs for all these skills, critical thinking skills, how they should be spread through organizations, and what I mean is, what type of critical thinking and data thinking skills will be needed and are needed for people who don’t even work directly with data themselves, but in jobs impacted by data? Debbie: Yes. 
That’s an excellent question because I think the more that our field of data science grows, the more that we get different dependencies in companies, different groups needing insights or even having contact with the data, and not everybody’s going to be a data scientist. We’re gonna have people just interpreting visualizations that come from the data, others using APIs and having to interpret what the algorithms come up with and whatnot. So, I think it’s essential that we spread the critical thinking message across organizations, and it has to start early in school because the ability to ask the right questions in an industry setting is incredibly important, and I don’t think we’re putting enough emphasis on it. So, I think everybody in an organization has to be trained about things such as data ethics. How is the data being collected? Are we using it for the right purpose? Data ownership, data privacy, data security, all kinds of issues that impact the manipulation of data, and so that’s part of the critical thinking process. Hugo: Hopefully, this aspect of understanding on the part of people in society and other working professionals who aren’t data scientists will result in less burden on the data scientists. What I really mean by that is … Well, there are a few ways to frame it. The first way is I think it was probably Nate Silver who said this. Any quotation where I don’t know who said it, I’ll just say it’s Nate Silver, generally. But it was probably Nate Silver who said something like, "When a data scientist gets something right, they’re thought of as a god, and when they get something wrong, they’re thought of as they’ve made the worst mistakes ever," as opposed to a job in which sometimes you get it right, and sometimes you get it wrong. 
Hugo: Another way to frame it is how it’s kind of viewed by people without data skills, who are like, "I have no idea how to deal with this, so this is what you’re going to do, and you have kind of … You’re a prophet, or you’re the holder of divine knowledge, or the high priest of data science", as I like to call them, and I wonder whether, as people develop more data skills who aren’t data scientists, this will actually help bridge this gap in a lot of ways. So, how do you think about these types of issues and challenges when building data science curricula at Metis and elsewhere? Debbie: Yeah. It’s very important for me to learn … I’m not an expert in the field of learning science, but it’s very important to me to learn how to best build curriculum that optimizes these critical thinking principles and questions that I’m talking about, and so it really depends on the curriculum. So, for example, with a team including Cathy O’Neil, who I know you’ve interviewed before, and who I love, and a group of others, seven executive women, with funding from Moody’s Analytics and the help of Girls, Inc., we developed the first data science curriculum for high school girls of under-served backgrounds, and we deployed it in New York in several high schools. Debbie: So, I think it was just this amazing experience because we try to emphasize focusing on the topic and what the consequences were of every single step in the process, from data collecting, to choosing the algorithm, to knowing how to measure the accuracy, the recall, the precision, everything that we were doing, where it comes from, how to choose the metric that was right for the problem at hand, et cetera, and so the intention was very conscious to be about how to get the most insight about the limitations and the successes of the challenge or the problem at hand. 
Debbie: When I build curriculum for the Metis bootcamp currently in my position, I want the students to have a pretty broad set of tools with which they can crack really hard problems. So, I may not focus on getting every single clustering algorithm there is in the curriculum, but I will focus on how to analyze the results of the clustering algorithms that we will see, and how to know if we’re using the right algorithms for the problem at hand, and how to be able to ask that question of our colleagues, of our communities, et cetera, because we all have limitations to our knowledge. # Metis Hugo: Yeah. There are two things there I want to focus on. The first is, as you said, at Metis, thinking about the actual problems, and thinking about the question at hand before even getting to coding I think is incredibly important, and also, educating people through questions that really pertain to them and are interesting to them. So, students will ask me, "If I want to embark upon my first data science project, what would you suggest I do?" I say, "Well, what are you interested in," and if they have a fitness tracker, for example, I say, "Maybe you could analyze your own fitness data. If you’re a foodie, scrape Yelp reviews of restaurants and work with that type of stuff. If you love movies, if you’re a cinephile, the OMDB has a fantastic API." Debbie: That’s exactly what we do at Metis. We have our students in the bootcamp use their own dataset, and they create their own project. So, it’s really cool. I encourage people to go to madeatmetis.com, and it’s a site where we have some of our greatest projects, and it’s incredible because you see people that had very basic math and programming skills coming in, and in three months they’re able to analyze contamination sources in the ocean, or some healthcare-related thing, or an app that helps you choose the best restaurant for crepes that evening, and stuff like … It’s really, really cool what you can do. 
Hugo: Yeah, and I’ll build on that by saying I’ve been to several of Metis’s graduation presentations. What do you call them? Debbie: Career Day. Hugo: Yeah. They’re incredible, and seeing all the students there present the work they’ve done is amazing, and I know that … For example, you know I’ve had Emily Robinson on the podcast. I work with her now at DataCamp, and she completed Metis, and I think she went to Etsy straight from Metis. I could be wrong there. Debbie: Yes. We love Emily. # Future of Data Science Hugo: Yeah, incredible. So, we’re gonna wrap up in a few minutes, but I’d like to … We’ve talked about the state of play of critical thinking today, but I’d like to … It’s a prediction problem. So, what does the future of data science look like to you, Debbie? Debbie: To me, it’s going to merge with the industry of IOT or the internet of things. That is, as we see the ubiquitous sensors, that these sensors are simply everywhere, from medical devices, to buildings that are smart buildings testing our comfort level, to apps that match our behavior, it’s- Hugo: I mean, you’re right. We wear them, and we carry them in our pockets, right? Debbie: Exactly, and just like the personal computer came to revolutionize the information technology field, the same way, IOT is going to revolutionize, and we’re gonna see a new paradigm where we’re going to collect substantially larger amounts of data about ourselves, our behaviors, our connections, and so issues that have to do with data privacy, data ownership, security, analysis, insights are going to become ever more important. So, what I predict is that with more automation, we’re gonna have more needs to have people that are not necessarily the data scientists working with the data, but are working in the field to analyze the ethical consequences of it, to act as peer reviewing committees, to see if there should be policies or regulations that should be enforced around certain applications, et cetera. 
So, that’s what I see for a future, more and more need for sort of adjacent professions that help with the data analysis process. Hugo: Yeah, I think you’re right in terms of defining it anyway or describing it as a merging between data science and IOT and automation. I can’t quite remember, did you give a talk on the internet of things at the NYR… Jared’s conference, a few years ago? Debbie: Yes, I did at the R… Yep. Yeah. # What is your favorite data science technique? Hugo: Okay, great. Well, I loved that talk, and Jared puts all those talks up online, so I’ll find a link for that and put that in the show notes as well, if anyone’s interested. So, I want to get a bit technical. I’m wondering what one of your favorite data sciencey techniques or methodologies is, just something you love to do. Debbie: I actually really, really love singular value decomposition, SVD. I’ve always loved linear algebra, and just the thought of being able to reduce the dimensionality of a problem is so sexy to me. In physics, we deal with it all the time, and my first encounter with it was when I worked briefly with David Botstein, who’s … This is many, many years ago at Stanford. He’s one of the creators of Genentech, the biotech company, and we were analyzing the data coming from DNA microarrays, which basically compare a sample of healthy DNA with a sample that came from a patient in order to conclude whether the patient had cancer, and in the case of a positive answer, what type of breast cancer it was. Debbie: So, it was really, really interesting because, obviously, there are so many genes in our genome that the dimension of the problem was humongous, and so to apply SVD and be able to reduce it to the dimensions that were most important enabled them to come up with pretty customized drugs that, I have heard (I have since stopped working on that topic), are working quite well for different types of breast cancer. 
So, the applications of SVD are incredible, and so I don’t know, I just really like that conceptually, and anything that has to do with that, even NLP and, I don’t know, just seeing what you can get by sacrificing a bit of information is just really interesting to me. Hugo: Well, I’m sold. I mean, you’ve motivated it through linear algebra, which I also love, and then you gave some incredibly important examples of its use, and for those of you out there who know of PCA, I’d definitely suggest you check out SVD as well. Debbie: Yeah. # Call to Action Hugo: I’ve got one final question for you. Do you have a final call to action for our listeners out there? Debbie: Yes, I do. I’ll repeat, Hugo, what I said in my Grace Hopper Celebration keynote speech a little over a year ago. Think deeply, be bold, and help others. Hugo: I think that’s fantastic, Debbie, and what we’ll do is we’ll link to your Grace Hopper talk as well, because I think the way you explained in that talk all of these things, why it’s important to think deeply, be bold, and help others, which you’ve kind of gone through in this talk as well, I think that talk can provide more context there also. Debbie: Wonderful. This has been such an awesome conversation, Hugo. Thank you. Hugo: Thank you so much, Debbie. It’s been an absolute pleasure having you on the show. To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. 
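The dimensionality reduction Debbie describes can be sketched as a truncated SVD: keep only the top-k singular components of a matrix and discard the rest. The function name and toy data below are my own illustration, not anything from the episode:

```python
import numpy as np

def truncated_svd(X, k):
    """Best rank-k approximation of X via the singular value decomposition."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A rank-1 toy matrix: every row is a multiple of [1, 2].
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
X1 = truncated_svd(X, 1)  # keeping one component recovers X almost exactly
```

In the DNA microarray setting she mentions, X would be a genes-by-samples expression matrix, and a small k keeps only the directions of greatest variation while "sacrificing a bit of information."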
Continue Reading… ### The AI Black Box Explanation Problem Introducing Black Box AI, a system for automated decision making often based on machine learning over big data, which maps a user’s features into a class predicting the behavioural traits of the individuals. Continue Reading… ### R vs Python for Data Visualization This article demonstrates creating similar plots in R and Python using two of the most prominent data visualization packages on the market, namely ggplot2 and Seaborn. Continue Reading… ### Building a Raspberry Pi security camera with OpenCV In this tutorial, you will learn how to build a Raspberry Pi security camera using OpenCV and computer vision. The Pi security camera will be IoT capable, making it possible for our Raspberry Pi to send TXT/MMS message notifications, images, and video clips when the security camera is triggered. Back in my undergrad years, I had an obsession with hummus. Hummus and pita/vegetables were my lunch of choice. I loved it. I lived on it. And I was very protective of my hummus — college kids are notorious for raiding each other’s fridges and stealing each other’s food. No one was to touch my hummus. But — I was a victim of such hummus theft on more than one occasion…and I never forgot it! I never figured out who stole my hummus, and even though my wife and I are the only ones who live in our house, I often hide the hummus in the back of the fridge (where no one will look) or under fruits and vegetables (which most people wouldn’t want to eat). Of course, back then I wasn’t as familiar with computer vision and OpenCV as I am now. Had I known then what I know now, I would have built a Raspberry Pi security camera to capture the hummus heist in action! Today I’m channeling my inner undergrad-self and laying to rest the chickpea bandit. And if he ever returns again, beware, my fridge is monitored! To learn how to build a security camera with a Raspberry Pi and OpenCV, just keep reading! 
Looking for the source code to this post? Jump right to the downloads section. ## Building a Raspberry Pi security camera with OpenCV In the first part of this tutorial, we’ll briefly review how we are going to build an IoT-capable security camera with the Raspberry Pi. Next, we’ll review our project/directory structure and install the libraries/packages to successfully build the project. We’ll also briefly review both Amazon AWS/S3 and Twilio, two services that when used together will enable us to: 1. Upload an image/video clip when the security camera is triggered. 2. Send the image/video clip directly to our smartphone via text message. From there we’ll implement the source code for the project. And finally, we’ll put all the pieces together and put our Raspberry Pi security camera into action! ### An IoT security camera with the Raspberry Pi Figure 1: Raspberry Pi + Internet of Things (IoT). Our project today will use two cloud services: Twilio and AWS S3. Twilio is an SMS/MMS messaging service. S3 is a file storage service to help facilitate the video messages. We’ll be building a very simple IoT security camera with the Raspberry Pi and OpenCV. The security camera will be capable of recording a video clip when the camera is triggered, uploading the video clip to the cloud, and then sending a TXT/MMS message which includes the video itself. We’ll be building this project specifically with the goal of detecting when a refrigerator is opened and when the fridge is closed — everything in between will be captured and recorded. Therefore, this security camera will work best in the same “open” and “closed” environment where there is a large difference in light. For example, you could also deploy this inside a mailbox that opens/closes. You can easily extend this method to work with other forms of detection, including simple motion detection and home surveillance, object detection, and more. 
I’ll leave that as an exercise for you, the reader, to implement — in that case, you can use this project as a “template” for implementing any additional computer vision functionality. ### Project structure Go ahead and grab the “Downloads” for today’s blog post. Once you’ve unzipped the files, you’ll be presented with the following directory structure: $ tree --dirsfirst
.
├── config
│   └── config.json
├── pyimagesearch
│   ├── notifications
│   │   ├── __init__.py
│   │   └── twilionotifier.py
│   ├── utils
│   │   ├── __init__.py
│   │   └── conf.py
│   └── __init__.py
└── detect.py

4 directories, 7 files

Today we’ll be reviewing four files:

• config/config.json: This commented JSON file holds our configuration. I’m providing you with this file, but you’ll need to insert your API keys for both Twilio and S3.
• pyimagesearch/notifications/twilionotifier.py: Contains the TwilioNotifier class for sending SMS/MMS messages. This is the same exact class I use for sending text, picture, and video messages with Python inside my upcoming Raspberry Pi book.
• pyimagesearch/utils/conf.py: The Conf class is responsible for loading our commented JSON configuration file.
• detect.py: The heart of today’s project is contained in this driver script. It watches for significant light change, starts recording video, and alerts me when someone steals my hummus or anything else I’m hiding in the fridge.
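As a rough illustration of what that light-change logic amounts to, here is a minimal sketch under the same assumption the tutorial makes (an open fridge door means a much brighter frame). The function names and toy thresholds are mine, not the tutorial’s actual code:

```python
def average_brightness(gray_frame):
    """Mean pixel intensity of a grayscale frame, given as a list of rows."""
    total = sum(sum(row) for row in gray_frame)
    count = sum(len(row) for row in gray_frame)
    return total / count

def fridge_is_open(gray_frame, thresh=50):
    # Mirrors the "thresh" setting in config/config.json: an open door
    # lets light in, pushing average brightness above the threshold.
    return average_brightness(gray_frame) > thresh

def door_open_too_long(opened_at, now, open_threshold_seconds=60):
    # Mirrors "open_threshold_seconds": alert if the door stays open too long.
    return (now - opened_at) > open_threshold_seconds
```

The real detect.py adds video recording and the upload/notification steps on top of checks like these.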

Now that we understand the directory structure and files therein, let’s move on to configuring our machine and learning about S3 + Twilio. From there, we’ll begin reviewing the four key files in today’s project.

### Installing package/library prerequisites

Today’s project requires that you install a handful of Python libraries on your Raspberry Pi.

In my upcoming book, all of these packages will be preinstalled in a custom Raspbian image. All you’ll have to do is download the Raspbian .img file, flash it to your micro-SD card, and boot! From there you’ll have a pre-configured dev environment with all the computer vision + deep learning libraries you need!

Note: If you want my custom Raspbian images right now (with both OpenCV 3 and OpenCV 4), you should grab a copy of either the Quickstart Bundle or Hardcopy Bundle of Practical Python and OpenCV + Case Studies which includes the Raspbian .img file.

This introductory book will also teach you OpenCV fundamentals so that you can learn how to confidently build your own projects. These fundamentals and concepts will go a long way if you’re planning to grab my upcoming Raspberry Pi for Computer Vision book.

In the meantime, you can get by with this minimal installation of packages to replicate today’s project:

• opencv-contrib-python: The OpenCV library.
• imutils: My package of convenience functions and classes.
• twilio: The Twilio package allows you to send text/picture/video messages.
• boto3: The boto3 package will communicate with the Amazon S3 file storage service. Our videos will be stored in S3.
• json-minify: Allows for commented JSON files (because we all love documentation!)

To install these packages, I recommend that you follow my pip install opencv guide to set up a Python virtual environment.

You can then pip install all required packages:

$ workon <env_name> # insert your environment name such as cv or py3cv4
$ pip install opencv-contrib-python
$ pip install imutils
$ pip install twilio
$ pip install boto3
$ pip install json-minify

Now that our environment is configured, each time you want to activate it, simply use the workon command.

Let’s review S3, boto3, and Twilio!

### What is Amazon AWS and S3?

Figure 2: Amazon’s Simple Storage Service (S3) will be used to store videos captured from our IoT Raspberry Pi. We will use the boto3 Python package to work with S3.

Amazon Web Services (AWS) has a service called Simple Storage Service, commonly known as S3.

S3 is a highly popular service used for storing files. I actually use it to host some larger files such as GIFs on this blog.

Today we’ll be using S3 to host our video files generated by the Raspberry Pi Security camera.

S3 is organized by “buckets”. A bucket contains files and folders. It also can be set up with custom permissions and security settings.

A package called boto3 will help us to transfer the files from our Internet of Things Raspberry Pi to AWS S3.

Before we dive into boto3, we need to set up an S3 bucket.

Let’s go ahead and create a bucket, resource group, and user. We’ll give the resource group permissions to access the bucket and then we’ll add the user to the resource group.

Step #1: Create a bucket

Amazon has great documentation on how to create an S3 bucket here.

Step #2: Create a resource group + user. Add the user to the resource group.

After you create your bucket, you’ll need to create an IAM user + resource group and define permissions.

• Visit the resource groups page to create a group. I named my example “s3pi”.
• Visit the users page to create a user. I named my example “raspberrypisecurity”.

Step #3: Grab your access keys. You’ll need to paste them into today’s config file.

Watch these slides to walk you through Steps 1-3, but refer to the documentation as well because slides become out of date rapidly:

Figure 3: The steps to gain API access to Amazon S3. We’ll use boto3 along with the access keys in our Raspberry Pi IoT project.
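Once the bucket and keys exist, the transfer itself reduces to a single boto3 call. The key-building helper below is my own naming, not the tutorial’s code; the commented lines sketch the actual upload, which requires boto3 and valid credentials:

```python
import os

def s3_key_for(video_path, prefix="fridge-events"):
    # e.g. "/home/pi/videos/clip.avi" -> "fridge-events/clip.avi"
    return "{}/{}".format(prefix, os.path.basename(video_path))

# The actual transfer (requires `pip install boto3` and your access keys):
# import boto3
# s3 = boto3.client("s3",
#                   aws_access_key_id=conf["aws_access_key_id"],
#                   aws_secret_access_key=conf["aws_secret_access_key"])
# s3.upload_file(video_path, conf["s3_bucket"], s3_key_for(video_path))
```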

### Obtaining your Twilio API keys

Figure 4: Twilio is a popular SMS/MMS platform with a great API.

Twilio, a phone number service with an API, allows for voice, SMS, MMS, and more.

Twilio will serve as the bridge between our Raspberry Pi and our cell phone. I want to know exactly when the chickpea bandit is opening my fridge so that I can take countermeasures.

Let’s set up Twilio now.

Step #1: Create an account and get a free number.

Go ahead and sign up for Twilio and you’ll be assigned a temporary trial number. You can purchase a number + quota later if you choose to do so.

Step #2: Grab your API keys.

Now we need to obtain our API keys. Here’s a screenshot showing where to create one and copy it:

Figure 5: The Twilio API keys are necessary to send text messages with Python.

A final note: Twilio also supports the popular WhatsApp messaging platform. WhatsApp support is welcome news for the international community; however, it is currently in beta. Today we’ll be demonstrating standard SMS/MMS only. I’ll leave it up to you to explore Twilio in conjunction with WhatsApp.

### Our JSON configuration file

There are a number of variables that need to be specified for this project, and instead of hardcoding them, I decided to keep our code more modular and organized by putting them in a dedicated JSON configuration file.

Since JSON doesn’t natively support comments, our Conf class will take advantage of JSON-minify to parse out the comments. If JSON isn’t your config file of choice, you could use YAML or XML instead.
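To see why the extra minify step is needed, here is a tiny stand-in for comment stripping (the real json_minify package also handles block comments and comment-like text inside strings; this naive version is only an illustration):

```python
import json
import re

def strip_line_comments(text):
    # remove // line comments (naive: assumes no "//" inside string values)
    return re.sub(r"//[^\n]*", "", text)

raw = """{
    // light threshold for an open refrigerator
    "thresh": 50,
    "open_threshold_seconds": 60
}"""

# plain json.loads(raw) would raise, since comments are not valid JSON;
# stripping them first makes the file parseable
conf = json.loads(strip_line_comments(raw))
print(conf["thresh"])  # -> 50
```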

Let’s take a look at the commented JSON file now:

{
// two constants, first threshold for detecting if the
// refrigerator is open, and a second threshold for the number of
// seconds the refrigerator is open
"thresh": 50,
"open_threshold_seconds": 60,

Lines 5 and 6 contain two settings. The first is the light threshold for determining when the refrigerator is open. The second is a threshold for the number of seconds until it is determined that someone left the door open.

Now let’s handle AWS + S3 configs:

// variables to store your aws account credentials
"aws_access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
"aws_secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
"s3_bucket": "YOUR_AWS_S3_BUCKET",

Each of the values on Lines 9-11 is available in your AWS console (we just generated them in the “What is Amazon AWS and S3?” section above).

And finally our Twilio configs:

// variables to store your twilio account credentials
"twilio_sid": "YOUR_TWILIO_SID",
"twilio_auth": "YOUR_TWILIO_AUTH_ID",
"twilio_to": "YOUR_PHONE_NUMBER",
"twilio_from": "YOUR_TWILIO_PHONE_NUMBER"
}

Twilio security settings are on Lines 14 and 15. The "twilio_from" value must match one of your Twilio phone numbers. If you’re using the trial, you only have one number. If you use the wrong number, are out of quota, etc., Twilio will likely send an error message to your email address.

Phone numbers can be formatted like this in the U.S.: "+1-555-555-5555".
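Twilio’s canonical format is E.164 (a leading "+" followed by digits only), so it can’t hurt to normalize whatever the user types. A small hypothetical helper (the function name and behavior are illustrative, not part of the Twilio API):

```python
def to_e164(number):
    # keep only the digits and re-add the leading "+",
    # e.g. "+1-555-555-5555" -> "+15555555555"
    digits = "".join(ch for ch in number if ch.isdigit())
    return "+" + digits

print(to_e164("+1-555-555-5555"))  # -> +15555555555
```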

Our configuration file includes comments (for documentation purposes), which unfortunately means we cannot use Python’s built-in json module to load it directly. Instead, we’ll use a combination of JSON-minify and a custom Conf class to load our JSON file as a Python dictionary.

Let’s take a look at how to implement the Conf class now:

# import the necessary packages
from json_minify import json_minify
import json

class Conf:
	def __init__(self, confPath):
		# load and store the configuration and update the object's
		# dictionary
		conf = json.loads(json_minify(open(confPath).read()))
		self.__dict__.update(conf)

	def __getitem__(self, k):
		# return the value associated with the supplied key
		return self.__dict__.get(k, None)

This class is relatively straightforward. Notice that in the constructor, we use json_minify (Line 9) to parse out the comments prior to passing the file contents to json.loads.

The __getitem__ method will grab any value from the configuration with dictionary syntax. In other words, we won’t call this method directly — rather, we’ll simply use dictionary syntax in Python to grab a value associated with a given key.
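The dictionary syntax (and the None fallback for missing keys) can be seen in a self-contained sketch; here the dict is passed in directly instead of being loaded from a JSON file:

```python
class Conf:
    def __init__(self, conf):
        # in the real class the dict comes from the minified JSON file;
        # here we pass it in directly for illustration
        self.__dict__.update(conf)

    def __getitem__(self, k):
        # missing keys return None rather than raising KeyError
        return self.__dict__.get(k, None)

conf = Conf({"thresh": 50, "open_threshold_seconds": 60})
print(conf["thresh"])       # -> 50
print(conf["missing_key"])  # -> None
```

The None fallback keeps the driver script simple, at the cost of silently hiding typos in config key names.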

Once our security camera is triggered we’ll need methods to:

• Upload the images/video to the cloud (since the Twilio API cannot directly serve “attachments”).
• Utilize the Twilio API to actually send the text message.

To keep our code neat and organized, we’ll encapsulate this functionality inside a class named TwilioNotifier — let’s review this class now:

# import the necessary packages
from twilio.rest import Client
import boto3
from threading import Thread

class TwilioNotifier:
	def __init__(self, conf):
		# store the configuration object
		self.conf = conf

	def send(self, msg, tempVideo):
		# start a thread to send the message (don't block the caller)
		t = Thread(target=self._send, args=(msg, tempVideo,))
		t.start()

On Lines 2-4, we import the Twilio Client, Amazon’s boto3, and Python’s built-in Thread.

From there, our TwilioNotifier class and constructor are defined on Lines 6-9. Our constructor accepts a single parameter, the configuration, which we presume has been loaded from disk via the Conf class.

This project only demonstrates sending messages. We’ll be demonstrating receiving messages with Twilio in an upcoming blog post as well as in the Raspberry Pi Computer Vision book.

The send method is defined on Lines 11-14. This method accepts two key parameters:

• The string text, msg.
• The video file, tempVideo. Once the video is successfully stored in S3, it will be removed from the Pi to save space. Hence it is a temporary video.

The send method kicks off a Thread to actually send the message, ensuring the main thread of execution is not blocked.
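The non-blocking pattern is just a Thread wrapping the worker method. A minimal sketch with a dummy _send (the join at the end exists only so we can check the result; the real code never joins, which is the whole point):

```python
from threading import Thread

results = []

def _send(msg):
    # stand-in for the real upload + SMS work
    results.append(msg)

def send(msg):
    # launch the worker so the caller (e.g. a frame loop) is not blocked
    t = Thread(target=_send, args=(msg,))
    t.start()
    return t

t = send("fridge opened")
t.join()  # only for this demo, to wait for the worker
print(results)  # -> ['fridge opened']
```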

Thus, the core text message sending logic is in the next method, _send:

	def _send(self, msg, tempVideo):
		# create a s3 client object
		s3 = boto3.client("s3",
			aws_access_key_id=self.conf["aws_access_key_id"],
			aws_secret_access_key=self.conf["aws_secret_access_key"],
		)

		# get the filename and upload the video in public read mode
		filename = tempVideo.path[tempVideo.path.rfind("/") + 1:]
		s3.upload_file(tempVideo.path, self.conf["s3_bucket"],
			filename, ExtraArgs={"ACL": "public-read",
			"ContentType": "video/mp4"})

The _send method is defined on Line 16. It operates as an independent thread so as not to impact the driver script flow.

Parameters (msg and tempVideo) are passed in when the thread is launched.

The _send method first uploads the video to AWS S3 by:

• Initializing the s3 client with the access key and secret access key (Lines 18-21).
• Uploading the video in public-read mode so it can later be served via a URL.

Line 24 simply extracts the filename from the video path since we’ll need it later.
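The slicing trick is equivalent to os.path.basename for POSIX-style paths; both forms are shown below on a made-up path:

```python
import os

path = "/tmp/abc123.mp4"

# the slicing approach: everything after the last "/"
filename = path[path.rfind("/") + 1:]

# the stdlib equivalent, which reads more clearly
assert filename == os.path.basename(path)
print(filename)  # -> abc123.mp4
```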

Let’s go ahead and send the message:

		# get the bucket location and build the url
		location = s3.get_bucket_location(
			Bucket=self.conf["s3_bucket"])["LocationConstraint"]
		url = "https://s3-{}.amazonaws.com/{}/{}".format(location,
			self.conf["s3_bucket"], filename)

		# initialize the twilio client and send the message
		client = Client(self.conf["twilio_sid"],
			self.conf["twilio_auth"])
		client.messages.create(to=self.conf["twilio_to"],
			from_=self.conf["twilio_from"], body=msg, media_url=url)

		# delete the temporary file
		tempVideo.cleanup()

To send the message and have the video show up in a cell phone messaging app, we need to send the actual text string along with a URL to the video file in S3.

Note: This must be a publicly accessible URL, so ensure that your S3 settings are correct.

The URL is generated on Lines 30-33.
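The URL itself is just region + bucket + filename glued together. A quick check of that format with made-up values (note that S3 URL conventions vary by region and have changed over time, so treat this as a sketch of the string built above rather than a universal rule):

```python
def build_s3_url(location, bucket, filename):
    # mirrors the format string used in _send
    return "https://s3-{}.amazonaws.com/{}/{}".format(
        location, bucket, filename)

url = build_s3_url("us-west-2", "my-bucket", "clip.mp4")
print(url)  # -> https://s3-us-west-2.amazonaws.com/my-bucket/clip.mp4
```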

From there, we’ll create a Twilio client (not to be confused with our boto3 s3 client) on Lines 36 and 37.

Lines 38 and 39 actually send the message. Notice the to, from_, body, and media_url parameters.

Finally, we’ll remove the temporary video file to save some precious space (Line 42). If we don’t do this, your Pi may run out of space if your disk space is already low.

### The Raspberry Pi security camera driver script

Now that we have (1) our configuration file, (2) a method to load the config, and (3) a class to interact with the S3 and Twilio APIs, let’s create the main driver script for the Raspberry Pi security camera.

The way this script works is relatively simple:

• It monitors the average amount of light seen by the camera.
• When the refrigerator door opens, the light comes on, the Pi detects the light, and the Pi starts recording.
• When the refrigerator door is closed, the light turns off, the Pi detects the absence of light, and the Pi stops recording + sends me or you a video message.
• If someone leaves the refrigerator open for longer than the specified seconds in the config file, I’ll receive a separate text message indicating that the door was left open.

Let’s go ahead and implement these features.
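Before diving into the full script, the open/close behavior above amounts to a tiny state machine over the light level. A minimal sketch with simulated brightness readings (the threshold and readings here are made up, and the messaging/recording work is reduced to event labels):

```python
THRESH = 50

def run(readings, thresh=THRESH):
    # walk a sequence of mean-brightness readings and emit events
    # on open/close transitions of the refrigerator door
    events = []
    fridge_open = False
    for mean in readings:
        prev_open = fridge_open
        fridge_open = mean > thresh
        if fridge_open and not prev_open:
            # door just opened: start recording
            events.append("start_recording")
        elif prev_open and not fridge_open:
            # door just closed: stop recording and notify
            events.append("stop_and_notify")
    return events

print(run([10, 80, 90, 12]))  # -> ['start_recording', 'stop_and_notify']
```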

Open up the detect.py file and insert the following code:

# import the necessary packages
from __future__ import print_function
from pyimagesearch.notifications import TwilioNotifier
from pyimagesearch.utils import Conf
from imutils.video import VideoStream
from imutils.io import TempFile
from datetime import datetime
from datetime import date
import numpy as np
import argparse
import imutils
import signal
import time
import cv2
import sys

Lines 2-15 import our necessary packages. Notably, we’ll be using our TwilioNotifier, our Conf class, VideoStream, imutils, and OpenCV.

Let’s define an interrupt signal handler and parse for our config file path argument:

# function to handle keyboard interrupt
def signal_handler(sig, frame):
	print("[INFO] You pressed ctrl + c! Closing refrigerator monitor" \
		" application...")
	sys.exit(0)

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-c", "--conf", required=True,
	help="Path to the input configuration file")
args = vars(ap.parse_args())

Our script will run headless because we don’t need an HDMI screen inside the fridge.

On Lines 18-21, we define a signal_handler function to capture “ctrl + c” events from the keyboard gracefully. It isn’t always necessary to do this, but if you need anything to execute before the script exits (such as someone disabling your security camera!), you can put it in this function.

We have a single command line argument to parse. The --conf flag (the path to the config file) can be provided directly in the terminal or in a launch-on-reboot script. You may learn more about command line arguments here.

Let’s perform our initializations:

# load the configuration file and initialize the Twilio notifier
conf = Conf(args["conf"])
tn = TwilioNotifier(conf)

# initialize the flags for fridge open and notification sent
fridgeOpen = False
notifSent = False

# initialize the video stream and allow the camera sensor to warmup
print("[INFO] warming up camera...")
# vs = VideoStream(src=0).start()
vs = VideoStream(usePiCamera=True).start()
time.sleep(2.0)

# signal trap to handle keyboard interrupt
signal.signal(signal.SIGINT, signal_handler)
print("[INFO] Press ctrl + c to exit, or 'q' to quit if you have" \
" the display option on...")

# initialize the video writer and the frame dimensions (we'll set
# them as soon as we read the first frame from the video)
writer = None
W = None
H = None

Our initializations take place on Lines 30-52. Let’s review them:

• Lines 30 and 31 instantiate our Conf and TwilioNotifier objects.
• Two status variables are initialized to determine when the fridge is open and when a notification has been sent (Lines 34 and 35).
• We’ll start our VideoStream on Lines 39-41. I’ve elected to use a PiCamera, so Line 39 (USB webcam) is commented out. You can easily swap these if you are using a USB webcam.
• Line 44 registers our signal_handler so keyboard interrupts are trapped gracefully.
• Our video writer and frame dimensions are initialized on Lines 50-52.

It’s time to begin looping over frames:

# loop over the frames of the stream
while True:
	# grab both the next frame from the stream and the previous
	# refrigerator status
	frame = vs.read()
	fridgePrevOpen = fridgeOpen

	# quit if there was a problem grabbing a frame
	if frame is None:
		break

	# resize the frame and convert the frame to grayscale
	frame = imutils.resize(frame, width=200)
	gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

	# if the frame dimensions are empty, set them
	if W is None or H is None:
		(H, W) = frame.shape[:2]

Our while loop begins on Line 55. We proceed to read a frame from our video stream (Line 58). The frame undergoes a sanity check on Lines 62 and 63 to determine if we have a legitimate image from our camera.

Line 59 sets our fridgePrevOpen flag. The previous value must always be set at the beginning of the loop, based on the current value, which will be determined later.

Our frame is resized to a dimension that will look reasonable on a smartphone and also make for a smaller file size for our MMS video (Line 66).

On Line 67, we create a grayscale image from frame — we’ll need it soon to determine the average amount of light in the frame.

Our dimensions are set via Lines 70 and 71 during the first iteration of the loop.

Now let’s determine if the refrigerator is open:

	# calculate the average of all pixels where a higher mean
	# indicates that there is more light coming into the refrigerator
	mean = np.mean(gray)

	# determine if the refrigerator is currently open
	fridgeOpen = mean > conf["thresh"]

Determining if the refrigerator is open is a dead-simple, two-step process:

1. Average all pixel intensities of our grayscale image (Line 75).
2. Compare the average to the threshold value in our configuration (Line 78). I’m confident that a value of 50 (in the config.json file) will be an appropriate threshold for most refrigerators with a light that turns on and off as the door is opened and closed. That said, you may want to experiment with tweaking that value yourself.

The fridgeOpen variable is simply a boolean indicating whether the refrigerator is open.
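You can sanity-check the idea on synthetic frames with NumPy alone; the pixel values below are made up to sit clearly on either side of the 50 threshold:

```python
import numpy as np

thresh = 50

# a uniformly dark frame (door closed) and a bright one (door open)
dark = np.full((120, 200), 20, dtype=np.uint8)
bright = np.full((120, 200), 180, dtype=np.uint8)

print(np.mean(dark) > thresh)    # -> False
print(np.mean(bright) > thresh)  # -> True
```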

Let’s now determine if we need to start capturing a video:

	# if the fridge is open and previously it was closed, it means
	# the fridge has been just opened
	if fridgeOpen and not fridgePrevOpen:
		# record the start time
		startTime = datetime.now()

		# create a temporary video file and initialize the video
		# writer object
		tempVideo = TempFile(ext=".mp4")
		writer = cv2.VideoWriter(tempVideo.path, 0x21, 30, (W, H),
			True)

As shown by the conditional on Line 82, so long as the refrigerator was just opened (i.e., it was not previously open), we will initialize our video writer.

We’ll go ahead and grab the startTime, create a tempVideo, and initialize our video writer with the temporary file path (Lines 84-90).

Now we’ll handle the case where the refrigerator was previously open:

	# if the fridge is open then there are 2 possibilities:
	# 1) it's left open for more than the *threshold* seconds
	# 2) it's closed in less than or equal to the *threshold* seconds
	elif fridgePrevOpen:
		# calculate the time difference between the current time and
		# start time
		timeDiff = (datetime.now() - startTime).seconds

		# if the fridge is open and the time difference is greater
		# than threshold, then send a notification
		if fridgeOpen and timeDiff > conf["open_threshold_seconds"]:
			# if a notification has not been sent yet, then send one
			if not notifSent:
				# build the message and send a notification
				msg = "Intruder has left your fridge open!!!"

				# release the video writer pointer and reset the
				# writer object
				writer.release()
				writer = None

				# send the message and the video to the owner and
				# set the notification sent flag
				tn.send(msg, tempVideo)
				notifSent = True

If the refrigerator was previously open, let’s check to ensure it wasn’t left open long enough to trigger an “Intruder has left your fridge open!” alert.

Kids can leave the refrigerator open by accident, or maybe after a holiday, you have a lot of food preventing the refrigerator door from closing all the way. You don’t want your food to spoil, so you may want these alerts!

For this message to be sent, the timeDiff must be greater than the threshold set in the config (Lines 98-102).
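One subtlety in the timeDiff computation: timedelta.seconds is only the seconds component of the difference (it ignores whole days), while total_seconds() is the full duration. For a fridge door the distinction rarely matters, but it is worth knowing:

```python
from datetime import timedelta

d = timedelta(days=1, seconds=5)

# .seconds is just the seconds field; the day is not included
print(d.seconds)          # -> 5

# .total_seconds() is the entire duration
print(d.total_seconds())  # -> 86405.0
```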

This message, including the msg and the video, is sent to you as shown on Lines 107-117. The msg is defined, the writer is released, and the notification sent flag is set.

Let’s now take care of the most common scenario where the refrigerator was previously open, but now it is closed (i.e. some thief stole your food, or maybe it was you when you became hungry):

		# check to see if the fridge is closed
		elif not fridgeOpen:
			# reset notifSent to False for the next iteration
			if notifSent:
				notifSent = False

			# if a notification has not been sent, then send one now
			else:
				# record the end time and calculate the total time in
				# seconds
				endTime = datetime.now()
				totalSeconds = (endTime - startTime).seconds
				dateOpened = date.today().strftime("%A, %B %d %Y")

				# build the message and send a notification
				msg = "Your fridge was opened on {} at {} " \
					"for {} seconds.".format(dateOpened,
					startTime.strftime("%I:%M%p"), totalSeconds)

				# release the video writer pointer and reset the
				# writer object
				writer.release()
				writer = None

				# send the message and the video to the owner
				tn.send(msg, tempVideo)

The case beginning on Line 120 will send a video message indicating, “Your fridge was opened on {{ day }} at {{ time }} for {{ seconds }}.”

On Lines 123 and 124, our notifSent flag is reset if needed. If the notification was already sent, we set this value to False, effectively resetting it for the next iteration of the loop.

Otherwise, if the notification has not been sent, we’ll calculate the totalSeconds the refrigerator was open (Lines 131 and 132). We’ll also record the date the door was opened (Line 133).

Our msg string is populated with these values (Lines 136-138).

Then the video writer is released and the message and video are sent (Lines 142-147).

Our final block finishes out the loop and performs cleanup:

	# check to see if we should write the frame to disk
	if writer is not None:
		writer.write(frame)

# check to see if we need to release the video writer pointer
if writer is not None:
	writer.release()

# cleanup the camera and close any open windows
cv2.destroyAllWindows()
vs.stop()

To finish the loop, we’ll write the frame to the video writer object and then go back to the top to grab the next frame.

When the loop exits, the writer is released and the video stream is stopped.

Great job! You made it through a simple IoT project using a Raspberry Pi and camera.

It’s now time to place the bait. I know my thief likes hummus as much as I do, so I ran to the store and came back to put it in the fridge.

### RPi security camera results

Figure 6: My refrigerator is armed with an Internet of Things (IoT) Raspberry Pi, PiCamera, and Battery Pack. And of course, I’ve placed some hummus in there for me and the thief. I’ll also know if someone takes a New Belgium Dayblazer beer of mine.

When deploying the Raspberry Pi security camera in your refrigerator to catch the hummus bandit, you’ll need to ensure that it will continue to run without a wireless connection to your laptop.

There are two great options for deployment:

1. Run the computer vision Python script on reboot.
2. Leave a screen session running with the Python computer vision script executing within.

Be sure to visit the first link if you just want your Pi to run the script when you plug in power.
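If you go the reboot route, one common approach is an @reboot entry in the pi user's crontab. This is only a sketch: cron is one of several ways to start on boot, and the project path and virtualenv name below are hypothetical, so adjust them to your setup:

```shell
# open the crontab editor
crontab -e

# then add a single @reboot line (hypothetical paths and env name):
@reboot cd /home/pi/security-cam && /home/pi/.virtualenvs/cv/bin/python detect.py --conf config/config.json
```

Calling the virtualenv's Python interpreter directly sidesteps needing to source your profile or run workon inside cron's minimal environment.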

While this blog post isn’t the right place for a full screen demo, here are the basics:

• Install screen via:
sudo apt-get install screen
• Open an SSH connection to your Pi and run it:
screen
• If the connection from your laptop to your Pi ever dies or is closed, don’t panic! The screen session is still running. You can reconnect by SSH’ing into the Pi again and then running
screen -r
. You’ll be back in your virtual window.
• Keyboard shortcuts for screen:
• “ctrl + a, c”: Creates a new “window”.
• “ctrl + a, p” and “ctrl + a, n”: Cycles through “previous” and “next” windows, respectively.
• For a more in-depth review of screen, see the documentation. Here’s a screen keyboard shortcut cheat sheet.

Once you’re comfortable with starting a script on reboot or working with screen, grab a USB battery pack that can source enough current. Shown in Figure 4, we’re using a RavPower 2200mAh battery pack connected to the Pi power input. The product specs claim to charge an iPhone 6+ times, and it seems to run a Raspberry Pi for about 10 hours (depending on the algorithm) as well.

Go ahead and plug in the battery pack, connect, and deploy the script (if you didn’t set it up to start on boot).

The commands are:

$ screen # wait for screen to start
$ source ~/.profile
$ workon <env_name> # insert the name of your virtual environment
$ python detect.py --conf config/config.json

If you aren’t familiar with command line arguments, please read this tutorial. The command line argument is also required if you are deploying the script upon reboot.

Let’s see it in action!

Figure 7: Me testing the Pi Security Camera notifications with my iPhone.

I’ve included a full demo of the Raspberry Pi security camera below:

### Interested in building more projects with the Raspberry Pi, OpenCV, and computer vision?

Figure 8: Catching a furry little raccoon with an infrared light/camera connected to the Raspberry Pi.

Are you interested in using your Raspberry Pi to build practical, real-world computer vision and deep learning applications, including:

• Computer vision and IoT projects on the Pi
• Servos, PID, and controlling the Pi with computer vision
• Human activity, home surveillance, and facial applications
• Deep learning on the Raspberry Pi
• Fast, efficient deep learning with the Movidius NCS and OpenVINO toolkit
• Self-driving car applications on the Raspberry Pi
• Tips, suggestions, and best practices when performing computer vision and deep learning with the Raspberry Pi

From there I’ll ensure you’re kept in the know on the RPi + Computer Vision book, including updates, behind the scenes looks, and release date information.

## Summary

In this tutorial, you learned how to build a Raspberry Pi security camera from scratch using OpenCV and computer vision.

Specifically, you learned how to:

• Access the Raspberry Pi camera module or USB webcam.
• Setup your Amazon AWS/S3 account so you can upload images/video when your security camera is triggered (other services such as Dropbox, Box, Google Drive, etc. will work as well, provided you can obtain a public-facing URL of the media).
• Obtain Twilio API keys used to send text messages with the uploaded images/video.
• Create a Raspberry Pi security camera using OpenCV and computer vision.

Finally, we put all the pieces together and deployed the security camera to monitor a refrigerator:

• Each time the door was opened we started recording
• After the door was closed the recording stopped
• The recording was then uploaded to the cloud
• And finally, a text message was sent to our phone showing the activity

You can extend the security camera to include other components as well. My first suggestion would be to take a look at how to build a home surveillance system using a Raspberry Pi where we use a more advanced motion detection technique. It would be fun to implement Twilio SMS/MMS notifications into the home surveillance project as well.

I hope you enjoyed this tutorial!

The post Building a Raspberry Pi security camera with OpenCV appeared first on PyImageSearch.

### Feature Reduction using Genetic Algorithm with Python

This tutorial discusses how to use the genetic algorithm (GA) for reducing the feature vector extracted from the Fruits360 dataset in Python mainly using NumPy and Sklearn.

### Mister P for surveys in epidemiology — using Stan!

Jon Zelner points us to this new article in the American Journal of Epidemiology, “Multilevel Regression and Poststratification: A Modelling Approach to Estimating Population Quantities From Highly Selected Survey Samples,” by Marnie Downes, Lyle Gurrin, Dallas English, Jane Pirkis, Dianne Currier, Matthew Spittal, and John Carlin, which begins:

Large-scale population health studies face increasing difficulties in recruiting representative samples of participants. Non-participation, item non-response and attrition, when follow-up is involved, often result in highly selected samples even in well-designed studies. We aimed to assess the potential value of multilevel regression and poststratification, a method previously used to successfully forecast US presidential election results, for addressing biases due to non-participation in the estimation of population descriptive quantities in large cohort studies. The investigation was performed as an extensive case study using a large national health survey of Australian males, the Ten to Men study. Analyses were performed in the Bayesian computational package RStan. Results showed greater consistency and precision across population subsets of varying sizes, when compared with estimates obtained using conventional survey sampling weights. Estimates for smaller population subsets exhibited a greater degree of shrinkage towards the national estimate. Multilevel regression and poststratification provides a promising analytic approach to addressing potential participation bias in the estimation of population descriptive quantities from large-scale health surveys and cohort studies.

It makes me so happy to see our methods used in new problems like this!

I’ve been dealing with all sorts of crap during the past week or so, so it’s good to be reminded of how our work can make a difference.

### quantmod_0.4-14 on CRAN

I just pushed a new release of quantmod to CRAN! I’m most excited about the update to getSymbols() so it doesn’t throw an error and stop processing if there’s a problem with one ticker symbol. Now getSymbols() will import all the data it can, and provide an informative error message for any ticker symbols it could not import.

Another cool feature is that getQuote() can now import quotes from Tiingo. But don’t thank me; thank Ethan Smith for the feature request [#247] and pull request [#250].

There are also several bug fixes in this release. The most noticeable are fixes to getDividends() and getSplits(). Yahoo! Finance continues to have stability issues. Now getDividends() returns raw dividends instead of split-adjusted dividends (thanks to Douglas Barnard for the report [#253]), and getSplits() returns the actual split adjustment ratio instead of the inverse (e.g., now 1/2 instead of 2/1). I suggest using a different data provider; see my post Yahoo! Finance Alternatives for some suggestions.

See the news file for the other bug fixes. Please let me know what you think about these changes.  I need your feedback and input to make quantmod even better!


### Four short links: 25 March 2019

Hiring for Neurodiversity, Reprogrammable Molecular Computing, Retro UUCP, and Industrial Go

1. Dell's Neurodiversity Program -- excellent work from Dell making themselves an attractive destination for folks on the autistic spectrum.
2. Reprogrammable Molecular Computing System (Caltech) -- The researchers were able to experimentally demonstrate 6-bit molecular algorithms for a diverse set of tasks. In mathematics, their circuits tested inputs to assess if they were multiples of three, performed equality checks, and counted to 63. Other circuits drew "pictures" on the DNA "scarves," such as a zigzag, a double helix, and irregularly spaced diamonds. Probabilistic behaviors were also demonstrated, including random walks as well as a clever algorithm (originally developed by computer pioneer John von Neumann) for obtaining a fair 50/50 random choice from a biased coin. Paper.
3. Dataforge UUCP -- it's like Cory Doctorow guestwrote our timeline: UUCP over SSH to give decentralized comms for freedom fighters.
4. Go for Industrial Programming (Peter Bourgon) -- I’m speaking today about programming in an industrial context. By that I mean: in a startup or corporate environment; within a team where engineers come and go; on code that outlives any single engineer; and serving highly mutable business requirements. [...] I’ve tried to select for areas that have routinely tripped up new and intermediate Gophers in organizations I’ve been a part of, and particularly those things that may have nonobvious or subtle implications. (via ceej)

### Data for 200M traffic stop records

The Stanford Open Policing Project just released a dataset for police traffic stops across the country:

Currently, a comprehensive, national repository detailing interactions between police and the public doesn’t exist. That’s why the Stanford Open Policing Project is collecting and standardizing data on vehicle and pedestrian stops from law enforcement departments across the country — and we’re making that information freely available. We’ve already gathered over 200 million records from dozens of state and local police departments across the country.

You can download the data as CSV or RDS, and there are fields for stop date, stop time, location, driver demographics, and reasons for the stop. As you might imagine, the data from various municipalities comes at varying degrees of detail and timespans. I imagine there’s a lot to learn here both from the data and from working with the data.


### Distilled News

Evaluating machine learning models for bias is becoming an increasingly common focus for different industries and data researchers. Model Fairness is a relatively new subfield in Machine Learning. In the past, the study of discrimination emerged from analyzing human-driven decisions and the rationale behind those decisions. Since we started to rely on predictive ML models to make decisions for different industries such as insurance and banking, we need to implement strategies to ensure the fairness of those models and detect any discriminative behaviour during predictions.
Data generators help us create data with different distributions and profiles to experiment on. If you are testing various algorithms available to you and you want to find which one works in what cases, then these data generators can help you generate case specific data and then test the algorithm. For example you want to check whether gradient boosting trees can do well given just 100 data-points and 2 features? Now either you can search for a 100 data-points dataset, or you can use your own dataset that you are working on. But how would you know if the classifier was a good choice, given that you have so less data and doing cross validation and testing still leaves fair chance of overfitting? Or rather you could use generated data and see what usually works well for such a case, a boosting algorithm or a linear model.
The Neural Compute Stick, by Intel, is able to accelerate Tensorflow neural network inferences on the edge, improving performances by 10x factor.
1. Convert a Tensorflow model to NCS compatible one, using OpenVINO Toolkit by Intel
2. Install a light version of OpenVINO on Raspberry, to run inferences onboard
3. Test and deploy the converted model on Raspberry
» Many organizations have not seen return on their investment after developing their data and AI capabilities.
» It’s imperative to account for all of the phases of an AI solution life-cycle. Find the right business problems to solve in the Ideation phase, discover if there is a viable business model during an Experimentation phase, and scale up in an Industrialization phase.
» Actively involving the business in every step of the process and putting them in the driver’s seat is a critical element to success with data and AI.
» The analytics translator enables the execution of your company’s AI strategy by finding the right use cases, liaising between business and data experts, and embedding AI solutions into your organization.
» To be successful, an analytics translator needs deep business und
For those who had academic writing, summarization – the task of producing a concise and fluent summary while preserving key information content and overall meaning – was if not a nightmare, then a constant challenge close to guesswork to detect what the professor would find important. Though the basic idea looks simple: find the gist, cut off all opinions and detail, and write a couple of perfect sentences, the task inevitably ended up in toil and turmoil.
Recently I wanted to learn something new and challenged myself to carry out an end-to-end Market Basket Analysis. To continue to challenge myself, I’ve decided to put the results of my efforts before the eyes of the data science community. And what better forum for my first ever series of posts than one of my favourite data science blogs!
I am collecting here some notes on testing in R. There seems to be a general (false) impression among non-R-core developers that, to run tests, R package developers need a test management system such as RUnit or testthat, and a further false impression that testthat is the only R test management system. Neither is true: R itself has a capable testing facility in 'R CMD check' (a command that triggers R's checks from outside any integrated development environment).
Determining the number of clusters when performing unsupervised clustering is a tricky problem. Many data sets don't exhibit well-separated clusters, and two human beings asked to tell the number of clusters by looking at a chart are likely to give two different answers. Sometimes clusters overlap with each other, and large clusters contain sub-clusters, making the decision difficult. For instance, how many clusters do you see in the picture below? What is the optimum number of clusters? No one can tell with certainty: not AI, not a human being, not an algorithm.
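One common, admittedly imperfect, heuristic is to run k-means for several values of k and look for an "elbow" in the within-cluster sum of squares. A minimal NumPy sketch (plain Lloyd's algorithm with a simple deterministic initialization; the three-blob data is an illustrative assumption, not from the post):

```python
import numpy as np

def kmeans_wss(X, k, n_iter=50):
    """Plain Lloyd's algorithm with evenly spaced initial centers;
    returns the within-cluster sum of squares (WSS)."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d.min(axis=1).sum())

# Three well-separated blobs; the WSS curve has its "elbow" at k = 3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2))
               for m in ([0, 0], [5, 0], [0, 5])])
wss = [kmeans_wss(X, k) for k in range(1, 6)]
print([round(w, 1) for w in wss])
```

On data this clean the elbow is obvious; on the overlapping, sub-clustered data the post describes, the curve flattens gradually and the heuristic stops giving a single defensible answer, which is exactly the point.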
We are happy to introduce the vistributions package, a set of tools for visually exploring probability distributions.
Here is a slightly refreshed translation of my 2015 blog post, originally published on the Russian blogging platform habr.com. The post shows how to organize a personal academic library of unlimited size for free. It is a funny case of a self-written manual that I came back to multiple times myself, and referred friends to many more times – even non-Russian speakers, who had to use Google Translate and infer the rest from the screenshots. Finally, I decided to translate it, adding some basic information on how to use Zotero with rmarkdown.
Variance decomposition and price segmentation in Insurance
Each time we have a case study in my actuarial courses (with real data), students are surprised to have a hard time getting a 'good' model, and they are always surprised to get a low AUC when trying to model the probability of claiming a loss, dying, committing fraud, etc. And each time, I keep saying, 'yes, I know, and that's what we expect, because there is a lot of randomness in insurance'. To be more specific, I decided to run some simulations and compute AUCs to see what's going on. And because I don't want to waste time fitting models, we will assume each time that we have a perfect model. I want to show that the upper bound of the AUC is actually quite low! So it's not a modeling issue; it is a fundamental issue in insurance!
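To see why even a perfect model yields a modest AUC, one can simulate outcomes from known claim probabilities and score each policy with those very probabilities. A NumPy sketch (the 2%-10% probability range is an illustrative assumption, not from the post):

```python
import numpy as np

def auc(scores, y):
    """AUC as the Mann-Whitney statistic: the probability that a random
    positive is ranked above a random negative (ties count 1/2)."""
    pos, neg = scores[y == 1], scores[y == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

rng = np.random.default_rng(0)
n = 5000
p = rng.uniform(0.02, 0.10, size=n)          # true claim probabilities
y = (rng.uniform(size=n) < p).astype(int)    # observed claims
print(round(auc(p, y), 3))  # well below 1, even though the model is perfect
```

Because the outcomes are genuinely random given the probabilities, even the oracle that knows every p exactly cannot separate claimants from non-claimants cleanly, so the AUC upper bound sits far below 1.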
1. What is a Deep Learning Framework?
2. TensorFlow
3. Keras
4. PyTorch
5. Caffe
6. Deeplearning4j
7. Comparing these Deep Learning Frameworks
Based on a geocoordinate problem posed on stackoverflow, I implemented solutions utilizing Numba: 500x faster on multiple cores, 7500x faster on GPU (RTX 2070)
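For reference, the kernel at the heart of such geocoordinate problems is the haversine great-circle distance. A plain NumPy version is sketched below; the speedups in the post come from compiling a kernel like this with Numba's `@njit` (and a CUDA variant for the GPU), which this sketch deliberately does not require:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in km between points given in degrees.
    Vectorized over NumPy arrays; this is the kind of tight numeric
    kernel that Numba can compile for multi-core or GPU execution."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# A quarter of the equator is about 10,007.5 km.
print(round(float(haversine_km(0.0, 0.0, 0.0, 90.0)), 1))
```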
Consider placing an AI bot on a blockchain and initiating a phase of deep learning: what would the end result be? Would it be detrimental to the survival of the human race, or would it lead to a never-ending loop that removes third parties from transactions, making it easier for everyone to procure goods and services? In theory, blockchain and AI fuse to create a foundation that can foster change in current methods of transaction. Adoption has been sluggish, however, and the determinants of the adoption rate have more to do with human adaptability within the financial culture, along with the complexity of conventional ways of transacting.

### Document worth reading: “Learning Deep Representations for Semantic Image Parsing: a Comprehensive Overview”

Semantic image parsing, which refers to the process of decomposing images into semantic regions and constructing the structure representation of the input, has recently aroused widespread interest in the field of computer vision. The recent application of deep representation learning has driven this field into a new stage of development. In this paper, we summarize three aspects of the progress of research on semantic image parsing, i.e., category-level semantic segmentation, instance-level semantic segmentation, and beyond segmentation. Specifically, we first review the general frameworks for each task and introduce the relevant variants. The advantages and limitations of each method are also discussed. Moreover, we present a comprehensive comparison of different benchmark datasets and evaluation metrics. Finally, we explore the future trends and challenges of semantic image parsing. Learning Deep Representations for Semantic Image Parsing: a Comprehensive Overview

### Book Memo: “Hands-On Unsupervised Learning Using Python”

How to Build Applied Machine Learning Solutions from Unlabeled Data. Many industry experts consider unsupervised learning the next AI frontier, one that may hold the key to general artificial intelligence. Armed with the conceptual knowledge in this book, data scientists and machine learning practitioners will learn hands-on how to apply unsupervised learning to large unlabeled datasets using Python tools. You'll uncover hidden patterns, gain deeper business insight, detect anomalies, perform automatic feature engineering and selection, and generate synthetic datasets. Author Ankur Patel, an applied machine-learning researcher and data scientist with expertise in financial markets, provides the concepts, intuition, and tools necessary for you to apply this technology to the problems you tackle every day. Through the course of this book, you'll learn how to build production-ready systems with Python.

### Play with the cyphr package

(This article was first published on Shige's Research Blog, and kindly contributed to R-bloggers)

The cyphr package seems to be a good choice for a small research group that shares sensitive data over the internet (e.g., Dropbox). I did a simple experiment myself and made sure it can actually serve my purpose.

I ran my experiment on two computers (using openssl): I created the test data on my Linux workstation running Manjaro, then tried to access the data on a Windows 7 laptop.

For creating the data (Linux workstation):

library(cyphr)

# Create the test data

data_dir <- file.path("~/Dropbox/temp_files", "data")
dir.create(data_dir)
dir(data_dir)

# Encrypt the test data
# (first-time setup also requires cyphr::data_admin_init(data_dir))

key <- cyphr::data_key(data_dir)

filename <- file.path(data_dir, "iris.rds")

cyphr::encrypt(saveRDS(iris, filename), key)
dir(data_dir)

# Cannot read the data without decrypting it
readRDS(filename)  # fails: the file on disk is encrypted

# Read the decrypted version of the data
d <- cyphr::decrypt(readRDS(filename), key)

For accessing the data (Windows laptop):

library(cyphr)

key <- data_key("C:/Users/Ssong/Dropbox/temp_files/data", path_user = "C:/Users/Ssong/.ssh")

# Make data access request

data_request_access("C:/Users/Ssong/Dropbox/temp_files/data",
    path_user = "C:/Users/Ssong/.ssh")

On Windows 7, the system cannot locate the public key located in "~/.ssh", which is pretty dumb.

Going back to the Linux workstation to approve the data access request:

# Review the request and approve (to share with other users)
cyphr::data_admin_authorise("~/Dropbox/temp_files/data")  # prompts to approve pending requests

Now I can access the data on my Windows laptop:

key <- data_key("C:/Users/Ssong/Dropbox/temp_files/data", path_user = "C:/Users/Ssong/.ssh")

d <- decrypt(readRDS("C:/Users/Ssong/Dropbox/temp_files/data/iris.rds"), key)


### Jonathan (another one) does Veronica Geng does Robert Mueller

Frequent commenter Jonathan (another one) writes:

I realize that so many people bitch about the seminar showdown that you might need at least one thank-you. This year, I managed to re-read the bulk of Geng, and for that I thank you. I have not yet read any Sattouf, but it has clearly made an impression on you, so it's on my list.

In thanks, here is my first brief foray into pseudo-Gengiana. I think I've got the tone roughly right, but I'm way short on whimsy; this is what I managed in a sustained fifteen-minute effort. Thanks again.

My fellow Americans:

As you are no doubt aware, I have completed my investigation and report. I write to inform you of an unfortunate mishap on Friday. Many news outlets have reported that my final report was taken by a security guard from my offices to the Justice Department. That is not true. In an attempt to maintain my obsessive secrecy, that was a dummy report, actually containing the text of an unpublished novel by David Foster Wallace that we found in Michael Cohen's safe. We couldn't understand it—maybe Bill Barr will have better luck.

The real one was handed to my intern, Jeff, in an ordinary interoffice envelope, and Jeff was told to drop it off at Justice on his way home. He lives nearby with six other interns. Not knowing what he had, he stopped off at the Friday Trivia Happy Hour at the Death and Taxes Pub, drank a little too much, and left the report there. We’ve gone back to look and nobody can find it.

So why not just print out another one? Or for that matter, why didn’t I just email the first report? As you’ve no doubt gleaned by now, computers and email aren’t my thing. As my successor at the FBI, Mr. Comey, demonstrated, email baffles just about all of us. And I don’t use a computer. So there isn’t another copy of the real report. I’ve got all my notes, though, so I ought to be able to cobble together a new report in a couple of months.

Apologies for the delay,
Robert Mueller

PS: Jeff has been chastised. We haven’t fired him, but in asking him about this he let slip that his parents didn’t pay taxes on the nanny who raised him and they may have strongly implied that he played on a high school curling team to get into college. His parents are going to jail and the nanny’s immigration status is being investigated. This requires a short re-opening of the investigation.

The mention of “Jeff” seems particularly Geng-like to me. Perhaps I’m reminded of “Ed.” Thinking of Geng makes me a bit sad, though, not just for her but because it reminds me of the passage of time. I associate Geng, Bill James, and Spy magazine with the mid-1980s. Ahhh, lost youth!

### R Packages worth a look

Robust Re-Scaling to Better Recover Latent Effects in Data (rrscale)
Non-linear transformations of data to better discover latent effects. Applies a sequence of three transformations (1) a Gaussianizing transformation, ( …

Create Reproducible Research Projects (rosr)
Creates reproducible academic projects with integrated academic elements, including datasets, references, codes, images, manuscripts, dissertations, sl …

Group Sequential Design Class for Clinical Trials (seqmon)
S4 class object for creating and managing group sequential designs. It calculates the efficacy and futility boundaries at each look. It allows modifyin …

Tools for Managing SSH and Git Credentials (credentials)
Setup and retrieve HTTPS and SSH credentials for use with ‘git’ and other services. For HTTPS remotes the package interfaces the ‘git-credential’ utili …

Clustering and Classification Inference with U-Statistics (uclust)
Clustering and classification inference for high dimension low sample size (HDLSS) data with U-statistics. The package contains implementations of nonp …

Portfolio Safeguard: Optimization, Statistics and Risk Management (PSGExpress)
Solves optimization, advanced statistics, and risk management problems. Popular nonlinear functions in financial, statistical, and logistics applicatio …

### Male journalists dominate the news

Two-thirds of bylines in American reporting credit men

### Summer Interns 2019

We received almost 400 applications for our 2019 internship program from students with very diverse backgrounds. After interviewing several dozen people and making some very difficult decisions, we are pleased to announce that these twelve interns have accepted positions with us for this summer:

• Therese Anders: Calibrated Peer Review. Prototype tools to conduct experiments to see whether calibrated peer review is a useful and feasible feedback strategy in introductory data science classes and industry workshops. (mentor: Mine Çetinkaya-Rundel)

• Malcolm Barrett: R Markdown Enhancements. Tidy up and refactor the R Markdown code base. (mentor: Rich Iannone)

• Julia Blum: RStudio Community Sustainability. Study community.rstudio.com, enhance documentation and processes, and onboard new users. (mentor: Curtis Kephart)

• Joyce Cahoon: Object Scrubbers. Help write a set of methods to scrub different types of objects to reduce their size on disk. (mentors: Max Kuhn and Davis Vaughan)

• Daniel Chen: Grader Enhancements. Enhance [grader](https://github.com/rstudio-education/grader) to identify students’ mistakes when doing automated tutorials. (mentor: Garrett Grolemund)

• Marly Cormar: Production Testing Tools for Data Science Pipelines. Build on applicability domain methods from computational chemistry to create functions that can be included in a dplyr pipeline to perform statistical checks on data in production. (mentor: Max Kuhn)

• Desiree De Leon: Teaching and Learning with RStudio. Create a one-stop guide to teaching with RStudio similar to Teaching and Learning with Jupyter. (mentor: Alison Hill)

• Dewey Dunnington: ggplot2 Enhancements. Contribute to ggplot2 or an associated package (like scales) by writing R code for graphics and helping to manage a large, popular open source project. (mentor: Hadley Wickham)

• Maya Gans: Tidy Blocks. Prototype and evaluate a block-based version of the tidyverse so that young students can do simple analysis using an interface like Scratch. (mentor: Greg Wilson)

• Leslie Huang: Shiny Enhancements. Enhance Shiny’s UI, improve performance bottlenecks, fix bugs, and create a set of higher-order reactives for more sophisticated programming. (mentor: Barret Schloerke)

• Grace Lawley: Tidy Practice. Develop practice projects so learners can practice tidyverse skills using interesting real-world data. (mentor: Alison Hill)

• Yim Register: Data Science Training for Software Engineers. Develop course materials to teach basic data analysis to programmers using software engineering problems and data sets. (mentor: Greg Wilson)

We are very excited to welcome them all to the RStudio family, and we hope you’ll enjoy following their progress over the summer.

## March 24, 2019

### If you did not already know

Relational Forward Model (RFM)
The behavioral dynamics of multi-agent systems have a rich and orderly structure, which can be leveraged to understand these systems, and to improve how artificial agents learn to operate in them. Here we introduce Relational Forward Models (RFM) for multi-agent learning, networks that can learn to make accurate predictions of agents’ future behavior in multi-agent environments. Because these models operate on the discrete entities and relations present in the environment, they produce interpretable intermediate representations which offer insights into what drives agents’ behavior, and what events mediate the intensity and valence of social interactions. Furthermore, we show that embedding RFM modules inside agents results in faster learning systems compared to non-augmented baselines. As more and more of the autonomous systems we develop and interact with become multi-agent in nature, developing richer analysis tools for characterizing how and why agents make decisions is increasingly necessary. Moreover, developing artificial agents that quickly and safely learn to coordinate with one another, and with humans in shared environments, is crucial. …

Auto-Encoding Variational Bayes
How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contribution is two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
GitXiv
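The reparameterization at the heart of the paper fits in a few lines: writing z = mu + sigma * eps turns the gradient of an expectation into an expectation of a gradient, estimable by plain Monte Carlo. A toy NumPy illustration (f(z) = z² is an arbitrary choice for which the true gradient, 2 * mu, is known in closed form):

```python
import numpy as np

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1),
# so the gradient of E[f(z)] w.r.t. mu becomes E[f'(z)], a plain
# Monte Carlo average over samples of eps.
rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
eps = rng.standard_normal(200_000)
z = mu + sigma * eps           # differentiable in mu and sigma
grad_mu = (2 * z).mean()       # pathwise estimator of d/dmu E[z**2]
print(round(float(grad_mu), 2))  # close to 2 * mu = 3.0
```

Because the randomness lives in eps rather than in z's own distribution, the same trick lets stochastic gradient methods flow through the sampling step, which is what makes the variational lower bound optimizable.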

CrossE
Knowledge graph embedding aims to learn distributed representations for entities and relations, and is proven to be effective in many applications. Crossover interactions — bi-directional effects between entities and relations — help select related information when predicting a new triple, but haven’t been formally discussed before. In this paper, we propose CrossE, a novel knowledge graph embedding which explicitly simulates crossover interactions. It not only learns one general embedding for each entity and relation as most previous methods do, but also generates multiple triple specific embeddings for both of them, named interaction embeddings. We evaluate embeddings on typical link prediction tasks and find that CrossE achieves state-of-the-art results on complex and more challenging datasets. Furthermore, we evaluate embeddings from a new perspective — giving explanations for predicted triples, which is important for real applications. In this work, an explanation for a triple is regarded as a reliable closed-path between the head and the tail entity. Compared to other baselines, we show experimentally that CrossE, benefiting from interaction embeddings, is more capable of generating reliable explanations to support its predictions. …

### Whats new on arXiv

Recent trends in neural-network-based text-to-speech/speech-synthesis pipelines have employed recurrent Seq2seq architectures that can synthesize realistic-sounding speech directly from text characters. These systems, however, have complex architectures and take a substantial amount of time to train. We introduce several modifications to these Seq2seq architectures that allow for faster training while also reducing the complexity of the model architecture. We show that our proposed model achieves attention alignment much faster than previous architectures and that good audio quality can be achieved with a model that is much smaller in size. Sample audio available at https://…/tts-samples-for-cmpt-419.
Collecting, analyzing and gaining insight from large volumes of data is now the norm in an ever increasing number of industries. Data analytics techniques, such as machine learning, are powerful tools used to analyze these large volumes of data. Synthetic data sets are routinely relied upon to train and develop such data analytics methods for several reasons: to generate larger data sets than are available, to generate diverse data sets, to preserve anonymity in data sets with sensitive information, etc. Processing, transmitting and storing data is a key issue faced when handling large data sets. This paper presents an ‘On the fly’ framework for generating big synthetic data sets, suitable for these data analytics methods, that is both computationally efficient and applicable to a diverse set of problems. An example application of the proposed framework is presented along with a mathematical analysis of its computational efficiency, demonstrating its effectiveness.
Self-organization can be broadly defined as the ability of a system to display ordered spatio-temporal patterns solely as the result of the interactions among the system components. Processes of this kind characterize both living and artificial systems, making self-organization a concept that is at the basis of several disciplines, from physics to biology to engineering. Placed at the frontiers between disciplines, Artificial Life (ALife) has heavily borrowed concepts and tools from the study of self-organization, providing mechanistic interpretations of life-like phenomena as well as useful constructivist approaches to artificial system design. Despite its broad usage within ALife, the concept of self-organization has been often excessively stretched or misinterpreted, calling for a clarification that could help with tracing the borders between what can and cannot be considered self-organization. In this review, we discuss the fundamental aspects of self-organization and list the main usages within three primary ALife domains, namely ‘soft’ (mathematical/computational modeling), ‘hard’ (physical robots), and ‘wet’ (chemical/biological systems) ALife. Finally, we discuss the usefulness of self-organization within ALife studies, point to perspectives for future research, and list open questions.
The dying ReLU refers to the problem when ReLU neurons become inactive and only output 0 for any input. There are many empirical and heuristic explanations on why ReLU neurons die. However, little is known about its theoretical analysis. In this paper, we rigorously prove that a deep ReLU network will eventually die in probability as the depth goes to infinite. Several methods have been proposed to alleviate the dying ReLU. Perhaps, one of the simplest treatments is to modify the initialization procedure. One common way of initializing weights and biases uses symmetric probability distributions, which suffers from the dying ReLU. We thus propose a new initialization procedure, namely, a randomized asymmetric initialization. We prove that the new initialization can effectively prevent the dying ReLU. All parameters required for the new initialization are theoretically designed. Numerical examples are provided to demonstrate the effectiveness of the new initialization procedure.
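The phenomenon is easy to simulate: forward a batch of random inputs through a deep, randomly initialized ReLU network and count the output units that are zero for every input. A hedged NumPy sketch (the width, depth, and He-style symmetric initialization are illustrative choices, not the paper's exact setup):

```python
import numpy as np

def frac_dead(depth, width=10, n_inputs=200, seed=0):
    """Forward a batch through a deep random ReLU net (symmetric,
    He-style N(0, 2/width) init, zero biases) and return the fraction
    of output units that are 0 for every input, i.e. 'dead'."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_inputs, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        h = np.maximum(h @ W, 0.0)
    return float((h.max(axis=0) == 0.0).mean())

for depth in (2, 10, 50):
    print(depth, frac_dead(depth))
```

The narrow extreme makes the paper's claim vivid: with width 1, a single non-positive weight anywhere in the stack kills the whole network, so a deep enough net dies with probability approaching 1.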
Weather prediction today is performed with numerical weather prediction (NWP) models. These are deterministic simulation models describing the dynamics of the atmosphere, and evolving the current conditions forward in time to obtain a prediction for future atmospheric states. To account for uncertainty in NWP models it has become common practice to employ ensembles of NWP forecasts. However, NWP ensembles often exhibit forecast biases and dispersion errors, thus require statistical postprocessing to improve reliability of the ensemble forecasts. This work proposes an extension of a recently developed postprocessing model utilizing autoregressive information present in the forecast error of the raw ensemble members. The original approach is modified to let the variance parameter depend on the ensemble spread, yielding a two-fold heteroscedastic model. Furthermore, an additional high-resolution forecast is included into the postprocessing model, yielding improved predictive performance. Finally, it is outlined how the autoregressive model can be utilized to postprocess ensemble forecasts with higher forecast horizons, without the necessity of making fundamental changes to the original model. We accompany the new methodology by an implementation within the R package ensAR to make our method available for other researchers working in this area. To illustrate the performance of the heteroscedastic extension of the autoregressive model, and its use for higher forecast horizons we present a case study for a data set containing 12 years of temperature forecasts and observations over Germany. The case study indicates that the autoregressive model yields particularly strong improvements for forecast horizons beyond 24 hours.
In the present scenario of domestic flights in the USA, there have been numerous instances of flight delays and cancellations. American Airlines, Inc. (AA) has been one of the most trusted airlines in the United States and the world's largest in terms of number of destinations served. But when it comes to domestic flights, AA has not lived up to expectations in terms of punctuality or on-time performance. Flight delays also cause airline companies operating commercial flights to incur huge losses, so they are trying their best to prevent or avoid delays and cancellations by taking certain measures. This study analyzes flight information for US domestic flights operated by American Airlines, covering the top 5 busiest airports in the US, and predicts possible arrival delays using data mining and machine learning approaches. A Gradient Boosting Classifier model is trained and hyper-parameter tuned, achieving a maximum accuracy of 85.73%. Such an intelligent system is very useful in foretelling flights' on-time performance.
Deep neural networks are widely used for nonlinear function approximation with applications ranging from computer vision to control. Although these networks involve the composition of simple arithmetic operations, it can be very challenging to verify whether a particular network satisfies certain input-output properties. This article surveys methods that have emerged recently for soundly verifying such properties. These methods borrow insights from reachability analysis, optimization, and search. We discuss fundamental differences and connections between existing algorithms. In addition, we provide pedagogical implementations of existing methods and compare them on a set of benchmark problems.
Principal Component Analysis (PCA) is one of the most important methods to handle high dimensional data. However, most of the studies on PCA aim to minimize the loss after projection, which usually measures the Euclidean distance, though in some fields, angle distance is known to be more important and critical for analysis. In this paper, we propose a method by adding constraints on factors to unify the Euclidean distance and angle distance. However, due to the nonconvexity of the objective and constraints, the optimized solution is not easy to obtain. We propose an alternating linearized minimization method to solve it with provable convergence rate and guarantee. Experiments on synthetic data and real-world datasets have validated the effectiveness of our method and demonstrated its advantages over state-of-art clustering methods.
In multi-person videos, especially team sport videos, a semantic event is usually represented as a confrontation between two teams of players, which can be represented as collective motion. In broadcast basketball videos, specific camera motions are used to present specific events. Therefore, a semantic event in broadcast basketball videos is closely related to both the global motion (camera motion) and the collective motion. A semantic event in basketball videos can be generally divided into three stages: pre-event, event occurrence (event-occ), and post-event. In this paper, we propose an ontology-based global and collective motion pattern (On_GCMP) algorithm for basketball event classification. First, a two-stage GCMP based event classification scheme is proposed. The GCMP is extracted using optical flow. The two-stage scheme progressively combines a five-class event classification algorithm on event-occs and a two-class event classification algorithm on pre-events. Both algorithms utilize sequential convolutional neural networks (CNNs) and long short-term memory (LSTM) networks to extract the spatial and temporal features of GCMP for event classification. Second, we utilize post-event segments to predict success/failure using deep features of images in the video frames (RGB_DF_VF) based algorithms. Finally the event classification results and success/failure classification results are integrated to obtain the final results. To evaluate the proposed scheme, we collected a new dataset called NCAA+, which is automatically obtained from the NCAA dataset by extending the fixed length of video clips forward and backward of the corresponding semantic events. The experimental results demonstrate that the proposed scheme achieves the mean average precision of 59.22% on NCAA+. It is higher by 7.62% than state-of-the-art on NCAA.
We consider a cognitive radio-based Internet-of-Things (CR-IoT) network consisting of one primary IoT (PIoT) system and one secondary IoT (SIoT) system. The IoT devices of both the PIoT and the SIoT respectively monitor one physical process and send randomly generated status updates to their associated access points (APs). The timeliness of the status updates is important as the systems are interested in the latest condition (e.g., temperature, speed and position) of the IoT device. In this context, two natural questions arise: (1) How to characterize the timeliness of the status updates in CR-IoT systems? (2) Which scheme, overlay or underlay, is better in terms of the timeliness of the status updates. To answer these two questions, we adopt a new performance metric, named the age of information (AoI). We analyze the average peak AoI of the PIoT and the SIoT for overlay and underlay schemes, respectively. Simple asymptotic expressions of the average peak AoI are also derived when the PIoT operates at high signal-to-noise ratio (SNR). Based on the asymptotic expressions, we characterize a critical generation rate of the PIoT system, which can determine the superiority of overlay and underlay schemes in terms of the average peak AoI of the SIoT. Numerical results validate the theoretical analysis and uncover that the overlay and underlay schemes can outperform each other in terms of the average peak AoI of the SIoT for different system setups.
Many Natural Language Processing works on emotion analysis only focus on simple emotion classification without exploring the potentials of putting emotion into ‘event context’, and ignore the analysis of emotion-related events. One main reason is the lack of this kind of corpus. Here we present Cause-Emotion-Action Corpus, which manually annotates not only emotion, but also cause events and action events. We propose two new tasks based on the data-set: emotion causality and emotion inference. The first task is to extract a triple (cause, emotion, action). The second task is to infer the probable emotion. We are currently releasing the data-set with 10,603 samples and 15,892 events, basic statistic analysis and baseline on both emotion causality and emotion inference tasks. Baseline performance demonstrates that there is much room for both tasks to be improved.
Ranking models lie at the heart of research on information retrieval (IR). During the past decades, different techniques have been proposed for constructing ranking models, from traditional heuristic methods, probabilistic methods, to modern machine learning methods. Recently, with the advance of deep learning technology, we have witnessed a growing body of work in applying shallow or deep neural networks to the ranking problem in IR, referred to as neural ranking models in this paper. The power of neural ranking models lies in the ability to learn from the raw text inputs for the ranking problem to avoid many limitations of hand-crafted features. Neural networks have sufficient capacity to model complicated tasks, which is needed to handle the complexity of relevance estimation in ranking. Since there have been a large variety of neural ranking models proposed, we believe it is the right time to summarize the current status, learn from existing methodologies, and gain some insights for future development. In contrast to existing reviews, in this survey, we will take a deep look into the neural ranking models from different dimensions to analyze their underlying assumptions, major design principles, and learning strategies. We compare these models through benchmark tasks to obtain a comprehensive empirical understanding of the existing techniques. We will also discuss what is missing in the current literature and what are the promising and desired future directions.
Unlike conventional frame-based sensors, event-based visual sensors output information through spikes at a high temporal resolution. By only encoding changes in pixel intensity, they showcase a low-power consuming, low-latency approach to visual information sensing. To use this information for higher sensory tasks like object recognition and tracking, an essential simplification step is the extraction and learning of features. An ideal feature descriptor must be robust to changes involving (i) local transformations and (ii) re-appearances of a local event pattern. To that end, we propose a novel spatiotemporal feature representation learning algorithm based on slow feature analysis (SFA). Using SFA, smoothly changing linear projections are learnt which are robust to local visual transformations. In order to determine if the features can learn to be invariant to various visual transformations, feature point tracking tasks are used for evaluation. Extensive experiments across two datasets demonstrate the adaptability of the spatiotemporal feature learner to translation, scaling and rotational transformations of the feature points. More importantly, we find that the obtained feature representations are able to exploit the high temporal resolution of such event-based cameras in generating better feature tracks.
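As a rough illustration of the SFA building block the abstract relies on (the linear case only, on synthetic data, not the authors' event-camera pipeline): whiten the inputs, then keep the projections whose temporal derivative has the least variance.

```python
import numpy as np

def linear_sfa(X, n_features=1):
    """Linear slow feature analysis: find unit-variance projections of
    X (shape T x d) whose outputs change as slowly as possible in time."""
    X = X - X.mean(axis=0)
    # Whiten so that all projections are decorrelated with unit variance.
    eigval, eigvec = np.linalg.eigh(X.T @ X / len(X))
    W_white = eigvec / np.sqrt(eigval)
    Z = X @ W_white
    # Slowness = variance of the temporal difference; minimize it.
    dZ = np.diff(Z, axis=0)
    _, dvec = np.linalg.eigh(dZ.T @ dZ / len(dZ))
    W = W_white @ dvec[:, :n_features]   # smallest eigenvalues = slowest
    return X @ W, W

# A slow sine mixed into both channels of a fast oscillation:
# the slowest extracted feature should recover the slow sine.
t = np.linspace(0, 2 * np.pi, 500)
slow, fast = np.sin(t), np.sin(40 * t)
X = np.column_stack([slow + 0.1 * fast, fast + 0.1 * slow])
Y, W = linear_sfa(X, n_features=1)
```

The learned projection is invariant to the fast component, which is the property the paper's spatiotemporal feature learner generalizes to event streams.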
This paper reports our efforts on swCaffe, a highly efficient parallel framework for accelerating deep neural networks (DNNs) training on Sunway TaihuLight, the current fastest supercomputer in the world that adopts a unique many-core heterogeneous architecture, with 40,960 SW26010 processors connected through a customized communication network. First, we point out some insightful principles to fully exploit the performance of the innovative many-core architecture. Second, we propose a set of optimization strategies for redesigning a variety of neural network layers based on Caffe. Third, we put forward a topology-aware parameter synchronization scheme to scale the synchronous Stochastic Gradient Descent (SGD) method to multiple processors efficiently. We evaluate our framework by training a variety of widely used neural networks with the ImageNet dataset. On a single node, swCaffe can achieve 23%–119% overall performance compared with Caffe running on a K40m GPU. As compared with Caffe on CPU, swCaffe runs 3.04–7.84x faster on all the networks. Finally, we present the scalability of swCaffe for the training of ResNet-50 and AlexNet on the scale of 1024 nodes.
We introduce the use of neural networks as classifiers on classical disordered systems with no spatial ordering. In this study, we implement a convolutional neural network trained to identify the spin-glass state in the three-dimensional Edwards-Anderson Ising spin-glass model from an input of Monte Carlo sampled configurations at a given temperature. The neural network is designed to be flexible with the input size and can accurately perform inference over a small sample of the instances in the test set. Using the neural network to classify instances of the three-dimensional Edwards-Anderson Ising spin-glass in a (random) field we show that the inferred phase boundary is consistent with the absence of an Almeida-Thouless line.
Visual Query Answering (VQA) is of great significance in offering people convenience: one can raise a question about details of objects, or about high-level understanding of the scene, over an image. This paper proposes a novel method to address the VQA problem. In contrast to prior works, our method, which targets single-scene VQA, relies on graph-based techniques and involves reasoning. In a nutshell, our approach is centered on three graphs. The first graph, referred to as the inference graph G_I, is constructed via learning over labeled data. The other two graphs, referred to as the query graph Q and the entity-attribute graph G_EA, are generated from the natural language query Q_nl and the image Img, respectively, as issued by users. As G_EA often does not contain sufficient information to answer Q, we develop techniques to infer the missing information of G_EA with G_I. Based on G_EA and Q, we provide techniques to find matches of Q in G_EA as the answer to Q_nl over Img. Unlike commonly used VQA methods based on end-to-end neural networks, our graph-based method has well-designed reasoning capability and is thus highly interpretable. We also create a dataset on soccer matches (Soccer-VQA) with rich annotations. The experimental results show that our approach outperforms the state-of-the-art method and has high potential for future investigation.
Federated learning on edge devices poses new challenges arising from workers that misbehave, privacy needs, etc. We propose a new robust federated optimization algorithm, with provable convergence and robustness under non-IID settings. Empirical results show that the proposed algorithm stabilizes the convergence and tolerates data poisoning on a small number of workers.
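The abstract does not spell out its aggregation rule; as an illustration of the general idea behind robust federated aggregation (not necessarily the paper's algorithm), a coordinate-wise median tolerates a minority of poisoned workers where plain averaging does not.

```python
import numpy as np

def fedavg(updates):
    """Plain federated averaging: a single poisoned worker can drag the
    aggregate arbitrarily far from the honest updates."""
    return np.mean(updates, axis=0)

def robust_aggregate(updates):
    """Coordinate-wise median: unaffected by a minority of arbitrarily
    corrupted updates (an illustrative robust rule, not the paper's)."""
    return np.median(updates, axis=0)

# Nine honest workers reporting updates near the true value 1.0,
# plus one worker poisoning its update.
rng = np.random.default_rng(0)
honest = rng.normal(loc=1.0, scale=0.1, size=(9, 4))
poisoned = np.full((1, 4), 1000.0)
updates = np.vstack([honest, poisoned])
```

Here `robust_aggregate(updates)` stays near 1.0 in every coordinate, while `fedavg(updates)` is pulled far away by the single bad worker.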
A model-based task transfer learning (MBTTL) method is presented. We consider a constrained nonlinear dynamical system and assume that a dataset of state and input pairs that solve a task T1 is available. Our objective is to find a feasible state-feedback policy for a second task, T1, by using stored data from T2. Our approach applies to tasks T2 which are composed of the same subtasks as T1, but in different order. In this paper we formally introduce the definition of subtask, the MBTTL problem and provide examples of MBTTL in the fields of autonomous cars and manipulators. Then, a computationally efficient approach to solve the MBTTL problem is presented along with proofs of feasibility for constrained linear dynamical systems. Simulation results show the effectiveness of the proposed method.
High-dimensional time series are characterized by a large number of measurements and complex dependence, and often involve abrupt change points. We propose a new procedure to detect change points in the mean of high-dimensional time series data. The proposed procedure incorporates the spatial and temporal dependence of the data and is able to test for and estimate change points that occur on the boundary of the time series. We study its asymptotic properties under mild conditions. Simulation studies demonstrate its robust performance through comparison with other existing methods. Our procedure is applied to an fMRI dataset.
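For intuition (the paper's high-dimensional procedure is more involved), the classical single-series mean-change baseline maximizes a standardized CUSUM statistic over candidate split points:

```python
import numpy as np

def detect_mean_change(x):
    """Estimate a single change point in the mean of a 1-D series by
    maximizing the standardized CUSUM statistic over candidate splits
    (a textbook baseline, not the procedure proposed in the paper)."""
    n = len(x)
    csum = np.cumsum(x)
    total = csum[-1]
    best_t, best_stat = None, -np.inf
    for t in range(1, n):
        mean_left = csum[t - 1] / t
        mean_right = (total - csum[t - 1]) / (n - t)
        # The scaling balances the variances of the two sample means.
        stat = np.sqrt(t * (n - t) / n) * abs(mean_left - mean_right)
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat

# 200 points at mean 0, then 100 points at mean 2.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(2, 1, 100)])
t_hat, stat = detect_mean_change(x)
```

The estimate `t_hat` lands near the true break at index 200; the paper's contribution is making this kind of test work jointly across many dependent series and near the series boundary.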
High-dimensional data in many machine learning applications leads to computational and analytical complexities. Feature selection provides an effective way for solving these problems by removing irrelevant and redundant features, thus reducing model complexity and improving accuracy and generalization capability of the model. In this paper, we present a novel teacher-student feature selection (TSFS) method in which a ‘teacher’ (a deep neural network or a complicated dimension reduction method) is first employed to learn the best representation of data in low dimension. Then a ‘student’ network (a simple neural network) is used to perform feature selection by minimizing the reconstruction error of low dimensional representation. Although the teacher-student scheme is not new, to the best of our knowledge, it is the first time that this scheme is employed for feature selection. The proposed TSFS can be used for both supervised and unsupervised feature selection. This method is evaluated on different datasets and is compared with state-of-the-art existing feature selection methods. The results show that TSFS performs better in terms of classification and clustering accuracies and reconstruction error. Moreover, experimental evaluations demonstrate a low degree of sensitivity to parameter selection in the proposed method.
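As a toy illustration of the teacher-student idea (with PCA standing in for the deep 'teacher' and greedy least-squares selection standing in for the 'student' network; both are simplifications, not the paper's method):

```python
import numpy as np

def pca_embedding(X, k):
    """'Teacher': a k-dimensional PCA representation of the data
    (a stand-in for the deep network or other reduction method)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def greedy_feature_selection(X, Z, n_select):
    """'Student': greedily keep the features whose least-squares
    reconstruction of the teacher embedding Z has the lowest error."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        errors = []
        for j in remaining:
            A = X[:, selected + [j]]
            coef, *_ = np.linalg.lstsq(A, Z, rcond=None)
            errors.append(((A @ coef - Z) ** 2).sum())
        j_best = remaining[int(np.argmin(errors))]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

# Two informative features (columns 0 and 1) plus eight low-variance
# noise features: selection should recover columns 0 and 1.
rng = np.random.default_rng(0)
S = rng.normal(size=(300, 2))
X = np.column_stack([S, rng.normal(scale=0.1, size=(300, 8))])
Z = pca_embedding(X, k=2)
selected = greedy_feature_selection(X, Z, n_select=2)
```

The point mirrors the abstract: features are scored by how well they reconstruct the teacher's low-dimensional representation, not by any label.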
In this paper, we present an easy-to-implement asynchronous approximate gradient method called DSPG (Decentralized Simultaneous Perturbation Stochastic Approximations, with Constant Sensitivity Parameters). It is obtained by modifying SPSA (Simultaneous Perturbation Stochastic Approximations) to allow for decentralized optimization in multi-agent learning and distributed control scenarios. SPSA is a popular approximate gradient method developed by Spall that is used in robotics and learning. In the multi-agent learning setup considered herein, the agents are assumed to be asynchronous (agents abide by their local clocks) and to communicate via a wireless medium that is prone to losses and delays. We analyze the gradient estimation bias that arises from setting the sensitivity parameters to a single value, and the bias that arises from communication losses and delays. Specifically, we show that these biases can be countered through better and more frequent communication and/or by choosing a small fixed value for the sensitivity parameters. We also discuss the variance of the gradient estimator and its effect on the rate of convergence. Finally, we present numerical results supporting DSPG and the aforementioned theory and discussion.
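The SPSA estimator at the heart of DSPG is compact: two function evaluations per iteration, whatever the dimension. A minimal single-agent sketch (the decentralized, asynchronous machinery of the paper is not modeled here):

```python
import numpy as np

def spsa_gradient(f, x, c=1e-2, rng=None):
    """Spall's simultaneous-perturbation gradient estimate: perturb all
    coordinates at once along a random Rademacher direction, so only two
    evaluations of f are needed regardless of the dimension of x."""
    rng = rng or np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=x.shape)
    g = (f(x + c * delta) - f(x - c * delta)) / (2.0 * c)
    return g / delta  # elementwise: delta_i appears in each denominator

def f(x):  # simple quadratic objective; the true gradient is 2x
    return float(np.sum(x ** 2))

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 3.0])
# A single estimate is noisy; averaged over many draws it approaches 2x,
# with a bias controlled by the sensitivity parameter c.
est = np.mean([spsa_gradient(f, x, rng=rng) for _ in range(2000)], axis=0)
```

The role of the sensitivity parameter `c` is exactly what the abstract analyzes: fixing it to a small constant trades a small, controllable bias for implementation simplicity in the decentralized setting.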
Time Series Classification (TSC) problems are encountered in many real life data mining tasks ranging from medicine and security to human activity recognition and food safety. With the recent success of deep neural networks in various domains such as computer vision and natural language processing, researchers started adopting these techniques for solving time series data mining problems. However, to the best of our knowledge, no previous work has considered the vulnerability of deep learning models to adversarial time series examples, which could potentially make them unreliable in situations where the decision taken by the classifier is crucial such as in medicine and security. For computer vision problems, such attacks have been shown to be very easy to perform by altering the image and adding an imperceptible amount of noise to trick the network into wrongly classifying the input image. Following this line of work, we propose to leverage existing adversarial attack mechanisms to add a special noise to the input time series in order to decrease the network’s confidence when classifying instances at test time. Our results reveal that current state-of-the-art deep learning time series classifiers are vulnerable to adversarial attacks which can have major consequences in multiple domains such as food safety and quality assurance.
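The attack template the authors leverage comes from computer vision; the fast gradient sign method applied to a plain logistic classifier over a raw series shows the mechanics (a generic sketch with a hand-built linear model, not the deep networks attacked in the paper):

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps):
    """Fast Gradient Sign Method against a logistic classifier on a raw
    time series x: step every time point by eps in the direction that
    increases the cross-entropy loss for the true label y."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))   # model's P(class 1 | x)
    grad_x = (p - y) * w                     # gradient of the loss w.r.t. x
    return x + eps * np.sign(grad_x)

# A hand-built linear classifier separating upward from downward trends.
t = np.linspace(0, 1, 50)
w = t - t.mean()              # weights correlate with the series' slope
b = 0.0

def predict(x):
    return int(w @ x + b > 0)

x = 2.0 * t                   # upward trend, classified as class 1
x_adv = fgsm_perturb(x, w, b, y=1, eps=1.0)
```

The same sign-of-gradient step, with a much smaller `eps`, is what makes adversarial perturbations hard to notice against deep classifiers.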
Most existing methods for anomaly detection use only positive data to learn the data distribution, so they usually need a pre-defined threshold at the detection stage to determine whether a test instance is an outlier. Unfortunately, a good threshold is vital for performance, and it is really hard to find an optimal one. In this paper, we take the discriminative information implied in unlabeled data into consideration and propose a new method for anomaly detection that can learn the labels of unlabeled data directly. Our proposed method has an end-to-end architecture with one encoder and two decoders that are trained to model the data distributions of inliers and outliers in a competitive way. This architecture works in a discriminative manner without suffering from overfitting, and the training algorithm of our model is adapted from SGD, so it is efficient and scalable even for large-scale datasets. Empirical studies on 7 datasets, including KDD99, MNIST, Caltech-256, and ImageNet, show that our model outperforms state-of-the-art methods.
In this work, we are motivated to make predictive functionality native to database systems, with a focus on time series data. We propose a system architecture, Time Series Predict DB, that enables predictive queries in any existing time series database by building an additional ‘prediction index’ for time series data. To be effective, such an index needs to be built incrementally while keeping up with database throughput, scale with the volume of data, provide accurate predictions for heterogeneous data, and allow ‘predictive’ querying with latency comparable to traditional database queries. Building upon a recently developed model-agnostic time series algorithm by making it incremental and scalable, we build such a system on top of PostgreSQL. Using extensive experimentation, we show that our incremental prediction index updates faster than PostgreSQL ($1\mu s$ per data point for the prediction index vs $4\mu s$ per data point for PostgreSQL) and thus does not affect the throughput of the database. Across a variety of time series data, we find that our incremental, model-agnostic algorithm provides better accuracy than the best state-of-the-art time series libraries (median improvement in the range 3.29 to 4.19x over Facebook’s Prophet, 1.27 to 1.48x over AMELIA in R). The latency of predictive queries with respect to SELECT queries (0.5ms) is < 1.9x (0.8ms) for imputation and < 7.6x (3ms) for forecasting across machine platforms. As a by-product, we find that the incremental, scalable variant we propose improves the accuracy of the batch prediction algorithm, which may be of interest in its own right.
readPTU is a Python package designed to analyze time-correlated single-photon counting data. The library promotes storage of the complete photon arrival-time information and full flexibility in post-processing the data for analysis. It supports the computation of time-resolved signals with external triggers, and second-order autocorrelation function analysis can be performed using multiple algorithms that offer the user different trade-offs between speed and accuracy. Additionally, a thresholding algorithm for time post-selection is also available. The library has been designed with performance and extensibility in mind, to allow future users to implement support for additional file extensions and algorithms without having to deal with low-level details. We demonstrate the performance of readPTU by analyzing the second-order autocorrelation function of the resonance fluorescence from a single quantum dot in a two-dimensional semiconductor.
It has been proved that gradient descent converges linearly to a global minimum when training deep neural networks in the over-parameterized regime. However, according to \citet{allen2018convergence}, for a residual network (ResNet) the width of each layer should grow at least polynomially with the depth (the number of layers) in order to guarantee the linear convergence of gradient descent, which shows no obvious advantage over a feedforward network. In this paper, we successfully remove the dependence of the width on the depth of the network for ResNet, and reach the conclusion that training a deep residual network can be as easy as training a two-layer network. This theoretically justifies the benefit of skip connections in terms of facilitating the convergence of gradient descent. Our experiments also confirm that the width of a ResNet required to guarantee successful training is much smaller than that of a deep feedforward neural network.
We propose a topic-guided variational autoencoder (TGVAE) model for text generation. Distinct from existing variational autoencoder (VAE) based approaches, which assume a simple Gaussian prior for the latent code, our model specifies the prior as a Gaussian mixture model (GMM) parametrized by a neural topic module. Each mixture component corresponds to a latent topic, which provides guidance to generate sentences under the topic. The neural topic module and the VAE-based neural sequence module in our model are learned jointly. In particular, a sequence of invertible Householder transformations is applied to endow the approximate posterior of the latent code with high flexibility during model inference. Experimental results show that our TGVAE outperforms alternative approaches on both unconditional and conditional text generation, which can generate semantically-meaningful sentences with various topics.

### nice student project

In all of my undergraduate classes, I require a term project, done in groups of 3-4 students. Though the topic is specified, it is largely open-ended, a level of “freedom” that many students are unaccustomed to. However, some adapt quite well. The topic this quarter was to choose a CRAN package that does not use any C/C++, and try to increase speed by converting some of the code to C/C++.

Some of the project submissions were really excellent. I decided to place one on the course Web page, and chose this one. Nice usage of Rcpp and devtools (neither of which was covered in class), very nicely presented.


### Let’s get it right

Article: Are We Being Programmed?

As machine learning (ML) systems have advanced, they have acquired more power over humans’ lives, and questions about what values are embedded in them have become more complex and fraught. It is conceivable that in the coming decades, humans may succeed in creating artificial general intelligence (AGI) that thinks and acts with an open-endedness and autonomy comparable to that of humans. The implications would be profound for our species; they are now widely debated not just in science fiction and speculative research agendas but increasingly in serious technical and policy conversations. Much work is underway to try to weave ethics into advancing ML research. We think it useful to add the lens of parenting to these efforts, and specifically radical, queer theories of parenting that consciously set out to nurture agents whose experiences, objectives and understanding of the world will necessarily be very different from their parents’. We propose a spectrum of principles which might underpin such an effort; some are relevant to current ML research, while others will become more important if AGI becomes more likely. These principles may encourage new thinking about the development, design, training, and release into the world of increasingly autonomous agents.
As Artificial Intelligence (AI) becomes an integral part of our life, the development of explainable AI, embodied in the decision-making process of an AI or robotic agent, becomes imperative. For a robotic teammate, the ability to generate explanations for its behavior is one of the key requirements of an explainable agency. Prior work on explanation generation focuses on supporting the reasoning behind the robot’s behavior. These approaches, however, fail to consider the cognitive effort needed to understand the received explanation. In particular, the human teammate is expected to understand any explanation provided before the task execution, no matter how much information is presented in the explanation. In this work, we argue that an explanation, especially a complex one, should be made in an online fashion during the execution, which helps spread out the information to be explained and thus reduces the cognitive load on humans. However, a challenge here is that the different parts of an explanation are dependent on each other, which must be taken into account when generating online explanations. To this end, a general formulation of online explanation generation is presented. We base our explanation generation method on a model reconciliation setting introduced in our prior work. Our approach is evaluated both with human subjects in a standard planning competition (IPC) domain, using the NASA Task Load Index (TLX), as well as in simulation with four different problems.
Affective Computing is a rapidly growing field spurred by advancements in artificial intelligence, but often held back by the inability to translate psychological theories of emotion into tractable computational models. To address this, we propose a probabilistic programming approach to affective computing, which models psychologically grounded theories as generative models of emotion, and implements them as stochastic, executable computer programs. We first review probabilistic approaches that integrate reasoning about emotions with reasoning about other latent mental states (e.g., beliefs, desires) in context. Recently-developed probabilistic programming languages offer several key desiderata over previous approaches, such as: (i) flexibility in representing emotions and emotional processes; (ii) modularity and compositionality; (iii) integration with deep learning libraries that facilitate efficient inference and learning from large, naturalistic data; and (iv) ease of adoption. Furthermore, using a probabilistic programming framework allows a standardized platform for theory-building and experimentation: Competing theories (e.g., of appraisal or other emotional processes) can be easily compared via modular substitution of code followed by model comparison. To jumpstart adoption, we illustrate our points with executable code that researchers can easily modify for their own models. We end with a discussion of applications and future directions of the probabilistic programming approach.
Press the pause button! Artificial Intelligence (AI) continues to be a growing focus in the media. An agenda gathering momentum like the cloud did, particularly in the business world. On a global path of technology innovation, AI may seem the next logical step towards progress. Computing power, storage, and processor speed have rapidly improved, and it’s now the turn of the algorithms. But what is progress? What is the cost? And is this what humanity really needs or wants? Who decides? A good place to begin is to define what AI actually is. For the purpose of this post, AI is software that, when executed, can demonstrate an element of decision-making where a programmed result may be unknown, and would typically require human intelligence to perform the decision-making task. AI usually includes an aspect of automated processing that engages one or more of the human senses i.e. sight, speech, sound, taste or smell. Recent discussions in the media, online articles and radio broadcasts sometimes blur the lines between two identifiable AI spaces:
• Near term: machines to perform faster, identify patterns, make unaided decisions, and undertake relatively complex tasks with the goal of reducing any human requirement to perform the same tasks.
• Long term: machines to potentially possess the characteristic of ‘consciousness’ – this is a different space.
Conversations can wander between the impact of these two very different visions. I recently listened to a radio discussion where a caller spoke about a cull on jobs and the knock-on effects within society, but then the caller leapt to a possibility that machines could wipe out humankind. The distinction between the two is important.
The field of machine ethics is concerned with the question of how to embed ethical behaviors, or a means to determine ethical behaviors, into artificial intelligence (AI) systems. The goal is to produce artificial moral agents (AMAs) that are either implicitly ethical (designed to avoid unethical consequences) or explicitly ethical (designed to behave ethically). Van Wynsberghe and Robbins’ (2018) paper Critiquing the Reasons for Making Artificial Moral Agents critically addresses the reasons offered by machine ethicists for pursuing AMA research; this paper, co-authored by machine ethicists and commentators, aims to contribute to the machine ethics conversation by responding to that critique. The reasons for developing AMAs discussed in van Wynsberghe and Robbins (2018) are: it is inevitable that they will be developed; the prevention of harm; the necessity for public trust; the prevention of immoral use; such machines are better moral reasoners than humans, and building these machines would lead to a better understanding of human morality. In this paper, each co-author addresses those reasons in turn. In so doing, this paper demonstrates that the reasons critiqued are not shared by all co-authors; each machine ethicist has their own reasons for researching AMAs. But while we express a diverse range of views on each of the six reasons in van Wynsberghe and Robbins’ critique, we nevertheless share the opinion that the scientific study of AMAs has considerable value.
The ethical decisions behind the acquisition and analysis of audio, video or physiological human data, harnessed for (deep) machine learning algorithms, is an increasing concern for the Artificial Intelligence (AI) community. In this regard, herein we highlight the growing need for responsible, and representative data collection and analysis, through a discussion of modality diversification. Factors such as Auditability, Benchmarking, Confidence, Data-reliance, and Explainability (ABCDE), have been touched upon within the machine learning community, and here we lay out these ABCDE sub-categories in relation to the acquisition and analysis of multimodal data, to weave through the high priority ethical concerns currently under discussion for AI. To this end, we propose how these five subcategories can be included in early planning of such acquisition paradigms.
Cancer analysis and prediction is a vitally important research field for the well-being of humankind. Cancer data are analyzed and outcomes predicted using machine learning algorithms, and most researchers claim prediction accuracies approaching 99%. However, we show that machine learning algorithms can easily be made to predict with an accuracy of 100% on the Wisconsin Diagnostic Breast Cancer dataset, and that this method of gaining accuracy is an unethical approach: the algorithms are easily misled. In this paper, we exploit this weakness of machine learning algorithms and perform extensive, rigorously evaluated experiments to validate our claim. The paper focuses on the correctness of accuracy and reports three key outcomes of the experiments: the correctness of accuracies, the significance of minimum accuracy, and the correctness of machine learning algorithms.
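The abstract is terse about how an algorithm is 'misled' into 100% accuracy; one standard way this happens in practice is target leakage, where a feature derived from the label sneaks into the inputs. A self-contained illustration (my own toy example, not the paper's experiments):

```python
import numpy as np

def best_stump(X, y):
    """Exhaustively pick the single feature, threshold and sign with the
    best training accuracy (a deliberately simple stand-in for a learner)."""
    best, best_acc = (0, 0.0, 1), 0.0
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = (sign * (X[:, j] - thr) > 0).astype(int)
                acc = float((pred == y).mean())
                if acc > best_acc:
                    best_acc, best = acc, (j, thr, sign)
    return best, best_acc

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))            # genuine but uninformative features
y = rng.integers(0, 2, size=n)         # labels independent of X

# Target leakage: a "feature" derived from the label sneaks into the inputs.
X_leaky = np.column_stack([X, y.astype(float)])

_, acc_honest = best_stump(X, y)
(j, _, _), acc_leaky = best_stump(X_leaky, y)
```

On the honest features the stump stays near chance level, while the leaky dataset yields a perfect score by simply reading the label back out of column 5, which is exactly the kind of too-good-to-be-true result the paper warns about.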

### Magister Dixit

“Within 10 years, data science will be so enmeshed within industry-specific applications and broad productivity tools that we may no longer think of it as a hot career. Just as generations of math and statistics students have gone on to fill all manner of roles in business and academia without thinking of themselves as mathematicians or statisticians, the newly minted data scientist grads will be tomorrow’s manufacturing engineers, marketing leaders and medical researchers.” Nate Oostendorp ( Mar 1, 2019 )

Paul Alper pointed me to this news story, “Harvard Calls for Retraction of Dozens of Studies by Noted Cardiac Researcher: Some 31 studies by Dr. Piero Anversa contain fabricated or falsified data, officials concluded. Dr. Anversa popularized the idea of stem cell treatment for damaged hearts.”

I replied: Ahhh, Harvard . . . the reporter should’ve asked Marc Hauser for a quote.

Alper responded:

Marc Hauser’s research involved “cotton-top tamarin monkeys” while Piero Anversa was falsifying and spawning research on damaged hearts:

The cardiologist rocketed to fame in 2001 with a flashy paper claiming that, contrary to scientific consensus, heart muscle could be regenerated. If true, the research would have had enormous significance for patients worldwide.

I, and I suspect that virtually all of the other contributors to your blog know nothing** about cotton-top tamarin monkeys but are fascinated and interested in stem cells and heart regeneration. Consequently, are Hauser and Anversa separated by a chasm or should they be lumped together in the Hall of Shame? Put another way, do we have yet an additional instance of crime and appropriate punishment?

**Your blog audience is so broad that there well may be cotton-top tamarin monkey mavens out there dying to hit the enter key.

Good point. It’s not up to me at all: I don’t administer punishment of any sort; as a blogger I function as a very small news organization, and my only role is to sometimes look into these cases, bring them to others’ notice, and host discussions. If it were up to me, David Weakliem and Jay Livingston would be regular New York Times columnists, and Mark Palko and Joseph Delaney would be the must-read bloggers that everyone would check each morning. Also, if it were up to me, everyone would have to post all their data and code—at least, that would be the default policy; researchers would have to give very good reasons to get out of this requirement. (Not that I always or even usually post my data and code; but I should do better too.) But none of these things are up to me.

From Harvard’s point of view, perhaps the question is whether they should go easy on people like Hauser, a person who is basically an entertainer, and whose main crime was to fake some of his entertainment—a sort of Doris Kearns Goodwin, if you will—and be tougher on people such as Anversa, whose misdeeds can cost lives. (I don’t know where you should put someone like John Yoo who advocated for actual torture, but I suppose that someone who agreed with Yoo politically would make a similar argument against, say, old-style apologists for the Soviet Union.)

One argument for not taking people like Hauser, Wansink, etc., seriously, even in their misdeeds, is that after the flaws in their methods were revealed—after it turned out that their blithe confidence (in Wansink’s case) or attacks on whistleblowers (in Hauser’s case) were not borne out by the data—these guys just continued to say their original claims were valid. So, for them, it was never about the data at all, it was always about their stunning ideas. Or, to put it another way, the data were there to modify the details of their existing hypotheses, or to allow them to gently develop and extend their models, in a way comparable to how Philip K. Dick used the I Ching to decide what would happen next in his books. (Actually, that analogy is pretty good, as one could just as well say that Dick used randomness not so much to “decide what would happen” but rather “to discover what would happen” next.)

Anyway, to get back to the noise-miners: The supposed empirical support was just there for them to satisfy the conventions of modern-day science. So when it turned out that the promised data had never been there . . . so what, really? The data never mattered in the first place, as these researchers implicitly admitted by not giving up on any of their substantive claims. So maybe these profs should just move into the Department of Imaginative Literature and the universities can call it a day. The medical researchers who misreport their data: That’s a bigger problem.

And what about the news media, myself included? Should I spend more time blogging about medical research and less time blogging about social science research? It’s a tough call. Social science is my own area of expertise, so I think I’m making more of a contribution by leveraging that expertise than by opining on medical research that I don’t really understand.

A related issue is accessibility: people send me more items on social science, and it takes me less effort to evaluate social science claims.

Also, I think social science is important. It does not seem that there’s any good evidence that elections are determined by shark attacks or the outcomes of college football games, or that subliminal smiley faces cause large swings in opinion, or that women’s political preferences vary greatly based on time of the month—but if any (or, lord help us, all) of these claims were true, then this would be consequential: it would “punch a big hole in democratic theory,” in the memorable words of Larry Bartels.

Monkey language and bottomless soup bowls: I don’t care about those so much. So why have I devoted so much blog space to those silly cases? Partly it’s from a fascination with people who refuse to admit error even when it’s staring them in the face, partly because it can give insights into general issues in statistics and science, and partly because I think people can miss the point in these cases by focusing on the drama and missing out on the statistics; see for example here and here. But mostly I write more about social science because social science is my “thing.” Just like I write more about football and baseball than about rugby and cricket.

P.S. One more thing: Don’t forget that in all these fields, social science, medical science, whatever, the problem is not just with bad research, cheaters, or even incompetents. No, there are big problems even with solid research done by honest researchers who are doing their best but are still using methods that misrepresent what we learn from the data. For example, the ORBITA study of heart stents, where p=0.20 (actually p=0.09 when the data were analyzed more appropriately) was widely reported as implying no effect. Honesty and transparency—and even skill and competence in the use of standard methods—are not enough. Sometimes, as in the above post, it makes sense to talk about flat-out bad research and the prominent people who do it, but that’s only one part of the story.
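To see how a p-value around 0.20 can be reported as “no effect” while the data remain consistent with a large benefit, here is a small sketch. The effect estimate and standard error below are hypothetical stand-ins chosen to give p ≈ 0.20; this is not the actual ORBITA analysis.

```python
import math

# Hypothetical numbers for illustration: an estimated treatment effect
# (e.g., seconds of improvement in exercise time) and its standard error,
# chosen so that the two-sided p-value comes out near 0.20.
effect = 16.6
se = 13.0

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

z = effect / se
p = 2.0 * (1.0 - norm_cdf(abs(z)))               # two-sided p-value
ci = (effect - 1.96 * se, effect + 1.96 * se)    # 95% confidence interval

print(round(p, 2), [round(v, 1) for v in ci])
```

The interval spans everything from a small harm to a substantial benefit, which is exactly why “p > 0.05, therefore no effect” misrepresents what the data say.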

### ShinyProxy 2.2.0

(This article was first published on Open Analytics, and kindly contributed to R-bloggers)

ShinyProxy is a novel, open source platform to deploy Shiny apps for the enterprise
or larger organizations.

### Secured Embedding of Shiny Apps

Since version 2.0.1, ShinyProxy provides a REST API to manage (launch, shut down) Shiny apps and to consume the content programmatically inside broader web applications or portals. This allows a clean separation of responsibility between the Shiny apps (data science teams) and those broader applications (IT teams), while still achieving seamless integration between the two from the user’s perspective. With this release we go one step further and support the industry standard for protecting REST APIs, namely OAuth 2.0.

In practice this means the following: when users of the portal log on, they typically authenticate with an OAuth2 provider (e.g. Auth0). This then allows the web application to access the ShinyProxy API on their behalf and launch Shiny apps over it. We leave out the details on authorization codes and access tokens, but the core message is that you can now embed Shiny apps in virtually any other web application in a secure way. If you want an actual example, please head to our GitHub page with ShinyProxy configuration examples, where a sample Node.js application is made available to demonstrate the full scenario.

### Miscellaneous improvements

More generally, users are deploying ShinyProxy on a wide array of cloud platforms and with a great variety of authentication technologies. A lot of the experience gained can now be found in updated documentation, with additional examples on e.g. AWS Cognito (here) or Microsoft Azure AD B2C (here), next to Google and Auth0 (here). These production deployments also called for more extensive documentation on logging with ShinyProxy, and at that level we introduced a new setting, logging.requestdump, to enable full request dumps for advanced debugging. Finally, for user convenience, we introduced user-friendly URLs to access an app either via the standard ShinyProxy interface (/app/) or directly (/app_direct/) if needed.

Full release notes can be found on the downloads page and updated documentation can be found on https://shinyproxy.io. As always, community support on this new release is available at

https://support.openanalytics.eu

Don’t hesitate to send in questions or suggestions and have fun with ShinyProxy!


### What’s new on arXiv

Distance preserving visualization techniques have emerged as one of the fundamental tools for data analysis. One example is the family of techniques that arrange data instances into two-dimensional grids so that the pairwise distances among the instances are preserved in the produced layouts. Currently, the state-of-the-art approaches produce such grids by solving assignment problems or using permutations to optimize cost functions. Although precise, such strategies are computationally expensive, limited to small datasets, or dependent on specialized hardware to speed up the process. In this paper, we present a new technique, called Distance-preserving Grid (DGrid), that employs a binary space partitioning process in combination with multidimensional projections to create orthogonal regular grid layouts. Our results show that DGrid is as precise as the existing state-of-the-art techniques while requiring only a fraction of the running time and computational resources.
This paper formulates dynamic density functions, based upon skewed-t and similar representations, to model and forecast electricity price spreads between different hours of the day. This supports an optimal day ahead storage and discharge schedule, and thereby facilitates a bidding strategy for a merchant arbitrage facility into the day-ahead auctions for wholesale electricity. The four latent moments of the density functions are dynamic and conditional upon exogenous drivers, thereby permitting the mean, variance, skewness and kurtosis of the densities to respond hourly to such factors as weather and demand forecasts. The best specification for each spread is selected based on the Pinball Loss function, following the closed form analytical solutions of the cumulative density functions. Those analytical properties also allow the calculation of risk associated with the spread arbitrages. From these spread densities, the optimal daily operation of a battery storage facility is determined.
Global fire activity has a huge impact on human lives. In recent years, many fire models have been developed to forecast fire activity. They present good results for some regions but require complex parametrizations and input variables that are not easily obtained or estimated. In this paper, we evaluate the possibility of using historical data from 2003 to 2017 of active fire detections (NASA’s MODIS MCD14ML C6) and time series forecasting methods to estimate global fire season severity (FSS), defined here as the accumulated fire detections in a season. We used a hexagonal grid to divide the globe, and we extracted time series of daily fire counts from each cell. We propose a straightforward method to estimate fire season lengths. Our results show that in 99% of the cells, the fire seasons are shorter than seven months. Given this result, we extracted the fire seasons defined as time windows of seven months centered on the months with the highest fire occurrence. A trend analysis suggests a global decrease in length and severity. Since the FSS time series are short, we used the monthly-accumulated fire counts (MA-FC) to train and test the seven forecasting models. Results show low forecasting errors in some areas. We therefore conclude that many regions present predictable variations in FSS.
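The season definition used in the abstract above, a seven-month window centered on the peak month, is easy to make concrete. In this sketch the monthly counts are invented for illustration; the paper itself aggregates daily MODIS detections per hexagonal grid cell.

```python
# Monthly fire counts for one hypothetical grid cell (Jan..Dec).
monthly_counts = [3, 1, 0, 2, 8, 30, 55, 70, 40, 12, 4, 2]

# Peak month: the month with the highest fire occurrence.
peak = max(range(12), key=lambda m: monthly_counts[m])

# Seven-month window centered on the peak, wrapping around the year boundary.
window = [(peak + offset) % 12 for offset in range(-3, 4)]

# Fire season severity (FSS): accumulated detections inside the window.
severity = sum(monthly_counts[m] for m in window)

print(peak, severity)
```

With these made-up counts the peak falls in August (index 7) and the window runs from May through November.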
We assess the performance, in terms of coverage probability and expected length, of confidence intervals centered on the bootstrap smoothed (bagged) estimator, for two nested linear regression models, with unknown error variance, and model selection using a preliminary t test.
We consider a novel question answering (QA) task where the machine needs to read from large streaming data (long documents or videos) without knowing when the questions will be given, in which case the existing QA methods fail due to lack of scalability. To tackle this problem, we propose a novel end-to-end reading comprehension method, which we refer to as Episodic Memory Reader (EMR), that sequentially reads the input contexts into an external memory, while replacing memories that are less important for answering unseen questions. Specifically, we train an RL agent to replace a memory entry when the memory is full in order to maximize its QA accuracy at a future timepoint, while encoding the external memory using the transformer architecture to learn representations that consider relative importance between the memory entries. We validate our model on a real-world large-scale textual QA task (TriviaQA) and a video QA task (TVQA), on which it achieves significant improvements over rule-based memory scheduling policies or an RL-based baseline that learns the query-specific importance of each memory independently.
Finding the best neural network architecture requires significant time, resources, and human expertise. These challenges are partially addressed by neural architecture search (NAS), which is able to find the best convolutional layer or cell that is then used as a building block for the network. However, once a good building block is found, manual design is still required to assemble the final architecture as a combination of multiple blocks under a predefined parameter budget constraint. A common solution is to stack these blocks into a single tower and adjust the width and depth to fill the parameter budget. However, these single tower architectures may not be optimal. Instead, in this paper we present the AdaNAS algorithm, which uses ensemble techniques to compose a neural network as an ensemble of smaller networks automatically. Additionally, we introduce a novel technique based on knowledge distillation to iteratively train the smaller networks using the previous ensemble as a teacher. Our experiments demonstrate that ensembles of networks improve accuracy upon a single neural network while keeping the same number of parameters. Our models achieve comparable results with the state-of-the-art on CIFAR-10 and set a new state-of-the-art on CIFAR-100.
Humans have an incredible ability to process and understand information from multiple sources such as images, video, text, and speech. Recent success of deep neural networks has enabled us to develop algorithms which give machines the ability to understand and interpret this information. There is a need to both broaden their applicability and develop methods which correlate visual information along with semantic content. We propose a unified model which jointly trains on images and captions, and learns to generate new captions given either an image or a caption query. We evaluate our model on three different tasks, namely cross-modal retrieval, image captioning, and sentence paraphrasing. Our model gains insight into cross-modal vector embeddings, generalizes well on multiple tasks, and is competitive with state-of-the-art methods on retrieval.
We consider hypothesis testing problems for a single covariate in the context of a linear model with Gaussian design when $p>n$. Under minimal sparsity conditions of their type and without any compatibility condition, we construct an asymptotically Gaussian estimator with variance equal to the oracle least-squares. The estimator is based on a weighted average of all models of a given sparsity level in the spirit of exponential weighting. We adapt this procedure to estimate the signal strength and provide a few applications. We support our results using numerical simulations based on an algorithm which approximates the theoretical estimator, and provide a comparison with the de-biased lasso.
In this paper, we consider groups of agents in a network that select actions in order to satisfy a set of constraints that vary arbitrarily over time and minimize a time-varying function of which they have only local observations. The selection of actions, also called a strategy, is causal and decentralized, i.e., the dynamical system that determines the actions of a given agent depends only on the constraints at the current time and on its own actions and those of its neighbors. To determine such a strategy, we propose a decentralized saddle point algorithm and show that the corresponding global fit and regret are bounded by functions of the order of $\sqrt{T}$. Specifically, we define the global fit of a strategy as a vector that integrates over time the global constraint violations as seen by a given node. The fit is a performance loss associated with online operation as opposed to offline clairvoyant operation which can always select an action if one exists, that satisfies the constraints at all times. If this fit grows sublinearly with the time horizon it suggests that the strategy approaches the feasible set of actions. Likewise, we define the regret of a strategy as the difference between its accumulated cost and that of the best fixed action that one could select knowing beforehand the time evolution of the objective function. Numerical examples support the theoretical conclusions.
Discrepancy between training and testing domains is a fundamental problem in the generalization of machine learning techniques. Recently, several approaches have been proposed to learn domain invariant feature representations through adversarial deep learning. However, label shift, where the percentage of data in each class is different between domains, has received less attention. Label shift naturally arises in many contexts, especially in behavioral studies where the behaviors are freely chosen. In this work, we propose a method called Domain Adversarial nets for Target Shift (DATS) to address label shift while learning a domain invariant representation. This is accomplished by using distribution matching to estimate label proportions in a blind test set. We extend this framework to handle multiple domains by developing a scheme to upweight source domains most similar to the target domain. Empirical results show that this framework performs well under large label shift in synthetic and real experiments, demonstrating its practical importance.
As Artificial Intelligence (AI) becomes an integral part of our life, the development of explainable AI, embodied in the decision-making process of an AI or robotic agent, becomes imperative. For a robotic teammate, the ability to generate explanations to explain its behavior is one of the key requirements of an explainable agency. Prior work on explanation generation focuses on supporting the reasoning behind the robot’s behavior. These approaches, however, fail to consider the cognitive effort needed to understand the received explanation. In particular, the human teammate is expected to understand any explanation provided before the task execution, no matter how much information is presented in the explanation. In this work, we argue that explanations, especially complex ones, should be made in an online fashion during the execution, which helps to spread out the information to be explained and thus reduces the cognitive load on humans. However, a challenge here is that the different parts of an explanation are dependent on each other, which must be taken into account when generating online explanations. To this end, a general formulation of online explanation generation is presented. We base our explanation generation method on a model reconciliation setting introduced in our prior work. Our approach is evaluated both with human subjects in a standard planning competition (IPC) domain, using NASA Task Load Index (TLX), as well as in simulation with four different problems.
Recently, edge caching and multicasting arise as two promising technologies to support high-data-rate and low-latency delivery in wireless communication networks. In this paper, we design three transmission schemes aiming to minimize the delivery latency for cache-enabled multigroup multicasting networks. In particular, a full caching bulk transmission scheme is first designed as a performance benchmark for the ideal situation where the caching capability of each enhanced remote radio head (eRRH) is sufficiently large to cache all files. For the practical situation where the caching capability of each eRRH is limited, we further design two transmission schemes, namely partial caching bulk transmission (PCBT) and partial caching pipelined transmission (PCPT) schemes. In the PCBT scheme, eRRHs first fetch the uncached requested files from the baseband unit (BBU) and then all requested files are simultaneously transmitted to the users. In the PCPT scheme, eRRHs first transmit the cached requested files while fetching the uncached requested files from the BBU. Then, the remaining cached requested files and fetched uncached requested files are simultaneously transmitted to the users. The design goal of the three transmission schemes is to minimize the delivery latency, subject to some practical constraints. Efficient algorithms are developed for the low-latency cloud-edge coordinated transmission strategies. Numerical results are provided to evaluate the performance of the proposed transmission schemes and show that the PCPT scheme outperforms the PCBT scheme in terms of the delivery latency criterion.
Affective Computing is a rapidly growing field spurred by advancements in artificial intelligence, but often held back by the inability to translate psychological theories of emotion into tractable computational models. To address this, we propose a probabilistic programming approach to affective computing, which models psychologically-grounded theories as generative models of emotion, and implements them as stochastic, executable computer programs. We first review probabilistic approaches that integrate reasoning about emotions with reasoning about other latent mental states (e.g., beliefs, desires) in context. Recently-developed probabilistic programming languages offer several key desiderata over previous approaches, such as: (i) flexibility in representing emotions and emotional processes; (ii) modularity and compositionality; (iii) integration with deep learning libraries that facilitate efficient inference and learning from large, naturalistic data; and (iv) ease of adoption. Furthermore, using a probabilistic programming framework allows a standardized platform for theory-building and experimentation: competing theories (e.g., of appraisal or other emotional processes) can be easily compared via modular substitution of code followed by model comparison. To jumpstart adoption, we illustrate our points with executable code that researchers can easily modify for their own models. We end with a discussion of applications and future directions of the probabilistic programming approach.
With the tremendous growth in the number of scientific papers being published, searching for references while writing a scientific paper is a time-consuming process. A technique that could add a reference citation at the appropriate place in a sentence would be beneficial. In this perspective, context-aware citation recommendation has been researched for around two decades. Many researchers have utilized the text data called the context sentence, which surrounds the citation tag, and the metadata of the target paper to find the appropriate cited research. However, the lack of well-organized benchmarking datasets and of models that attain high performance has made the research difficult. In this paper, we propose a deep learning based model and well-organized dataset for context-aware paper citation recommendation. Our model comprises a document encoder and a context encoder, which use a Graph Convolutional Networks (GCN) layer and Bidirectional Encoder Representations from Transformers (BERT), a pre-trained model for textual data. By modifying the related PeerRead dataset, we propose a new dataset called FullTextPeerRead containing context sentences and cited references and paper metadata. To the best of our knowledge, this dataset is the first well-organized dataset for context-aware paper recommendation. The results indicate that the proposed model with the proposed datasets can attain state-of-the-art performance and achieve a more than 28% improvement in mean average precision (MAP) and recall@k.
Stock prices are influenced by numerous factors. We present a method to combine these factors and we validate the method by taking the international stock market as a case study. In today’s increasingly international economy, return and volatility spillover effects across international equity markets are major macroeconomic drivers of stock dynamics. Thus, foreign market information is one of the most important factors in forecasting domestic stock prices. However, the cross-correlation between domestic and foreign markets is so complex that it would be extremely difficult to express it explicitly with a dynamical equation. In this study, we develop stock return prediction models that can jointly consider international markets, using multimodal deep learning. Our contributions are three-fold: (1) we visualize the transfer information between South Korea and US stock markets using scatter plots; (2) we incorporate the information into stock prediction using multimodal deep learning; (3) we conclusively show that both early and late fusion models achieve a significant performance boost in comparison with single modality models. Our study indicates that considering international stock markets jointly can improve prediction accuracy, and deep neural networks are very effective for such tasks.
Hyperparameters and learning algorithms for neuromorphic hardware are usually chosen by hand. In contrast, the hyperparameters and learning algorithms of networks of neurons in the brain, which they aim to emulate, have been optimized through extensive evolutionary and developmental processes for specific ranges of computing and learning tasks. Occasionally this process has been emulated through genetic algorithms, but these themselves require hand-design of their details and tend to provide a limited range of improvements. We instead employ other powerful gradient-free optimization tools, such as cross-entropy methods and evolutionary strategies, in order to port the function of biological optimization processes to neuromorphic hardware. As an example, we show that this method produces neuromorphic agents that learn very efficiently from rewards. In particular, meta-plasticity, i.e., the optimization of the learning rule which they use, substantially enhances reward-based learning capability of the hardware. In addition, we demonstrate for the first time Learning-to-Learn benefits from such hardware, in particular the capability to extract abstract knowledge from prior learning experiences that speeds up the learning of new but related tasks. Learning-to-Learn is especially suited for accelerated neuromorphic hardware, since it makes it feasible to carry out the required very large number of network computations.
Syntactic analysis plays an important role in semantic parsing, but the nature of this role remains a topic of ongoing debate. The debate has been constrained by the scarcity of empirical comparative studies between syntactic and semantic schemes, which hinders the development of parsing methods informed by the details of target schemes and constructions. We target this gap, and take Universal Dependencies (UD) and UCCA as a test case. After abstracting away from differences of convention or formalism, we find that most content divergences can be ascribed to: (1) UCCA’s distinction between a Scene and a non-Scene; (2) UCCA’s distinction between primary relations, secondary ones and participants; (3) different treatment of multi-word expressions, and (4) different treatment of inter-clause linkage. We further discuss the long tail of cases where the two schemes take markedly different approaches. Finally, we show that the proposed comparison methodology can be used for fine-grained evaluation of UCCA parsing, highlighting both challenges and potential sources for improvement. The substantial differences between the schemes suggest that semantic parsers are likely to benefit downstream text understanding applications beyond their syntactic counterparts.
We tackle the problem of finding good architectures for multimodal classification problems. We propose a novel and generic search space that spans a large number of possible fusion architectures. In order to find an optimal architecture for a given dataset in the proposed search space, we leverage an efficient sequential model-based exploration approach that is tailored for the problem. We demonstrate the value of posing multimodal fusion as a neural architecture search problem by extensive experimentation on a toy dataset and two other real multimodal datasets. We discover fusion architectures that exhibit state-of-the-art performance for problems with different domain and dataset size, including the NTU RGB+D dataset, the largest multi-modal action recognition dataset available.
Big Data is one of the most popular emerging trends; it has become a blessing for humankind and a necessity of day-to-day life. Facebook, for example. Every person is involved in producing data, either directly or indirectly. Thus, Big Data is a high volume of data with an exponential growth rate that consists of a variety of data. Big Data touches all fields, including the government sector, the IT industry, business, economy, engineering, bioinformatics, and other basic sciences. Thus, Big Data forms a data silo. Most of the data are duplicates and unstructured. To deal with such a data silo, the Bloom Filter is a precious resource for filtering out duplicate data. The Bloom Filter is also indispensable in a Big Data storage system to optimize memory consumption. Undoubtedly, the Bloom Filter uses a tiny amount of memory space to filter a very large volume of data, and it stores information about a large set of data. However, the functionality of the Bloom Filter is limited to membership filtering, though it can be adapted to various applications. Besides, the Bloom Filter is deployed in diverse fields and used in interdisciplinary research areas, bioinformatics for instance. In this article, we expose the usefulness of the Bloom Filter in Big Data research.
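As a concrete illustration of the membership-filter idea discussed above, here is a minimal Bloom filter sketch. The bit-array size and hashing scheme are chosen for readability, not tuned for Big Data workloads.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions derived from SHA-256."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:99"))   # very likely False at this fill level
```

Queries can return false positives (True for an item never added) but never false negatives, which is what makes the structure safe for duplicate filtering at a small, fixed memory cost.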
In standard Convolutional Neural Networks (CNNs), the receptive fields of artificial neurons in each layer are designed to share the same size. It is well-known in the neuroscience community that the receptive field size of visual cortical neurons is modulated by the stimulus, which has rarely been considered in constructing CNNs. We propose a dynamic selection mechanism in CNNs that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information. A building block called Selective Kernel (SK) unit is designed, in which multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches. Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer. Multiple SK units are stacked into a deep network termed Selective Kernel Networks (SKNets). On the ImageNet and CIFAR benchmarks, we empirically show that SKNet outperforms the existing state-of-the-art architectures with lower model complexity. Detailed analyses show that the neurons in SKNet can capture target objects with different scales, which verifies the capability of neurons for adaptively adjusting their receptive field sizes according to the input. The code and models are available at https://…/SKNet.
Deep neural networks have revolutionized many fields such as computer vision and natural language processing. Inspired by this recent success, deep learning has started to show promising results for Time Series Classification (TSC). However, neural networks are still behind the state-of-the-art TSC algorithms, which are currently composed of ensembles of 37 non deep learning based classifiers. We attribute this gap in performance to the lack of neural network ensembles for TSC. Therefore in this paper, we show how an ensemble of 60 deep learning models can significantly improve upon the current state-of-the-art performance of neural networks for TSC, when evaluated over the UCR/UEA archive: the largest publicly available benchmark for time series analysis. Finally, we show how our proposed Neural Network Ensemble (NNE) is the first time series classifier to outperform COTE while reaching similar performance to the current state-of-the-art ensemble HIVE-COTE.
This paper explores the problem of matching entities across different knowledge graphs. Given a query entity in one knowledge graph, we wish to find the corresponding real-world entity in another knowledge graph. We formalize this problem and present two large-scale datasets for this task based on existing cross-ontology links between DBpedia and Wikidata, focused on several hundred thousand ambiguous entities. Using a classification-based approach, we find that a simple multi-layered perceptron based on representations derived from RDF2Vec graph embeddings of entities in each knowledge graph is sufficient to achieve high accuracy, with only small amounts of training data. The contributions of our work are datasets for examining this problem and strong baselines on which future work can be based.
Adversarial examples — perturbations to the input of a model that elicit large changes in the output — have been shown to be an effective way of assessing the robustness of sequence-to-sequence (seq2seq) models. However, these perturbations only indicate weaknesses in the model if they do not change the input so significantly that it legitimately results in changes in the expected output. This fact has largely been ignored in the evaluations of the growing body of related literature. Using the example of untargeted attacks on machine translation (MT), we propose a new evaluation framework for adversarial attacks on seq2seq models that takes the semantic equivalence of the pre- and post-perturbation input into account. Using this framework, we demonstrate that existing methods may not preserve meaning in general, breaking the aforementioned assumption that source side perturbations should not result in changes in the expected output. We further use this framework to demonstrate that adding additional constraints on attacks allows for adversarial perturbations that are more meaning-preserving, but nonetheless largely change the output sequence. Finally, we show that performing untargeted adversarial training with meaning-preserving attacks is beneficial to the model in terms of adversarial robustness, without hurting test performance. A toolkit implementing our evaluation framework is released at https://…/teapot-nlp.
Scaling CNN training is necessary to keep up with growing datasets and reduce training time. We also see an emerging need to handle datasets with very large samples, where memory requirements for training are large. Existing training frameworks use a data-parallel approach that partitions samples within a mini-batch, but limits to scaling the mini-batch size and memory consumption make this untenable for large samples. We describe and implement new approaches to convolution, which parallelize using spatial decomposition or a combination of sample and spatial decomposition. This introduces many performance knobs for a network, so we develop a performance model for CNNs and present a method for using it to automatically determine efficient parallelization strategies. We evaluate our algorithms with microbenchmarks and image classification with ResNet-50. Our algorithms allow us to prototype a model for a mesh-tangling dataset, where sample sizes are very large. We show that our parallelization achieves excellent strong and weak scaling and enables training for previously unreachable datasets.
Bayesian Optimisation (BO) refers to a suite of techniques for global optimisation of expensive black box functions, which use introspective Bayesian models of the function to efficiently find the optimum. While BO has been applied successfully in many applications, modern optimisation tasks usher in new challenges where conventional methods fail spectacularly. In this work, we present Dragonfly, an open source Python library for scalable and robust BO. Dragonfly incorporates multiple recently developed methods that allow BO to be applied in challenging real world settings; these include better methods for handling higher dimensional domains, methods for handling multi-fidelity evaluations when cheap approximations of an expensive function are available, methods for optimising over structured combinatorial spaces, such as the space of neural network architectures, and methods for handling parallel evaluations. Additionally, we develop new methodological improvements in BO for selecting the Bayesian model, selecting the acquisition function, and optimising over complex domains with different variable types and additional constraints. We compare Dragonfly to a suite of other packages and algorithms for global optimisation and demonstrate that when the above methods are integrated, they enable significant improvements in the performance of BO. The Dragonfly library is available at dragonfly.github.io.

### Using R and H2O to identify product anomalies during the manufacturing process.

(This article was first published on R-Analytics, and kindly contributed to R-bloggers)

Introduction:

We will identify anomalous products on the production line by using measurements from testing stations and deep learning models. These anomalies are not failures; they are products close to the measurement limits, so we can raise warnings before the process starts producing failed products, and the stations can get maintenance in time.

Before starting, we need the following software installed and working:

– R language.
– All the R packages mentioned in the R sources.
– Testing station data; I suggest going station by station.
– H2O open source framework.
– Java 8 (for H2O). OpenJDK: https://github.com/ojdkbuild/contrib_jdk8u-ci/releases
– RStudio.

About the data: since I cannot use my real data, for this article I am using the SECOM Data Set from the UCI Machine Learning Repository.

How many records?
– Training data set: in my real project I use 100 thousand records that passed testing, about a month of production data.
– Testing data set: I use the last 24 hours of testing station data.

Let the fun begin.

Deep Learning Model Creation And Testing.

library( h2o )

h2o.init( nthreads = -1, max_mem_size = "5G", port = 6666 )

h2o.removeAll() ## Remove any data from the h2o cluster in preparation for our model.

allData = read.csv( "secom.data", sep = " ", header = FALSE, encoding = "UTF-8" )

# Fixing the data set: there are a lot of NaN values
if( dim( na.omit( allData ) )[1] == 0 ){
  for( colNum in 1:dim( allData )[2] ){

    # Get the valid values from the current column
    ValidColumnValues = allData[, colNum][ !is.nan( allData[, colNum] ) ]

    # Check each value in the current column.
    for( rowNum in 1:dim( allData )[1] ){

      cat( "Processing row:", rowNum, ", column:", colNum, ", data:", allData[rowNum, colNum], "\n" )

      if( is.nan( allData[rowNum, colNum] ) ) {

        # Assign a randomly chosen valid value from the same column to the NaN cell
        # ( indexing by a sampled position avoids the sample() scalar pitfall )
        getValue = ValidColumnValues[ sample( length( ValidColumnValues ), 1 ) ]

        allData[rowNum, colNum] = getValue
      }
    }
  }
}

# Splitting the data: the first 90% for training and the remaining 10% for testing our model.
trainingData = allData[ 1:floor( dim( allData )[1] * .9 ), ]
testingData = allData[ ( floor( dim( allData )[1] * .9 ) + 1 ):dim( allData )[1], ]

# Convert the training dataset to H2O format.

trainingData_hex = as.h2o( trainingData, destination_frame = "train_hex" )

# Set the input variables
featureNames = colnames(trainingData_hex)

# Creating the first model version.
trainingModel = h2o.deeplearning( x = featureNames
                                , training_frame = trainingData_hex
                                , model_id = "Station1DeepLearningModel"
                                , activation = "Tanh"
                                , autoencoder = TRUE
                                , reproducible = TRUE
                                , l1 = 1e-5
                                , ignore_const_cols = FALSE
                                , seed = 1234
                                , hidden = c( 400, 200, 400 ), epochs = 50 )

# Getting the anomalies from the training data to set the minimum MSE ( Mean Squared Error )
# value before flagging a record as an anomaly
trainMSE = as.data.frame( h2o.anomaly( trainingModel
, trainingData_hex
, per_feature = FALSE ) )

# Check the first 30 trainMSE records, sorted in descending order, to see our outliers
head( sort( trainMSE$Reconstruction.MSE, decreasing = TRUE ), 30 )
# 0.020288603 0.017976305 0.012772556 0.011556780 0.010143009 0.009524983 0.007363854
# 0.005889714 0.005604329 0.005189614 0.005185285 0.005118595 0.004639442 0.004497609
# 0.004438342 0.004419993 0.004298936 0.003961503 0.003651326 0.003426971 0.003367108
# 0.003169319 0.002901914 0.002852006 0.002772110 0.002765924 0.002754586 0.002748887
# 0.002619872 0.002474702

# Plotting the errors of reconstructing our training data, to get a graphical view
# of our data reconstruction errors
plot( sort( trainMSE$Reconstruction.MSE ), main = "Reconstruction Error", ylab = "MSE Value." )

# Looking at the chart and the first 30 MSE records sorted in descending order,
# we can choose .01 as our minimum MSE before flagging a record as an anomaly,
# because only a few records have a reconstruction error that large, and we can
# treat those as outliers. This value is something you must decide for your own data.

# Updating the trainingData set, keeping only records with reconstruction error < .01
trainingDataNew = trainingData[ trainMSE$Reconstruction.MSE < .01, ]

h2o.removeAll() ## Remove the data from the h2o cluster in preparation for our final model.

# Convert our new training data frame to H2O format.
trainingDataNew_hex = as.h2o( trainingDataNew, destination_frame = "train_hex" )

# Creating the final model.
trainingModelNew = h2o.deeplearning( x = featureNames
                                   , training_frame = trainingDataNew_hex
                                   , model_id = "Station1DeepLearningModel"
                                   , activation = "Tanh"
                                   , autoencoder = TRUE
                                   , reproducible = TRUE
                                   , l1 = 1e-5
                                   , ignore_const_cols = FALSE
                                   , seed = 1234
                                   , hidden = c( 400, 200, 400 ), epochs = 50 )

################################
# Check our testing data for anomalies.
################################

# Convert our testing data frame to H2O format.
testingDataH2O = as.h2o( testingData, destination_frame = "test_hex" )

# Getting anomalies found in testing data.
testMSE = as.data.frame( h2o.anomaly( trainingModelNew
                                    , testingDataH2O
                                    , per_feature = FALSE ) )

# Binding our data.
testingData = cbind( MSE = testMSE$Reconstruction.MSE, testingData )

anomalies = testingData[ testingData$MSE >= .01, ]

if( dim( anomalies )[1] > 0 ){
  cat( "Anomalies detected in the sample data, station needs maintenance." )
}

Here is the code on GitHub: https://github.com/LaranIkal/ProductAnomaliesDetection

Enjoy it!!!

Carlos Kassab

To leave a comment for the author, please follow the link and comment on their blog: R-Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

### R Packages worth a look

Resampled Data Frames (strapgod)
Create data frames with virtual groups that can be used with ‘dplyr’ to efficiently compute resampled statistics, generate the data for hypothetical ou …

Binscatter Estimation and Inference (binsreg)
Provides tools for statistical analysis using the binscatter methods developed by Cattaneo, Crump, Farrell and Feng (2019a) <arXiv:1902.09608> an …

Bayesian Network Learning Improved Project (r.blip)
Allows the user to learn Bayesian networks from datasets containing thousands of variables. It focuses on score-based learning, mainly the ‘BIC’ and th …

Genetic Approach to Maximize Clustering Criterion (gama)
An evolutionary approach to performing hard partitional clustering. The algorithm uses genetic operators guided by information about the quality of ind …

Distance Object Manipulation Tools (disttools)
Provides convenient methods for accessing the data in ‘dist’ objects with minimal memory and computational overhead. ‘disttools’ can be used to extract …

Utilities to Extract and Process ‘YAML’ Fragments (yum)
Provides a number of functions to facilitate extracting information in ‘YAML’ fragments from one or multiple files, optionally structuring the informat …

### Document worth reading: “A Survey of the Usages of Deep Learning in Natural Language Processing”

Over the last several years, the field of natural language processing has been propelled forward by an explosion in the use of deep learning models. This survey provides a brief introduction to the field and a quick overview of deep learning architectures and methods. It then sifts through the plethora of recent studies and summarizes a large assortment of relevant contributions. Analyzed research areas include several core linguistic processing issues in addition to a number of applications of computational linguistics. A discussion of the current state of the art is then provided along with recommendations for future research in the field. A Survey of the Usages of Deep Learning in Natural Language Processing

### If you did not already know

WebSeg
In this paper, we improve semantic segmentation by automatically learning from Flickr images associated with a particular keyword, without relying on any explicit user annotations, thus substantially alleviating the dependence on accurate annotations when compared to previous weakly supervised methods. To solve such a challenging problem, we leverage several low-level cues (such as saliency, edges, etc.) to help generate a proxy ground truth. Due to the diversity of web-crawled images, we anticipate a large amount of ‘label noise’ in which other objects might be present. We design an online noise filtering scheme which is able to deal with this label noise, especially in cluttered images. We use this filtering strategy as an auxiliary module to help assist the segmentation network in learning cleaner proxy annotations. Extensive experiments on the popular PASCAL VOC 2012 semantic segmentation benchmark show surprisingly good results in both our WebSeg (mIoU = 57.0%) and weakly supervised (mIoU = 63.3%) settings. …

Decreasing-Trend-Nature (DTN)
We propose a novel diminishing learning rate scheme, coined Decreasing-Trend-Nature (DTN), which allows us to prove fast convergence of the Stochastic Gradient Descent (SGD) algorithm to a first-order stationary point for smooth general convex and some classes of nonconvex problems, including neural network applications for classification. We are the first to prove that SGD with a diminishing learning rate achieves a convergence rate of $\mathcal{O}(1/t)$ for these problems. Our theory applies to neural network applications for classification problems in a straightforward way. …

Super-Resolution Erlangen Database (SupER)
Capturing ground truth data to benchmark super-resolution (SR) is challenging. Therefore, current quantitative studies are mainly evaluated on simulated data artificially sampled from ground truth images. We argue that such evaluations overestimate the actual performance of SR methods compared to their behavior on real images. To bridge this simulated-to-real gap, we introduce the Super-Resolution Erlangen (SupER) database, the first comprehensive laboratory SR database of all-real acquisitions with pixel-wise ground truth. It consists of more than 80k images of 14 scenes combining different facets: CMOS sensor noise, real sampling at four resolution levels, nine scene motion types, two photometric conditions, and lossy video coding at five levels. As such, the database exceeds existing benchmarks by an order of magnitude in quality and quantity. This paper also benchmarks 19 popular single-image and multi-frame algorithms on our data. The benchmark comprises a quantitative study by exploiting ground truth data and qualitative evaluations in a large-scale observer study. We also rigorously investigate agreements between both evaluations from a statistical perspective. One interesting result is that top-performing methods on simulated data may be surpassed by others on real data. Our insights can spur further algorithm development, and the publicly available dataset can foster future evaluations. …

## March 23, 2019

### Distilled News

We live in a world that is inundated with data. Data science and machine learning (ML) techniques have come to the rescue in helping enterprises analyze and make sense of these large volumes of data. Enterprises have hired data scientists – people who apply scientific methods to data to build mathematical software models – to generate insights or predictions that enable data-driven business decisions. Typically, data scientists are experts in statistical analysis and mathematical modeling who are proficient in programming languages such as R or Python.
The R Core Team announced yesterday the release of R 3.5.3, and updated binaries for Windows and Linux are now available (with Mac sure to follow soon). This update fixes three minor bugs (to the functions writeLines, setClassUnion, and stopifnot), but you might want to upgrade just to avoid the ‘package built under R 3.5.4’ warnings you might get for new CRAN packages in the future.
On my last post I gave an intuitive demonstration of what’s causal inference and how it’s different than classic ML. After receiving some feedback I realize that while the post was easy to digest, some confusion remains. In this post I’ll delve a bit deeper into what the ‘causal’ in Causal Inference actually means.
Artificial Intelligence and Machine Learning have empowered our lives to a large extent. The number of advancements made in this space has revolutionized our society and continue making society a better place to live in. In terms of perception, both Artificial Intelligence and Machine Learning are often used in the same context which leads to confusion. AI is the concept in which machine makes smart decisions whereas Machine Learning is a sub-field of AI which makes decisions while learning patterns from the input data. In this blog, we would dissect each term and understand how Artificial Intelligence and Machine Learning are related to each other.
Francis Anscombe’s seminal paper ‘Graphs in Statistical Analysis’ (American Statistician, 1973) effectively makes the case that looking at summary statistics of data is insufficient to identify the relationship between variables. He demonstrates this by generating four different data sets (Anscombe’s quartet) which have nearly identical summary statistics. His data have the same mean and variance for x and y, the same correlations between x and y, and the same regression coefficients on the linear projection of y on x. (There are certainly additional, less widely reported summary statistics, such as kurtosis or least absolute deviations/median regression, which would have indicated differences between the data sets.) Yet even with these differences, without graphing the data, any analysis would likely be missing the mark.
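The point is easy to verify directly; this Python snippet reproduces the first two quartet data sets from Anscombe’s paper and computes their (nearly identical) summary statistics, even though set I is noisy-linear and set II is a smooth quadratic arc:

```python
import numpy as np

# Anscombe's quartet, sets I and II (both share the same x values)
x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for y in (y1, y2):
    slope, intercept = np.polyfit(x, y, 1)       # least-squares fit of y on x
    r = np.corrcoef(x, y)[0, 1]                  # Pearson correlation
    print(f"mean(y)={y.mean():.2f}  var(y)={y.var(ddof=1):.2f}  "
          f"r={r:.3f}  fit: y = {intercept:.2f} + {slope:.3f} x")
# Both sets print essentially the same summary: mean 7.50, r = 0.816,
# fitted line y = 3.00 + 0.500 x.
```

Only a plot (or higher-order statistics) reveals how different the two relationships actually are.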
Labelling data is typically a task for end-users, done in their own scripts or functions rather than in packages. However, it can sometimes be useful for both end-users and package developers to have a flexible way to add variable and value labels to their data. In such cases, quasiquotation is helpful. This vignette demonstrates how to use quasiquotation in sjlabelled to label your data.
The goal of this project is to find out the similarities within groups of people in order to build a movie recommending system for users. We are going to analyze a dataset from Netflix database to explore the characteristics that people share in movies’ taste, based on how they rate them.
This post empowers the Pythonista with a complete framework to explore the world of data on the internet, all behind randomized proxy servers in a fast parallelized sequence, while protecting your company’s immutable IP from curious eyes and other potential trolls. With this new outlet, the reader is requested to take all due care: do not abuse the privilege of these acquired ghost-ninja skills, and do not tax any such services inappropriately or unethically. The user takes all responsibility for implementing the attached code (of course) and for all risks associated with running it.
PyTorch is one of the most popular Deep Learning frameworks that is based on Python and is supported by Facebook. In this article we will be looking into the classes that PyTorch provides for helping with Natural Language Processing (NLP).
With all the great sophisticated data tools that exist out there these days, it’s easy to think that spreadsheets are too primitive for use in serious data science work. The fact that there’s literally 20+ years of literature cautioning people about the evils of spreadsheets makes it sound like a ‘real data professional’ should know better than to use such antiquated things. But it’s probably the greatest Swiss army chainsaw for data for the sorts of ugly work that no one ever wants to admit they have to do every day. In an ideal world they wouldn’t be necessary, but when there’s a combination of tech debt, time pressure, poor data quality, and stakeholders who don’t know anything but spreadsheets, they’re invaluable.
Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image. It can be applied to a wide range of applications, such as collection style transfer, object transfiguration, season transfer and photo enhancement.
The EM algorithm finds maximum-likelihood estimates for model parameters when you have incomplete data. The ‘E-step’ computes probabilities for the assignment of data points, based on the currently hypothesized probability density functions; the ‘M-step’ re-estimates the parameters from those soft assignments. The cycle repeats until the parameters stabilize.
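For a concrete picture of that E/M cycle, here is a minimal sketch in Python/NumPy (not tied to any package mentioned above) fitting a two-component 1-D Gaussian mixture to synthetic data; the means, standard deviations, and weights are the hypothesized parameters being updated:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians with true means -2 and 3
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

# Initial hypothesis: two components, rough means, unit spread, equal weights
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def normal_pdf(v, m, s):
    return np.exp(-0.5 * ((v - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

for _ in range(100):
    # E-step: responsibility of each component for each data point
    dens = pi * normal_pdf(x[:, None], mu, sigma)      # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the soft assignments
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / x.size

print(np.sort(mu))  # the estimated means should land near -2 and 3
```

In practice you would iterate until the log-likelihood change falls below a tolerance rather than for a fixed count, but the fixed loop keeps the sketch short.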
1. Automation of DevOps to achieve AIOps
2. The Emergence of More Machine Learning Platforms
3. Augmented Reality
4. Agent-Based Simulations
5. IoT
6. AI Optimized Hardware
7. Natural Language Generation
8. Streaming Data Platforms
9. Driverless Vehicles
10. Conversational BI and Analytics

### Book Memo: “Reproducible Econometrics Using R”

 This book is designed to facilitate reproducibility in Econometrics. It does so by using open source software (R) and recently developed tools (R Markdown and bookdown) that allow the reader to engage in reproducible research. Illustrative examples are provided throughout, and a range of topics are covered. Assignments, exams, slides, and a solution manual are available for instructors.

### How cdata Control Table Data Transforms Work

With all of the excitement surrounding cdata style control table based data transforms (the cdata ideas being named as the “replacements” for tidyr‘s current methodology, by the tidyr authors themselves!) I thought I would take a moment to describe how they work.

cdata defines two primary data manipulation operators: rowrecs_to_blocks() and blocks_to_rowrecs(). These are the fundamental transforms that convert between data representations. The two representations it converts between are:

• A world where all facts about an instance or record are in a single row (“rowrecs”).
• A world where all facts about an instance or record are in groups of rows (“blocks”).

It turns out that once you develop the idea of specifying the data transformation as explicit data (an application of Eric S. Raymond’s admonition: “fold knowledge into data, so program logic can be stupid and robust.”), you also have a great tool for reasoning about and teaching data transforms.

For example:

rowrecs_to_blocks() does the following: for each row record, it makes a copy of the control table with the values filled in. In relational terms, rowrecs_to_blocks() is therefore a join of the data to the control table. Conversely, blocks_to_rowrecs() combines groups of rows into single rows, so in relational terms it is an aggregation or projection. If each of these operations is faithful (keeps enough information around), they are inverses of each other.
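The join/aggregation duality of the two operators can be mimicked by hand outside of R. The following pandas sketch is illustrative only (it is not the cdata API; the frame and column names are invented): the “join to the control table” direction replicates the control table once per record, and the reverse direction is a pivot (an aggregation back to one row per record).

```python
import pandas as pd

# Row-record form: all facts about an instance in a single row
rowrecs = pd.DataFrame({
    "id": [1, 2],
    "height": [170, 180],
    "weight": [65, 80],
})

# A tiny control table: one row per line of the target block,
# naming which source column each block line draws from
control = pd.DataFrame({"measure": ["height", "weight"]})

# "rowrecs_to_blocks" as a join: cross-join each record with the control
# table (pandas >= 1.2), then pull the named column's value per row
blocks = rowrecs.merge(control, how="cross")
blocks["value"] = blocks.apply(lambda r: r[r["measure"]], axis=1)
blocks = blocks[["id", "measure", "value"]]

# "blocks_to_rowrecs" as an aggregation: pivot back to one row per id
back = blocks.pivot(index="id", columns="measure", values="value").reset_index()
```

Because both steps keep the `id` key and every measured value, the round trip is faithful: `back` holds the same facts as `rowrecs`.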

We share some nifty tutorials on the ideas here:

One can build fairly clever illustrations and animations to teach the above.

The most common special cases of the above have been popularized in R as unpivot/pivot (pivot invented by Pito Salas), stack/unstack, melt/cast, or gather/spread. These special cases are handled in cdata by convenience functions unpivot_to_blocks() and pivot_to_rowrecs(). A great example of a “higher order” transform that isn’t one of the common ones is given here.

Note: the above theory and implementation is joint work of Nina Zumel and John Mount and can be found here. We would really appreciate any citations or credit you can send our way (or even politely correcting those who don’t attribute the work or attribute the work to others, as there are already a lot of mentions without credit or citation).

citation("cdata")

To cite package ‘cdata’ in publications use:

John Mount and Nina Zumel (2019). cdata: Fluid Data Transformations. https://github.com/WinVector/cdata/,
https://winvector.github.io/cdata/.

A BibTeX entry for LaTeX users is

@Manual{,
title = {cdata: Fluid Data Transformations},
author = {John Mount and Nina Zumel},
year = {2019},
note = {https://github.com/WinVector/cdata/, https://winvector.github.io/cdata/},
}


### Science and Technology links (March 23rd 2019)

1. Half of American households subscribe to “Amazon Prime”, a “club membership” for Amazon customers with monthly fees. And about half of these subscribers buy something from Amazon every week. If you are counting, this seems to imply that at least a quarter of all American households order something from Amazon every week.
2. How do the preprints that researchers post online freely differ from genuine published articles that underwent peer review? Maybe less than you’d expect:

our results show that quality of reporting in preprints in the life sciences is within a similar range as that of peer-reviewed articles

3. Very low meat consumption might increase the long-term risk of dementia and Alzheimer’s.
4. We appear to be no closer to finding a cure for Alzheimer’s despite billions being spent each year on research and clinical trials. Lowe writes:

Something is wrong with the way we’re thinking about Alzheimer’s (…) It’s been wrong for a long time and that’s been clear for a long time. Do something else.

5. Many researchers use “p values” (a statistical measure) to prove that their results are “significant”. Ioannidis argues that most research should not rely on p values.
6. Eating nuts improves cognition (nuts make you smart).
7. As we age, we become more prone to diabetes. According to an article in Nature, senescent cells in the immune system may lead to diabetes. Senescent cells are cells that should be dead due to damage or too many divisions, but refuse to die.
8. Hospitalizations for heart attacks have declined by 38% in the last 20 years and mortality is at all time low. Though clinicians and health professionals take the credit, I am not convinced we understand the source of this progress.
9. In stories, females identify more strongly with their own gender whereas males identify equally with either gender.
10. Theranos was a large company that pretended to be able to do better blood tests. The company was backed by several granted patents. Yet we know that Theranos technology did not work. The problem we are facing now is that Theranos patents, granted on false pretenses and vague claims, remain valid and will hurt genuine inventors in the future. If we are to have patents at all, they should only be granted for inventions that work. Nazer argues that the patent system is broken.
11. Smaller groups tend to create more innovative work, and larger groups less so.
12. The bones of older people become fragile. A leading cause of this problem is that the stem cells in our bones become less active. It appears that this is caused by excessive inflammation. We can create it in young mice by exposing them to the blood serum of old mice. We can also reverse it in old mice by using an anti-inflammatory drug (akin to aspirin).
13. Gene therapy helped mice regain sight lost due to retinal degeneration. It could work in human beings too.
14. Based on ecological models, scientists predicted over ten years ago that polar bear populations would soon collapse. That has not happened: there may be several times more polar bears than decades ago. It is true that ice coverage is lower than it has been historically due to climate change, but it is apparently incorrect to assume that polar bears need thick ice; they may in fact thrive when the ice is thin and the summers are long. Crockford, a zoologist at the University of Victoria, tells the tale in her book The Polar Bear Catastrophe That Never Happened.

### Yes, I really really really like fake-data simulation, and I can’t stop talking about it.

Rajesh Venkatachalapathy writes:

Recently, I had a conversation with a colleague of mine about the virtues of synthetic data and their role in data analysis. I think I’ve heard a sermon/talk or two where you mention this and also in your blog entries. But having convinced my colleague of this point, I am struggling to find good references on this topic.

I was hoping to get some leads from you.

Hi, here are some refs: from 2009, 2011, 2013, also this and this and this from 2017, and this from 2018. I think I’ve missed a few, too.

If you want something in dead-tree style, see Section 8.1 of my book with Jennifer Hill, which came out in 2007.

Or, for some classic examples, there’s Bush and Mosteller with the “stat-dogs” in 1954, and Ripley with his simulated spatial processes from, ummmm, 1987 I think it was? Good stuff, all. We should be doing more of it.
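The core move behind all of those references is the same and is easy to try yourself: simulate data from known parameters, fit the model, and check that the fit recovers the truth. A minimal sketch (Python/NumPy; the “true” values are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Known "true" parameters that the fit should recover
true_intercept, true_slope, true_sigma = 1.0, 2.5, 0.8

# Simulate fake data from the assumed linear model
n = 1000
x = rng.uniform(0, 10, n)
y = true_intercept + true_slope * x + rng.normal(0, true_sigma, n)

# Fit by ordinary least squares and compare estimates to the truth
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma_hat = resid.std(ddof=2)   # residual sd with 2 fitted coefficients

print(beta_hat, sigma_hat)  # should be close to (1.0, 2.5) and 0.8
```

If the estimates do not recover the known parameters, either the fitting code or the model specification is wrong, and you have found that out before touching real data.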

### Strength of a Lennon song exposed with R function glue::glue

(This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers)

In return, parameters sometimes give back echoes of poetry.

We could also read title of this article as “strength of an R function exposed with a Lennon song”…


### A decade of the Datablog: 'There's a human story behind every data point'

The Guardian’s data editors in the UK, US and Australia explain how their work has influenced our journalism

The Datablog was launched in March 2009, starting in a corner of the Guardian website dedicated to the publication and analysis of data. In the last decade it has published thousands of stories and datasets on every topic imaginable, from Reading the Riots to how the UK fared in every Eurovision song contest, and its influence lives on throughout our data journalism.

How did it all begin? This is what its founder, Simon Rogers, remembers: