# My Data Science Blogs

## July 22, 2019

### Finding out why

The paper demonstrates that the matching estimator is not generally consistent for the average treatment effect of the treated when the matching is done without replacement using propensity scores. To achieve consistency, practitioners must either assume that no unit exists with a propensity score greater than one-half or assume that there is no confounding among such units. Illustrations suggest that the result applies also to matching using other metrics as long as it is done without replacement.
A growing number of methods aim to assess the challenging question of treatment effect variation in observational studies. This special section of ‘Observational Studies’ reports the results of a workshop conducted at the 2018 Atlantic Causal Inference Conference designed to understand the similarities and differences across these methods. We invited eight groups of researchers to analyze a synthetic observational data set that was generated using a recent large-scale randomized trial in education. Overall, participants employed a diverse set of methods, ranging from matching and flexible outcome modeling to semiparametric estimation and ensemble approaches. While there was broad consensus on the topline estimate, there were also large differences in estimated treatment effect moderation. This highlights the fact that estimating varying treatment effects in observational studies is often more challenging than estimating the average treatment effect alone. We suggest several directions for future work arising from this workshop.
Despite significant progress in dissecting the genetic architecture of complex diseases by genome-wide association studies (GWAS), the signals identified by association analysis may not have specific pathological relevance to diseases, so that a large fraction of disease-causing genetic variants is still hidden. Association is used to measure dependence between two variables or two sets of variables. Genome-wide association studies test association between a disease and SNPs (or other genetic variants) across the genome. Association analysis may detect superficial patterns between disease and genetic variants. Association signals provide limited information on the causal mechanism of diseases. The use of association analysis as a major analytical platform for genetic studies of complex diseases is a key issue that hampers discovery of the mechanism of diseases, calling into question the ability of GWAS to identify loci underlying diseases. It is time to move beyond association analysis toward techniques enabling the discovery of the underlying causal genetic structures of complex diseases. To achieve this, we propose the concept of genome-wide causation studies (GWCS) as an alternative to GWAS and develop additive noise models (ANMs) for genetic causation analysis. Type I error rates and power of the ANMs to test for causation are presented. We conduct GWCS of schizophrenia. Both simulation and real data analysis show that the proportion of the overlapped association and causation signals is small. Thus, we hope that our analysis will stimulate discussion of GWAS and GWCS.
How long should an online article title be? There’s a blog here citing an old post from 2013 which shows a nice plot for average click-through rate (CTR) and title length.
Mutual information has been successfully adopted in filter feature-selection methods to assess both the relevancy of a subset of features in predicting the target variable and the redundancy with respect to other variables. However, existing algorithms are mostly heuristic and do not offer any guarantee on the proposed solution. In this paper, we provide novel theoretical results showing that conditional mutual information naturally arises when bounding the ideal regression/classification errors achieved by different subsets of features. Leveraging on these insights, we propose a novel stopping condition for backward and forward greedy methods which ensures that the ideal prediction error using the selected feature subset remains bounded by a user-specified threshold. We provide numerical simulations to support our theoretical claims and compare to common heuristic methods.
The assumption of positivity in causal inference (also known as common support and covariate overlap) is necessary to obtain valid causal estimates. Therefore, confirming it holds in a given dataset is an important first step of any causal analysis. Most common methods to date are insufficient for discovering non-positivity, as they do not scale for modern high-dimensional covariate spaces, or they cannot pinpoint the subpopulation violating positivity. To overcome these issues, we suggest harnessing decision trees for detecting violations. By dividing the covariate space into mutually exclusive regions, each with maximized homogeneity of treatment groups, decision trees can be used to automatically detect subspaces violating positivity. By augmenting the method with an additional random forest model, we can quantify the robustness of the violation within each subspace. This solution is scalable and provides an interpretable characterization of the subspaces in which violations occur. We provide a visualization of the stratification rules that define each subpopulation, combined with the severity of positivity violation within it. We also provide an interactive version of the visualization that allows a deeper dive into the properties of each subspace.

### The title CDO started out as a joke

How did the role of Chief Data Officer come to drive data literacy at companies around the world? Find out how it all began in this interview with the first person to hold the title at Yahoo!

### Four short links: 22 July 2019

Game Source, Procurement Graph, Data Moats, and Antitrust Regulation

1. Game Source Code -- Internet Archive has a collection of video game source code. The majority of these titles were originally released as commercial products and the source code was made available to the public at a later time.
2. European Public Procurement Knowledge Graph -- over 23 million triples (records), covering information about almost 220,000 tenders, built to support competitiveness and accountability by TheyBuyForYou. (via University of Southampton)
3. The Empty Promise of Data Moats (Andreessen-Horowitz) -- business model wonks reckoned that "data network effects" were a thing, but the benefits seen by companies claiming data network effects seem to be the benefits of simply having a lot of data. And that's not as defensible as hoped. I liked this essay.
4. Why Big Tech Keeps Outsmarting Antitrust Regulators (Tim O'Reilly) -- designers of marketplace-platform algorithms and screen layouts can arbitrarily allocate value to whom they choose. The marketplace is designed and controlled by its owners, and that design shapes “who gets what and why.” [...] Power over sellers ultimately translates into power over customers as well. When it comes to antitrust, the question of market power must be answered by analyzing the effect of these marketplace designs on both buyers and sellers, and how they change over time. How much of the value goes to the platform, how much to consumers, and how much to suppliers?

### What’s the Best Data Strategy for Enterprises: Build, buy, partner or acquire?

Every large organization is investing heavily in building data solutions and tools. Many are building these solutions from scratch when they could be taking advantage of readily available tools, re-inventing the wheel and wasting resources.

### Keras learning rate schedules and decay

In this tutorial, you will learn about learning rate schedules and decay using Keras. You’ll learn how to use Keras’ standard learning rate decay along with step-based, linear, and polynomial learning rate schedules.

When training a neural network, the learning rate is often the most important hyperparameter for you to tune:

• Too small a learning rate and your neural network may not learn at all
• Too large a learning rate and you may overshoot areas of low loss (or even overfit from the start of training)

When it comes to training a neural network, the most bang for your buck (in terms of accuracy) is going to come from selecting the correct learning rate and appropriate learning rate schedule.

But that’s easier said than done.

To help deep learning practitioners such as yourself learn how to assess a problem and choose an appropriate learning rate, we’ll be starting a series of tutorials on learning rate schedules, decay, and hyperparameter tuning with Keras.

By the end of this series, you’ll have a good understanding of how to appropriately and effectively apply learning rate schedules with Keras to your own deep learning projects.

To learn how to use Keras for learning rate schedules and decay, just keep reading.


## Keras learning rate schedules and decay

In the first part of this guide, we’ll discuss why the learning rate is the most important hyperparameter when it comes to training your own deep neural networks.

We’ll then dive into why we may want to adjust our learning rate during training.

From there I’ll show you how to implement and utilize a number of learning rate schedules with Keras, including:

• The decay schedule built into most Keras optimizers
• Step-based learning rate schedules
• Linear learning rate decay
• Polynomial learning rate schedules

We’ll then perform a number of experiments on the CIFAR-10 dataset using these learning rate schedules and evaluate which one performs best.

These sets of experiments will serve as a template you can use when exploring your own deep learning projects and selecting an appropriate learning rate and learning rate schedule.

### Why adjust our learning rate and use learning rate schedules?

To see why learning rate schedules are a worthwhile method to apply to help increase model accuracy and descend into areas of lower loss, consider the standard weight update formula used by nearly all neural networks:

$W += -\alpha * gradient$

Recall that the learning rate, $\alpha$, controls the “step” we make along the gradient. Larger values of $\alpha$ imply that we are taking bigger steps, while smaller values of $\alpha$ make tiny steps. If $\alpha$ is zero, the network cannot make any steps at all (since the gradient multiplied by zero is zero).
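As a framework-free sketch of this update rule, here is a single gradient descent step in plain Python on a toy quadratic loss (the loss function and the values of $\alpha$ are illustrative, not from this tutorial):

```python
# one gradient descent step, W += -alpha * gradient, on the toy loss
# L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3)
def step(w, alpha):
    gradient = 2.0 * (w - 3.0)
    return w + (-alpha * gradient)

# larger alpha -> a bigger step toward the minimum at w = 3;
# alpha = 0 -> no step at all
for alpha in (0.5, 0.1, 0.0):
    print(alpha, step(0.0, alpha))
```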

Most (though not all) initial learning rates you encounter fall in the set $\alpha \in \{10^{-1}, 10^{-2}, 10^{-3}\}$.

A network is then trained for a fixed number of epochs without changing the learning rate.

This method may work well in some situations, but it’s often beneficial to decrease our learning rate over time. When training our network, we are trying to find some location along our loss landscape where the network obtains reasonable accuracy. It doesn’t have to be a global minimum or even a local minimum; in practice, simply finding an area of the loss landscape with reasonably low loss is “good enough”.

If we keep our learning rate constantly high, we could overshoot these areas of low loss, as we’ll be taking steps that are too large to descend into them.

Instead, what we can do is decrease our learning rate, thereby allowing our network to take smaller steps. This decreased learning rate enables the network to descend into areas of the loss landscape that are “more optimal” and would have otherwise been missed entirely at our initial learning rate.

We can, therefore, view the process of learning rate scheduling as:

1. Finding a set of reasonably “good” weights early in the training process with a larger learning rate.
2. Tuning these weights later in the process to find more optimal weights using a smaller learning rate.
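The two phases above can be sketched as a trivial piecewise schedule (the epoch counts and rates here are made-up illustrations, not the schedules implemented later in this tutorial):

```python
# illustrative two-phase schedule: a large "coarse search" learning rate
# for the first half of training, then a small "fine tuning" rate
def two_phase_lr(epoch, total_epochs=100, coarse=1e-1, fine=1e-3):
    # phase 1: find a reasonably good region of the loss landscape
    if epoch < total_epochs // 2:
        return coarse
    # phase 2: settle into that region with much smaller steps
    return fine

print([two_phase_lr(e) for e in (0, 49, 50, 99)])
```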

We’ll be covering some of the most popular learning rate schedules in this tutorial.

### Project structure

You can use the tree command to inspect the project folder:

```
$ tree
.
├── output
│   ├── lr_linear_schedule.png
│   ├── lr_poly_schedule.png
│   ├── lr_step_schedule.png
│   ├── train_linear_schedule.png
│   ├── train_no_schedule.png
│   ├── train_poly_schedule.png
│   ├── train_standard_schedule.png
│   └── train_step_schedule.png
├── pyimagesearch
│   ├── __init__.py
│   ├── learning_rate_schedulers.py
│   └── resnet.py
└── train.py

2 directories, 12 files
```

Our output/ directory will contain learning rate and training history plots. The five experiments included in the results section correspond to the five plots with the train_*.png filenames, respectively.

The pyimagesearch module contains our ResNet CNN and our learning_rate_schedulers.py. The LearningRateDecay parent class simply includes a method called plot for plotting each of our types of learning rate decay. Also included are the subclasses StepDecay and PolynomialDecay, which calculate the learning rate upon the completion of each epoch. Both of these classes contain the plot method via inheritance (an object-oriented concept).

Our training script, train.py, will train ResNet on the CIFAR-10 dataset. We’ll run the script without learning rate decay as well as with standard, linear, step-based, and polynomial learning rate decay.

### The standard “decay” schedule in Keras

The Keras library ships with a time-based learning rate scheduler; it is controlled via the decay parameter of the optimizer class (such as SGD, Adam, etc.).

To discover how we can utilize this type of learning rate decay, let’s take a look at an example of how we may initialize the ResNet architecture and the SGD optimizer:

```python
# initialize our optimizer and model, then compile it
opt = SGD(lr=1e-2, momentum=0.9, decay=1e-2 / epochs)
model = ResNet.build(32, 32, 3, 10, (9, 9, 9),
    (64, 64, 128, 256), reg=0.0005)
model.compile(loss="categorical_crossentropy", optimizer=opt,
    metrics=["accuracy"])
```

Here we initialize our SGD optimizer with an initial learning rate of 1e-2.
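To see numerically what that decay parameter does, here is a plain-Python sketch of Keras’ time-based decay formula, lr = init_lr * 1.0 / (1.0 + decay * iterations), using an initial rate of 1e-2 and a 40-epoch run as an illustration (for CIFAR-10 with a batch size of 64, one epoch is 782 batch updates):

```python
# plain-Python sketch of Keras' time-based decay:
#   lr = init_lr * 1.0 / (1.0 + decay * iterations)
# where `iterations` counts batch updates, not epochs
init_lr = 1e-2
decay = 1e-2 / 40  # initial learning rate / total epochs

def time_based_lr(iterations):
    return init_lr * 1.0 / (1.0 + decay * iterations)

# CIFAR-10: 50,000 images / batch size 64 -> 782 updates per epoch
updates_per_epoch = 782
for epoch in range(4):
    print(epoch, round(time_based_lr(epoch * updates_per_epoch), 5))
```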
We then set our decay to be the learning rate divided by the total number of epochs we are training the network for (a common rule of thumb).

Internally, Keras applies the following learning rate schedule to adjust the learning rate after every batch update; it is a misconception that Keras applies the standard decay once per epoch. Keep this in mind when using the default learning rate scheduler supplied with Keras. The update formula is:

$lr = init\_lr \times \frac{1.0}{1.0 + decay \times iterations}$

Using the CIFAR-10 dataset as an example, we have a total of 50,000 training images. If we use a batch size of 64, that implies there are a total of $\lceil 50000 / 64 \rceil = 782$ steps per epoch. Therefore, a total of 782 weight updates are applied before an epoch completes.

To see an example of the learning rate schedule calculation, let’s assume our initial learning rate is $\alpha = 0.01$ and our $decay = \frac{0.01}{40}$ (with the assumption that we are training for forty epochs).

The learning rate at step zero, before any learning rate schedule has been applied, is:

$lr = 0.01 \times \frac{1.0}{1.0 + 0.00025 \times (0 \times 782)} = 0.01$

At the beginning of epoch one we can see the following learning rate:

$lr = 0.01 \times \frac{1.0}{1.0 + 0.00025 \times (1 \times 782)} = 0.00836$

Figure 1 below continues the calculation of Keras’ standard learning rate decay with $\alpha = 0.01$ and a decay of $\frac{0.01}{40}$:

Figure 1: Keras’ standard learning rate decay table.

You’ll learn how to utilize this type of learning rate decay in the “Implementing our training script” and “Keras learning rate schedule results” sections of this post, respectively.

### Our LearningRateDecay class

In the remainder of this tutorial, we’ll be implementing our own custom learning rate schedules and then incorporating them with Keras when training our neural networks.
To keep our code neat and tidy, and not to mention follow object-oriented programming best practices, let’s first define a base LearningRateDecay class that we’ll subclass for each respective learning rate schedule.

Open up learning_rate_schedulers.py in your directory structure and insert the following code:

```python
# import the necessary packages
import matplotlib.pyplot as plt
import numpy as np

class LearningRateDecay:
    def plot(self, epochs, title="Learning Rate Schedule"):
        # compute the set of learning rates for each corresponding
        # epoch
        lrs = [self(i) for i in epochs]

        # plot the learning rate schedule
        plt.style.use("ggplot")
        plt.figure()
        plt.plot(epochs, lrs)
        plt.title(title)
        plt.xlabel("Epoch #")
        plt.ylabel("Learning Rate")
```

Each learning rate schedule we implement will have a plot function, enabling us to visualize our learning rate over time.

With our base LearningRateDecay class implemented, let’s move on to creating a step-based learning rate schedule.

### Step-based learning rate schedules with Keras

Figure 2: Keras learning rate step-based decay. The schedule in red is a decay factor of 0.5 and blue is a factor of 0.25.

One popular learning rate scheduler is step-based decay, where we systematically drop the learning rate after specific epochs during training.

The step decay learning rate scheduler can be seen as a piecewise function, as visualized in Figure 2: the learning rate is constant for a number of epochs, then drops, is constant once more, then drops again, and so on.

When applying step decay to our learning rate, we have two options:

1. Define an equation that models the piecewise drop in learning rate that we wish to achieve.
2. Use what I call the ctrl + c method to train a deep neural network. Here we train for some number of epochs at a given learning rate, eventually notice validation performance stagnating/stalling, then ctrl + c to stop the script, adjust our learning rate, and continue training.
We’ll primarily be focusing on the equation-based piecewise drop in this post. The ctrl + c method is a bit more advanced and is normally applied to larger datasets using deeper neural networks, where the exact number of epochs required to obtain a reasonable model is unknown. If you’d like to learn more about the ctrl + c method of training, please refer to Deep Learning for Computer Vision with Python.

When applying step decay, we often drop our learning rate by either (1) half or (2) an order of magnitude after every fixed number of epochs. For example, let’s suppose our initial learning rate is $\alpha = 0.01$. After 10 epochs we drop the learning rate to $\alpha = 0.005$. After another 10 epochs (i.e., the 20th total epoch), $\alpha$ is dropped by a factor of 0.5 again, such that $\alpha = 0.0025$, and so on. In fact, this is the exact same learning rate schedule that is depicted in Figure 2 (red line). The blue line displays a more aggressive drop factor of 0.25.

Modeled mathematically, we can define our step-based decay equation as:

$\alpha_{E + 1} = \alpha_{I} \times F^{\lfloor (1 + E) / D \rfloor}$

Where $\alpha_{I}$ is the initial learning rate, $F$ is the factor value controlling the rate at which the learning rate drops, $D$ is the “drop every” epochs value, and $E$ is the current epoch. The larger our factor $F$ is, the slower the learning rate will decay; conversely, the smaller the factor $F$, the faster the learning rate will decay.

All that said, let’s go ahead and implement our StepDecay class now.
Go back to your learning_rate_schedulers.py file and insert the following code:

```python
class StepDecay(LearningRateDecay):
    def __init__(self, initAlpha=0.01, factor=0.25, dropEvery=10):
        # store the base initial learning rate, drop factor, and
        # epochs to drop every
        self.initAlpha = initAlpha
        self.factor = factor
        self.dropEvery = dropEvery

    def __call__(self, epoch):
        # compute the learning rate for the current epoch
        exp = np.floor((1 + epoch) / self.dropEvery)
        alpha = self.initAlpha * (self.factor ** exp)

        # return the learning rate
        return float(alpha)
```

Line 20 defines the constructor to our StepDecay class. We then store the initial learning rate (initAlpha), drop factor, and dropEvery epochs values (Lines 23-25).

The __call__ function:

• Accepts the current epoch number.
• Computes the learning rate based on the step-based decay formula detailed above (Lines 29 and 30).
• Returns the computed learning rate for the current epoch (Line 33).

You’ll see how to use this learning rate schedule later in this post.

### Linear and polynomial learning rate schedules in Keras

Two of my favorite learning rate schedules are linear learning rate decay and polynomial learning rate decay.

Using these methods, our learning rate is decayed to zero over a fixed number of epochs. The rate at which the learning rate decays is based on the parameters of the polynomial function: a smaller exponent/power will cause the learning rate to decay “more slowly”, whereas larger exponents decay the learning rate “more quickly”.
Conveniently, both of these methods can be implemented in a single class:

```python
class PolynomialDecay(LearningRateDecay):
    def __init__(self, maxEpochs=100, initAlpha=0.01, power=1.0):
        # store the maximum number of epochs, base learning rate,
        # and power of the polynomial
        self.maxEpochs = maxEpochs
        self.initAlpha = initAlpha
        self.power = power

    def __call__(self, epoch):
        # compute the new learning rate based on polynomial decay
        decay = (1 - (epoch / float(self.maxEpochs))) ** self.power
        alpha = self.initAlpha * decay

        # return the new learning rate
        return float(alpha)
```

Line 36 defines the constructor to our PolynomialDecay class, which requires three values:

• maxEpochs: The total number of epochs we’ll be training for.
• initAlpha: The initial learning rate.
• power: The power/exponent of the polynomial.

Note that if you set power=1.0 you have a linear learning rate decay.

Lines 45 and 46 compute the adjusted learning rate for the current epoch, while Line 49 returns the new learning rate.

### Implementing our training script

Now that we’ve implemented a few different Keras learning rate schedules, let’s see how we can use them inside an actual training script.

Create a file named train.py in your editor and insert the following code:

```python
# set the matplotlib backend so figures can be saved in the background
import matplotlib
matplotlib.use("Agg")

# import the necessary packages
from pyimagesearch.learning_rate_schedulers import StepDecay
from pyimagesearch.learning_rate_schedulers import PolynomialDecay
from pyimagesearch.resnet import ResNet
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report
from keras.callbacks import LearningRateScheduler
from keras.optimizers import SGD
from keras.datasets import cifar10
import matplotlib.pyplot as plt
import numpy as np
import argparse
```

Lines 2-16 import required packages. Line 3 sets the matplotlib backend so that we can save plots as image files.
Our most notable imports include:

• StepDecay: Our class which calculates and plots step-based learning rate decay.
• PolynomialDecay: The class we wrote to calculate polynomial-based learning rate decay.
• ResNet: Our Convolutional Neural Network implemented in Keras.
• LearningRateScheduler: A Keras callback. We’ll pass our learning rate schedule to this class, which will be called as a callback at the completion of each epoch to calculate our learning rate.

Let’s move on and parse our command line arguments:

```python
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-s", "--schedule", type=str, default="",
    help="learning rate schedule method")
ap.add_argument("-e", "--epochs", type=int, default=100,
    help="# of epochs to train for")
ap.add_argument("-l", "--lr-plot", type=str, default="lr.png",
    help="path to output learning rate plot")
ap.add_argument("-t", "--train-plot", type=str, default="training.png",
    help="path to output training plot")
args = vars(ap.parse_args())
```

Our script accepts any of four command line arguments when the script is called via the terminal:

• --schedule: The learning rate schedule method. Valid options are “standard”, “step”, “linear”, and “poly”. By default, no learning rate schedule will be used.
• --epochs: The number of epochs to train for (default=100).
• --lr-plot: The path to the output learning rate plot. I suggest overriding the default of lr.png with a more descriptive path + filename.
• --train-plot: The path to the output accuracy/loss training history plot. Again, I suggest a descriptive path + filename, otherwise training.png will be used by default.
With our imports and command line arguments in hand, now it’s time to initialize our learning rate schedule:

```python
# store the number of epochs to train for in a convenience variable,
# then initialize the list of callbacks and learning rate scheduler
# to be used
epochs = args["epochs"]
callbacks = []
schedule = None

# check to see if step-based learning rate decay should be used
if args["schedule"] == "step":
    print("[INFO] using 'step-based' learning rate decay...")
    schedule = StepDecay(initAlpha=1e-1, factor=0.25, dropEvery=15)

# check to see if linear learning rate decay should be used
elif args["schedule"] == "linear":
    print("[INFO] using 'linear' learning rate decay...")
    schedule = PolynomialDecay(maxEpochs=epochs, initAlpha=1e-1, power=1)

# check to see if a polynomial learning rate decay should be used
elif args["schedule"] == "poly":
    print("[INFO] using 'polynomial' learning rate decay...")
    schedule = PolynomialDecay(maxEpochs=epochs, initAlpha=1e-1, power=5)

# if the learning rate schedule is not empty, add it to the list of
# callbacks
if schedule is not None:
    callbacks = [LearningRateScheduler(schedule)]
```

Line 33 sets the number of epochs we will train for directly from the command line args variable. From there we initialize our callbacks list and learning rate schedule (Lines 34 and 35).

Lines 38-50 then select the learning rate schedule if args["schedule"] contains a valid value:

• "step": Initializes StepDecay.
• "linear": Initializes PolynomialDecay with power=1, indicating that a linear learning rate decay will be utilized.
• "poly": PolynomialDecay with power=5 will be used.

After you’ve reproduced the results of the experiments in this tutorial, be sure to revisit Lines 38-50 and insert additional elif statements of your own so you can run some of your own experiments!

Lines 54 and 55 initialize the LearningRateScheduler with the schedule as a single callback in the callbacks list.
There is also the case where no learning rate decay will be used (i.e., when the --schedule command line argument is not supplied when the script is executed).

Let’s go ahead and load our data:

```python
# load the training and testing data, then scale it into the
# range [0, 1]
print("[INFO] loading CIFAR-10 data...")
((trainX, trainY), (testX, testY)) = cifar10.load_data()
trainX = trainX.astype("float") / 255.0
testX = testX.astype("float") / 255.0

# convert the labels from integers to vectors
lb = LabelBinarizer()
trainY = lb.fit_transform(trainY)
testY = lb.transform(testY)

# initialize the label names for the CIFAR-10 dataset
labelNames = ["airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck"]
```

Line 60 loads our CIFAR-10 data. The dataset is conveniently already split into training and testing sets.

The only preprocessing we must perform is to scale the data into the range [0, 1] (Lines 61 and 62). Lines 65-67 binarize the labels, and then Lines 70 and 71 initialize our labelNames (i.e., classes). Do not add to or alter the labelNames list, as the order and length of the list matter.

Let’s initialize our decay parameter:

```python
# initialize the decay for the optimizer
decay = 0.0

# if we are using Keras' "standard" decay, then we need to set the
# decay parameter
if args["schedule"] == "standard":
    print("[INFO] using 'keras standard' learning rate decay...")
    decay = 1e-1 / epochs

# otherwise, no learning rate schedule is being used
elif schedule is None:
    print("[INFO] no learning rate schedule being used")
```

Line 74 initializes our learning rate decay. If we’re using the "standard" learning rate decay schedule, then the decay is initialized as 1e-1 / epochs (Lines 78-80).
With all of our initializations taken care of, let’s go ahead and compile + train our ResNet model:

```python
# initialize our optimizer and model, then compile it
opt = SGD(lr=1e-1, momentum=0.9, decay=decay)
model = ResNet.build(32, 32, 3, 10, (9, 9, 9),
    (64, 64, 128, 256), reg=0.0005)
model.compile(loss="categorical_crossentropy", optimizer=opt,
    metrics=["accuracy"])

# train the network
H = model.fit(trainX, trainY, validation_data=(testX, testY),
    batch_size=128, epochs=epochs, callbacks=callbacks, verbose=1)
```

Our Stochastic Gradient Descent (SGD) optimizer is initialized on Line 87 using our decay.

From there, Lines 88 and 89 build our ResNet CNN with an input shape of 32x32x3 and 10 classes. For an in-depth review of ResNet, be sure to refer to Chapter 10: ResNet of Deep Learning for Computer Vision with Python.

Our model is compiled with a loss function of "categorical_crossentropy" since our dataset has more than 2 classes. If you use a different dataset with only 2 classes, be sure to use loss="binary_crossentropy".

Lines 94 and 95 kick off our training process. Notice that we’ve provided the callbacks as a parameter. The callbacks will be called when each epoch is completed; our LearningRateScheduler contained therein will handle our learning rate decay (so long as callbacks isn’t an empty list).
Finally, let’s evaluate our network and generate plots:

```python
# evaluate the network
print("[INFO] evaluating network...")
predictions = model.predict(testX, batch_size=128)
print(classification_report(testY.argmax(axis=1),
    predictions.argmax(axis=1), target_names=labelNames))

# plot the training loss and accuracy
N = np.arange(0, args["epochs"])
plt.style.use("ggplot")
plt.figure()
plt.plot(N, H.history["loss"], label="train_loss")
plt.plot(N, H.history["val_loss"], label="val_loss")
plt.plot(N, H.history["acc"], label="train_acc")
plt.plot(N, H.history["val_acc"], label="val_acc")
plt.title("Training Loss and Accuracy on CIFAR-10")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()
plt.savefig(args["train_plot"])

# if the learning rate schedule is not empty, then save the learning
# rate plot
if schedule is not None:
    schedule.plot(N)
    plt.savefig(args["lr_plot"])
```

Lines 99-101 evaluate our network and print a classification report to our terminal.

Lines 104-115 generate and save our training history plot (accuracy/loss curves). Lines 119-121 generate and save a learning rate schedule plot, if applicable. We will inspect these plot visualizations in the next section.

### Keras learning rate schedule results

With both our (1) learning rate schedules and (2) training script implemented, let’s run some experiments to see which learning rate schedule performs best given:

1. An initial learning rate of 1e-1
2. Training for a total of 100 epochs

#### Experiment #1: No learning rate decay/schedule

As a baseline, let’s first train our ResNet model on CIFAR-10 with no learning rate decay or schedule:

```
$ python train.py --train-plot output/train_no_schedule.png
[INFO] no learning rate schedule being used
Train on 50000 samples, validate on 10000 samples
Epoch 1/100
50000/50000 [==============================] - 186s 4ms/step - loss: 2.1204 - acc: 0.4372 - val_loss: 1.9361 - val_acc: 0.5118
Epoch 2/100
50000/50000 [==============================] - 171s 3ms/step - loss: 1.5150 - acc: 0.6440 - val_loss: 1.5013 - val_acc: 0.6413
Epoch 3/100
50000/50000 [==============================] - 171s 3ms/step - loss: 1.2186 - acc: 0.7369 - val_loss: 1.2288 - val_acc: 0.7315
...
Epoch 98/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.5220 - acc: 0.9568 - val_loss: 1.0223 - val_acc: 0.8372
Epoch 99/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.5349 - acc: 0.9532 - val_loss: 1.0423 - val_acc: 0.8230
Epoch 100/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.5209 - acc: 0.9579 - val_loss: 0.9883 - val_acc: 0.8421
[INFO] evaluating network...
              precision    recall  f1-score   support

    airplane       0.84      0.86      0.85      1000
  automobile       0.90      0.93      0.92      1000
        bird       0.83      0.74      0.78      1000
         cat       0.67      0.79      0.73      1000
        deer       0.78      0.88      0.83      1000
         dog       0.85      0.69      0.76      1000
        frog       0.85      0.89      0.87      1000
       horse       0.94      0.82      0.88      1000
        ship       0.91      0.90      0.90      1000
       truck       0.90      0.90      0.90      1000

   micro avg       0.84      0.84      0.84     10000
   macro avg       0.85      0.84      0.84     10000
weighted avg       0.85      0.84      0.84     10000
```

Figure 3: Our first experiment for training ResNet on CIFAR-10 does not have learning rate decay.

Here we obtain ~85% accuracy, but as we can see, validation loss and accuracy stagnate past epoch ~15 and do not improve over the rest of the 100 epochs.

Our goal is now to utilize learning rate scheduling to beat our 85% accuracy (without overfitting).

#### Experiment #2: Keras standard optimizer learning rate decay

In our second experiment we are going to use Keras’ standard decay-based learning rate schedule:

$ python train.py --schedule standard --train-plot output/train_standard_schedule.png
[INFO] loading CIFAR-10 data...
[INFO] using 'keras standard' learning rate decay...
Train on 50000 samples, validate on 10000 samples
Epoch 1/100
50000/50000 [==============================] - 184s 4ms/step - loss: 2.1074 - acc: 0.4460 - val_loss: 1.8397 - val_acc: 0.5334
Epoch 2/100
50000/50000 [==============================] - 171s 3ms/step - loss: 1.5068 - acc: 0.6516 - val_loss: 1.5099 - val_acc: 0.6663
Epoch 3/100
50000/50000 [==============================] - 171s 3ms/step - loss: 1.2097 - acc: 0.7512 - val_loss: 1.2928 - val_acc: 0.7176
...
Epoch 98/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.1752 - acc: 1.0000 - val_loss: 0.8892 - val_acc: 0.8209
Epoch 99/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.1746 - acc: 1.0000 - val_loss: 0.8923 - val_acc: 0.8204
Epoch 100/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.1740 - acc: 1.0000 - val_loss: 0.8924 - val_acc: 0.8208
[INFO] evaluating network...
precision    recall  f1-score   support

airplane       0.81      0.86      0.84      1000
automobile       0.91      0.91      0.91      1000
bird       0.75      0.71      0.73      1000
cat       0.68      0.65      0.66      1000
deer       0.78      0.81      0.79      1000
dog       0.77      0.74      0.75      1000
frog       0.83      0.88      0.85      1000
horse       0.86      0.87      0.86      1000
ship       0.90      0.90      0.90      1000
truck       0.90      0.88      0.89      1000

micro avg       0.82      0.82      0.82     10000
macro avg       0.82      0.82      0.82     10000
weighted avg       0.82      0.82      0.82     10000

Figure 4: Our second learning rate decay schedule experiment uses Keras’ standard learning rate decay schedule.

This time we only obtain 82% accuracy, which goes to show that learning rate decay/scheduling will not always improve your results! You need to be careful about which learning rate schedule you utilize.
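For intuition, Keras’ standard optimizer decay shrinks the learning rate once per batch (not per epoch) according to lr = initial_lr * 1 / (1 + decay * iterations). The sketch below is a hypothetical helper, not the tutorial’s code; setting decay to the initial learning rate divided by the total number of epochs is a common heuristic assumed here:

```python
# Sketch of Keras' standard time-based learning rate decay. Keras applies
# this update once per *batch* (iteration), not once per epoch.
def standard_decay_lr(initial_lr, decay, iteration):
    """Effective learning rate after `iteration` batch updates."""
    return initial_lr * (1.0 / (1.0 + decay * iteration))

initial_lr = 1e-2
decay = initial_lr / 100  # common heuristic: initial lr / total epochs

print(standard_decay_lr(initial_lr, decay, 0))      # first batch: full 1e-2
print(standard_decay_lr(initial_lr, decay, 10000))  # roughly halved by batch 10,000
```

Because the decay is applied per batch, the effective rate falls smoothly rather than in discrete steps, which is why its training curves lack the "stair-step" pattern of the schedules below.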
#### Experiment #3: Step-based learning rate schedule results

Let’s go ahead and perform step-based learning rate scheduling, which will drop our learning rate by a factor of 0.25 every 15 epochs:

$ python train.py --schedule step --lr-plot output/lr_step_schedule.png --train-plot output/train_step_schedule.png
[INFO] using 'step-based' learning rate decay...
Train on 50000 samples, validate on 10000 samples
Epoch 1/100
50000/50000 [==============================] - 186s 4ms/step - loss: 2.2839 - acc: 0.4328 - val_loss: 1.8936 - val_acc: 0.5530
Epoch 2/100
50000/50000 [==============================] - 171s 3ms/step - loss: 1.6425 - acc: 0.6213 - val_loss: 1.4599 - val_acc: 0.6749
Epoch 3/100
50000/50000 [==============================] - 171s 3ms/step - loss: 1.2971 - acc: 0.7177 - val_loss: 1.3298 - val_acc: 0.6953
...
Epoch 98/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.1817 - acc: 1.0000 - val_loss: 0.7221 - val_acc: 0.8653
Epoch 99/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.1817 - acc: 1.0000 - val_loss: 0.7228 - val_acc: 0.8661
Epoch 100/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.1817 - acc: 1.0000 - val_loss: 0.7267 - val_acc: 0.8652
[INFO] evaluating network...
precision    recall  f1-score   support

airplane       0.86      0.89      0.87      1000
automobile       0.94      0.93      0.94      1000
bird       0.83      0.80      0.81      1000
cat       0.75      0.73      0.74      1000
deer       0.82      0.87      0.84      1000
dog       0.82      0.77      0.79      1000
frog       0.89      0.90      0.90      1000
horse       0.91      0.90      0.90      1000
ship       0.93      0.93      0.93      1000
truck       0.90      0.93      0.92      1000

micro avg       0.87      0.87      0.87     10000
macro avg       0.86      0.87      0.86     10000
weighted avg       0.86      0.87      0.86     10000

Figure 5: Experiment #3 demonstrates a step-based learning rate schedule (left). The training history accuracy/loss curves are shown on the right.

Figure 5 (left) visualizes our learning rate schedule. Notice how after every 15 epochs our learning rate drops, creating the “stair-step”-like effect.

Figure 5 (right) demonstrates the classic signs of step-based learning rate scheduling — you can clearly see our:

1. Training/validation loss decrease
2. Training/validation accuracy increase

…when our learning rate is dropped.

This is especially pronounced in the first two drops (epochs 15 and 30), after which the drops become less substantial.

This type of steep drop is a classic sign of a step-based learning rate schedule being utilized — if you see that type of training behavior in a paper, publication, or another tutorial, you can be almost sure that they used step-based decay!

Getting back to our accuracy, we’re now at 86-87% accuracy, an improvement from our first experiment.
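The step schedule used in this experiment (drop by a factor of 0.25 every 15 epochs) can be sketched as a standalone function. This is a hypothetical helper with assumed parameter names, not the tutorial’s exact implementation, and it assumes the 1e-2 initial rate used throughout:

```python
import math

# Step-based decay: hold the learning rate constant, then multiply it by
# `factor` every `drop_every` epochs, producing the "stair-step" curve.
def step_decay(epoch, initial_lr=1e-2, factor=0.25, drop_every=15):
    exponent = math.floor((1 + epoch) / drop_every)
    return initial_lr * (factor ** exponent)

print(step_decay(0))   # epochs 0-13:  0.01
print(step_decay(14))  # first drop:   0.0025
print(step_decay(29))  # second drop:  0.000625
```

Each drop is a sudden change in step size for the optimizer, which is exactly what produces the abrupt loss/accuracy improvements visible in Figure 5.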

#### Experiment #4: Linear learning rate schedule results

Let’s try using a linear learning rate schedule with Keras by setting power=1.0:

$ python train.py --schedule linear --lr-plot output/lr_linear_schedule.png --train-plot output/train_linear_schedule.png
[INFO] using 'linear' learning rate decay...
[INFO] loading CIFAR-10 data...
Epoch 1/100
50000/50000 [==============================] - 187s 4ms/step - loss: 2.0399 - acc: 0.4541 - val_loss: 1.6900 - val_acc: 0.5789
Epoch 2/100
50000/50000 [==============================] - 171s 3ms/step - loss: 1.4623 - acc: 0.6588 - val_loss: 1.4535 - val_acc: 0.6557
Epoch 3/100
50000/50000 [==============================] - 171s 3ms/step - loss: 1.1790 - acc: 0.7480 - val_loss: 1.2633 - val_acc: 0.7230
...
Epoch 98/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.1025 - acc: 1.0000 - val_loss: 0.5623 - val_acc: 0.8804
Epoch 99/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.1021 - acc: 1.0000 - val_loss: 0.5636 - val_acc: 0.8800
Epoch 100/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.1019 - acc: 1.0000 - val_loss: 0.5622 - val_acc: 0.8808
[INFO] evaluating network...
precision    recall  f1-score   support

airplane       0.88      0.91      0.89      1000
automobile       0.94      0.94      0.94      1000
bird       0.84      0.81      0.82      1000
cat       0.78      0.76      0.77      1000
deer       0.86      0.90      0.88      1000
dog       0.84      0.80      0.82      1000
frog       0.90      0.92      0.91      1000
horse       0.91      0.91      0.91      1000
ship       0.93      0.94      0.93      1000
truck       0.93      0.93      0.93      1000

micro avg       0.88      0.88      0.88     10000
macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000

Figure 6: Linear learning rate decay (left) applied to ResNet on CIFAR-10 over 100 epochs with Keras. The training accuracy/loss curve is displayed on the right.

Figure 6 (left) shows that our learning rate is decreasing linearly over time while Figure 6 (right) visualizes our training history.

We’re now seeing a sharper drop in both training and validation loss, especially past approximately epoch 75; however, note that our training loss is dropping significantly faster than our validation loss — we may be at risk of overfitting.
Regardless, we are now obtaining 88% accuracy on our data, our best result thus far.

#### Experiment #5: Polynomial learning rate schedule results

As a final experiment, let’s apply polynomial learning rate scheduling with Keras by setting power=5:

$ python train.py --schedule poly --lr-plot output/lr_poly_schedule.png --train-plot output/train_poly_schedule.png
[INFO] using 'polynomial' learning rate decay...
Epoch 1/100
50000/50000 [==============================] - 186s 4ms/step - loss: 2.0470 - acc: 0.4445 - val_loss: 1.7379 - val_acc: 0.5576
Epoch 2/100
50000/50000 [==============================] - 171s 3ms/step - loss: 1.4793 - acc: 0.6448 - val_loss: 1.4536 - val_acc: 0.6513
Epoch 3/100
50000/50000 [==============================] - 171s 3ms/step - loss: 1.2080 - acc: 0.7332 - val_loss: 1.2363 - val_acc: 0.7183
...
Epoch 98/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.1547 - acc: 1.0000 - val_loss: 0.6960 - val_acc: 0.8581
Epoch 99/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.1547 - acc: 1.0000 - val_loss: 0.6883 - val_acc: 0.8596
Epoch 100/100
50000/50000 [==============================] - 171s 3ms/step - loss: 0.1548 - acc: 1.0000 - val_loss: 0.6942 - val_acc: 0.8601
[INFO] evaluating network...
precision    recall  f1-score   support

airplane       0.86      0.89      0.87      1000
automobile       0.94      0.94      0.94      1000
bird       0.78      0.80      0.79      1000
cat       0.75      0.70      0.73      1000
deer       0.83      0.86      0.84      1000
dog       0.81      0.78      0.79      1000
frog       0.86      0.91      0.89      1000
horse       0.92      0.88      0.90      1000
ship       0.94      0.92      0.93      1000
truck       0.91      0.92      0.91      1000

micro avg       0.86      0.86      0.86     10000
macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000

Figure 7: Polynomial-based learning decay results using Keras.

Figure 7 (left) visualizes the fact that our learning rate is now decaying according to our polynomial function while Figure 7 (right) plots our training history.

This time we obtain ~86% accuracy.
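Both the linear schedule of Experiment #4 and the polynomial schedule of Experiment #5 are instances of the same decay rule, lr = initial_lr * (1 - epoch / max_epochs) ** power, where power=1.0 gives the straight line and power=5.0 the steeper polynomial curve. A minimal sketch (hypothetical helper name, assuming the 1e-2 initial rate and 100 epochs used here):

```python
# Polynomial decay: the learning rate falls from initial_lr to zero over
# max_epochs; `power` controls the shape of the curve.
def poly_decay(epoch, initial_lr=1e-2, max_epochs=100, power=1.0):
    return initial_lr * (1 - epoch / float(max_epochs)) ** power

print(poly_decay(50, power=1.0))  # halfway, linear:       0.005
print(poly_decay(50, power=5.0))  # halfway, 5th power:    0.0003125
```

Note how at the halfway point the power=5.0 schedule has already shed most of its learning rate, which is consistent with the steep early decay visible in Figure 7 (left).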

### Commentary on learning rate schedule experiments

Our best experiment was from our fourth experiment where we utilized a linear learning rate schedule.

But does that mean we should always use a linear learning rate schedule?

No, far from it, actually.

The key takeaway here is that for this:

• Particular dataset (CIFAR-10)
• Particular neural network architecture (ResNet)
• Initial learning rate of 1e-2
• Number of training epochs (100)

…linear learning rate scheduling worked the best.

No two deep learning projects are alike so you will need to run your own set of experiments, including varying the initial learning rate and the total number of epochs, to determine the appropriate learning rate schedule (additional commentary is included in the “Summary” section of this tutorial as well).

### Do other learning rate schedules exist?

Other learning rate schedules exist; in fact, any mathematical function that accepts an epoch or batch number as input and returns a learning rate can be considered a “learning rate schedule”. Two other learning rate schedules you may encounter include (1) exponential learning rate decay and (2) cyclical learning rates.

I don’t often use exponential decay as I find that linear and polynomial decay are more than sufficient, but you are more than welcome to subclass the LearningRateDecay class and implement exponential decay if you so wish.
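As a rough sketch of what such a subclass could look like, the code below assumes the base class only requires a __call__(epoch) method, as the other schedules in this tutorial do; the stub base class stands in for the tutorial’s real one, and the decay constant k is a hypothetical parameter you would tune:

```python
import math

class LearningRateDecay:  # stub standing in for the tutorial's base class
    pass

# Exponential decay: lr = initial_lr * e^(-k * epoch), a smooth curve that
# falls fastest in the earliest epochs.
class ExponentialDecay(LearningRateDecay):
    def __init__(self, initial_lr=1e-2, k=0.05):
        self.initial_lr = initial_lr
        self.k = k

    def __call__(self, epoch):
        return self.initial_lr * math.exp(-self.k * epoch)

schedule = ExponentialDecay()
print(schedule(0))   # epoch 0: full 1e-2
print(schedule(50))  # epoch 50: decayed by a factor of e^(-2.5)
```

With this interface, the schedule could be handed to Keras in the same way as the tutorial’s other schedules, for example via a LearningRateScheduler callback.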

Cyclical learning rates, on the other hand, are very powerful — we’ll be covering cyclical learning rates in a tutorial later in this series.

### How do I choose my initial learning rate?

You’ll notice that in this tutorial we did not vary our learning rate; we kept it constant at 1e-2.

When performing your own experiments you’ll want to combine:

1. Learning rate schedules…
2. …with different learning rates

Don’t be afraid to mix and match!

The four most important hyperparameters you’ll want to explore include:

1. Initial learning rate
2. Number of training epochs
3. Learning rate schedule
4. Regularization strength/amount (L2, dropout, etc.)

Finding an appropriate balance of each can be challenging, but through many experiments, you’ll be able to find a recipe that leads to a highly accurate neural network.

If you’d like to learn more about my tips, suggestions, and best practices for learning rates, learning rate schedules, and training your own neural networks, refer to my book, Deep Learning for Computer Vision with Python.

Figure 8: Deep Learning for Computer Vision with Python is a deep learning book for beginners, practitioners, and experts alike.

Today’s tutorial introduced you to learning rate decay and schedulers using Keras. To learn more about learning rates, schedulers, and how to write custom callback functions, refer to my book, Deep Learning for Computer Vision with Python.

Inside the book I cover:

1. More details on learning rates (and how a solid understanding of the concept impacts your deep learning success)
2. How to spot under/overfitting on-the-fly with a custom training monitor callback
3. How to checkpoint your models with a custom callback
4. My tips/tricks, suggestions, and best practices for training CNNs

Besides content on learning rates, you’ll also find:

• Super practical walkthroughs that present solutions to actual, real-world image classification, object detection, and instance segmentation problems.
• Hands-on tutorials (with lots of code) that not only show you the algorithms behind deep learning for computer vision but their implementations as well.
• A no-nonsense teaching style that is guaranteed to help you master deep learning for image understanding and visual recognition.

## Summary

In this tutorial, you learned how to utilize Keras for learning rate decay and learning rate scheduling.

Specifically, you discovered how to implement and utilize a number of learning rate schedules with Keras, including:

• The decay schedule built into most Keras optimizers
• Step-based learning rate schedules
• Linear learning rate decay
• Polynomial learning rate schedules

After implementing our learning rate schedules we evaluated each on a set of experiments on the CIFAR-10 dataset.

Our results demonstrated that for an initial learning rate of 1e-2, the linear learning rate schedule, decaying over 100 epochs, performed the best.

However, this does not mean that a linear learning rate schedule will always outperform other types of schedules. Instead, all this means is that for this:

• Particular dataset (CIFAR-10)
• Particular neural network architecture (ResNet)
• Initial learning rate of 1e-2
• Number of training epochs (100)

…linear learning rate scheduling worked the best.

No two deep learning projects are alike so you will need to run your own set of experiments, including varying the initial learning rate, to determine the appropriate learning rate schedule.

I suggest you keep an experiment log that details any hyperparameter choices and associated results; that way, you can refer back to it and double down on experiments that look promising.

Do not expect that you’ll be able to train a neural network and be “one and done” — that rarely, if ever, happens. Instead, set the expectation with yourself that you’ll be running many experiments and tuning hyperparameters as you go along. Machine learning, deep learning, and artificial intelligence as a whole are iterative — you build on your previous results.

Later in this series of tutorials I’ll also be showing you how to select your initial learning rate.

The post Keras learning rate schedules and decay appeared first on PyImageSearch.

### Ethical Data Sensemaking

Simply stated, data sensemaking is what we do to make sense of data. We do this in an attempt to understand the world, based on empirical evidence. Those who work to make sense of data and communicate their findings are data sensemakers. Data sensemaking, as a profession, is currently associated with several job titles, including data analyst, business intelligence professional, statistician, and data scientist. Helping people understand the world based on data is important work. Without understanding, we often make bad decisions. When done well, data sensemaking requires a broad and deep set of skills and a commitment to ethical conduct. When data sensemaking professionals fail to do their jobs well, whether through a lack of skills or other ethical misconduct, confusion and misinformation result, which encourages bad decisions—decisions that do harm. Making sense of data is not ethically or morally neutral; it can be done for good or ill. “I did what I was told” is not a valid excuse for unethical behavior.

In recent years, misuses of data have led to a great deal of discussion about ethics related to invasions of privacy and discriminatory uses of data. Most of these discussions focus on the creation and use of analytical algorithms. I’d like to extend the list of ethical considerations to address the full range of data sensemaking activities. The list of ethical practices that I’m proposing below is neither complete nor sufficiently organized nor fully described. I offer it only as an initial effort that we can discuss, expand, and clarify. Once we’ve done that, we can circle back and refine the work.

The ethical practices that can serve as a code of conduct for data sensemaking professionals are, in my opinion, built upon a single fundamental principle. It is the same principle that medical doctors swear as an oath before becoming licensed: Do no harm.

Here’s the list:

1. You should work, not just to provide information, but to enable understanding that can be used in beneficial ways.
2. You should develop the full range of skills that are needed to do the work of data sensemaking effectively. Training in a data analysis tool is not sufficient. This suggests the need for an agreed-upon set of skills for data sensemaking.
3. You should understand the relevant domain. For instance, if you’re doing sales analysis, you should understand the sales process as well as the sales objectives of your organization. When you don’t understand the domain well enough, you must involve those who do.
4. You should know your audience (i.e., your clients; those who are asking you to do the work)—their interests, beliefs, values, assumptions, biases, and objectives—in part to identify potentially unethical inclinations.
5. You should understand the purpose for which your work will be used. In other words, you should ask “Why?”.
6. You should strive to anticipate the ways in which your findings could be used for harm.
7. When asked to do something harmful, you should say “No.” Furthermore, you should also discourage others from doing harm.
8. When you discover harmful uses of data, you should challenge them, and if they persist, you should expose them to those who can potentially end them.
9. You should primarily serve the needs of those who will be affected by your work, which is not necessarily those who have asked you to do the work.
10. You should not examine data that you or your client have no right to examine. This includes data that is private, which you have not received explicit permission to examine. To do this, you must acquaint yourself with data privacy laws, but not limit yourself to concern only for data that has been legally deemed private if it seems reasonable that it should be considered private nonetheless.
11. You should not do work that will result in the unfair and discriminatory treatment of particular groups of people based on race, ethnicity, gender, religion, age, etc.
12. If you cannot enable the understanding that’s needed with the data that’s available, you should point this out, identify what’s needed, and do what you can to acquire it.
13. If the quality of the data that’s available is insufficient for the data sensemaking task, you should point this out, describe what’s lacking, and insist that the data’s quality be improved to the level that’s required before proceeding.
14. You should always examine data within context.
15. You should always examine data from all potentially relevant perspectives.
16. You should present your findings clearly.
17. You should present your findings as comprehensively as necessary to enable the level of understanding that’s needed.
18. You should present your findings truthfully.
19. You should describe the uncertainty of your findings.
20. You should report any limitations that might have had an effect on the validity of your findings.
21. You should solicit feedback during the data sensemaking process and invite others to critique your findings.
22. You should document the steps that you took, including the statistics that you used, and maintain the data that you produced during the course of your work. This will make it possible for others to review your work and for you to reexamine your findings at a later date.
23. When you’re asked to do work that doesn’t make sense or to do it in a way that doesn’t make sense (i.e., in ways that are ineffective), you should propose an alternative that does make sense and insist on it.
24. When people telegraph what they expect you to find in the data, you should do your best to ignore those expectations or to subject them to scrutiny.

As data sensemakers, we stand at the gates of understanding. Ethically, it is our job to serve as gatekeepers. In many cases, we will be the only defense against harm.

I invite you to propose additions to this list and to discuss the merits of the practices that I’ve proposed. If you are part of an organization that employs other data sensemakers, I also invite you to discuss the ethical dimensions of your work with one another.

### Alison Mattek on physics and psychology, philosophy, models, explanations, and formalization

Alison Mattek writes:

I saw your recent blog post on falsifiable claims. For the past couple of years I have been developing a theoretical framework that highlights the importance of unfalsifiable claims in science. I try to also make a few unfalsifiable claims regarding psychological variables.

Here is Mattek’s paper, “Expanding psychological theory using system analogies.” It reminds me a bit of the writings of Paul Meehl.

This month’s challenge was to find a dataset that makes sense to visualize in circular form—including but not limited to chord diagrams, coxcomb charts, polar area diagrams, radar plots, or sunbursts—and share what you learned about the process. We received over 50 examples of radial charts, with topics including exercise, climate change, activities performed during a 12- or 24-hour period, and music, as the Tableau community recently finished creating 2019 Iron Viz qualifiers.

Participants were largely united in their evaluation of this challenge—finding a subject that fits a circular view is hard! Even harder was evaluating whether the final radial graph was effective at all. Many were quick to point out that they’d be inclined to choose a different type of graph, but appreciated the spirit of the exercise and realized they had learned something in the process. Our intent with the #SWDchallenge is to give the community a platform to practice trying something new in a low-risk environment. While this challenge seemed to push most of us outside of our comfort zones, that discomfort itself is immensely valuable for what it teaches us about our work and ourselves. Without experimentation, we never discover what unexpected approach might be applicable in an uncommon setting, or what technical skills we might enjoy developing. Conversely, it also gives us insight into what attempts are better suited for the intentional discard.

The key lesson from this month is that evaluating our own work can be a challenge—particularly when designing graphs for someone other than ourselves. In our workshops, we demonstrate one technique to step outside of your own head: solicit feedback from a friend or colleague. Show them your visual and have them talk you through their reactions. To which elements do they pay attention? What questions do they still have? This feedback process gives valuable input on whether your graph is working as you intended—and where to focus your rework if needed.

While radial charts may be aesthetically pleasing—humans are naturally drawn to circles and curves—circular visuals are often in conflict with the effective communication of data. If your goal is an attention-getting chart (without much concern over whether the data gets transformed into knowledge or action) then a radial chart may suffice. But keep in mind that you are more likely creating visual art than effective data visualization.

For those of you who did submit examples, THANK YOU for taking the time to create and share your work. The entries below are posted in alphabetical order by first name. If you tweeted or thought you submitted one but don't see it here, upload your submission as a .png here and we'll work to include any late entries this week (tweeting on its own isn't enough—we don't have time to scrape Twitter for entries.)

The next monthly challenge will launch on August 1st. Until then, check out the archives of previous challenges on our #SWDchallenge page. Happy practicing!

Showcasing a Radial Bar chart for this month's SWDChallenge, deciding to present Michael Jackson's UK peak charting singles positions, data sourced from Official Charts.com.
Blog

I used the data from FiveThirtyEight's Halloween candy ranking story to create this sunburst diagram. The original story included a regression analysis of each candy attribute (chocolate, fruit, crispy, etc.), but the sunburst diagram allows the comparison of different attribute combinations, not just each one separately.

## Allison

At first, I wasn't sure if I would attempt this month's challenge because I was nervous about creating a radial visualization. As someone who only opened Tableau for the first time in February 2019, is entirely self-taught, and hasn't thought about trigonometry since high school, I knew I had some work to do!

I began by considering what types of visualizations lend themselves well to this type of view. I tend to almost always err on the side of a simple bar chart to convey the big idea, but I decided to keep an open mind. I also wanted to tell a story about something that would be relatable and interesting. That’s when I thought about climate change. I realized that if I look at global warming over time, I could use a radial visualization to see interesting spikes and trends. I read through a bunch of tutorials and practiced creating radial charts (Kevin Flerlage’s tutorial was especially helpful). I’m not naturally the most mathematically inclined, so I spent way too long playing around with the radius, but I finally made it look right!

Past the radial viz, I wanted to provide a context of what this means for the world currently as well as in the future. I researched implications for continued warming of the world and shared outcomes that scientists think are likely if we reach a 2 degree Celsius warming. I also shared steps individuals can take to help combat climate change (however I really think governments and corporations need to make changes as well). Overall this was a fun challenge that helped me acquire lots of new skills!

## Amanda

Radial charts proved to be a challenge for me, but I appreciate this blog pushing me to try something totally new. I am a sports fanatic, so when I found QB data online, I was excited to put it to use.

## Anand

This is a network visualization connecting Directors and Companies in the Tata Group (public information). When the network stabilizes, it is interesting to see that it neatly segregates into two groups of companies. There are many common directors, but the network is clearly split. This was quite an interesting pattern to see.

## Andy

Great fun. I've never built a radial chart before. I took 5 years of my music listening data and started with a matrix of Weekday/Year. In each cell, I drew a radial bar chart - one bar for each hour in the day. The longer the bar? The more tracks I listened to. I'm pleased with the outcome, but the static version is not readable. I could add some clever annotations to help. The interactive version does allow for insight. I added highlighting and a tutorial overlay to assist the interactive explorer. A bar chart version is way more readable, but even I have to admit the radial version is more appealing to the eye!
Blog

## Angie

I was inspired by the Rhythm of Food example to look at how food interests change over time. Non-dairy milks are of special interest to me as my daughter has a milk allergy and will soon be at the age where she *would* be moving on to cow's milk, but we will need to find an alternative, and I am also trying to better understand the environmental impact of the different non-dairy milks. Oat milk seems to frequently be out of stock at my local Whole Foods, so it was no surprise to me that it was a very hot "milk" in terms of search, but I did not realize it was nearing the same level as soy milk. I did find the Dec to Jan spike across all milks quite interesting and unexpected. I used Excel to create the chart.

## Arthur

Radial charts include many different types of charts, and I selected two of them for this challenge: the radial bar chart and the nested pie chart. A radial bar chart, also known as a circular bar chart, displays the history of life on earth and provides context of the relative time periods for each form of life. The nested pie chart, rendered as a donut chart to display the earth image within, outlines the specific time periods and eras in which life came to be. Combined, they form an infographic where the sum of the parts not only conveys the data, but also represents the earth as its own circular chart visual. The chart has been kept clean of labels and utilizes dynamic tooltips to show additional information the viewer is interested in on hover or tap. You can view the live radial chart to explore this interactivity directly.

## Brian

Maximum duration of western hemisphere eclipses for years 2000-2099 grouped by month:

## Catherine B

It was hard for me to find a subject with data that could be well represented with a circular chart. I had to search for a long time. I think that they can be used only a few times: insights will rarely be best represented with those kinds of charts. Here, I chose a pie chart. Yes, a PIE CHART! Even if billions of people think we should never use them. In this particular case, I believe it was appropriate because I added a clock to make it clear that we are talking about a moment of the day. Not only does the pie chart show part of a whole, but it also represents a specific time of the day. Did I convince you to sometimes use a pie chart? I have to give credit to Udemy for this helpful video. Ok, let's drink a coffee… but only if it is between 9:30 am and 11:30 am!
Interactive viz

## Catherine W

I used R to create a circle packed chart showing the breakdown for different weather patterns by season in Seattle. Larger circles indicate more of that weather pattern.
Blog

## Chris

Finding a subject was a real challenge; I'm not a big fan of radial charts, and forcing something to be radial for the sake of it seemed wrong. I used time in the end because it felt the most natural. Others have commented on how difficult it is to represent a 24hr clock in a single circle - I opted to stick with 24hrs in 360 degrees due to the difficulty of visualising the data any other way. I'm not sure how easy the chart is to interpret, but I love the patterns and visual appeal of the result. I think this appeal overrides a lot of the downsides of the radial chart and the difficulties in reading it.

## Claire

Although I was intimidated by this month's challenge (I'd never plotted data on a radial chart before), it ended up being a textbook demonstration of the benefits of data viz. I chose a dataset at random (Chicago public transportation ridership for 2017) and plotted the values. They were disappointingly consistent over the course of the year, which made for a pretty boring radial chart. I decided to instead plot the percentage change in ridership (compared to average weekday and weekend values), and I was excited to see that the chart - and the data - were suddenly much more interesting! I could see the outliers, and I could point to them to tell a story about what was happening in Chicago in 2017.

## Daniel

I used Tableau to visualize the major killer/victim combinations across all seven seasons of Game of Thrones. Each dot represents a unique killer and victim combination and is sized by the actual number of deaths for that combination.

## Dennis

The idea for this visual was born about two years ago at one of my customers. At the time the idea didn't work out the way it was supposed to, but I think it does now. Having data where nothing takes more than 60 minutes (the thing that went wrong two years ago) makes the results easy to read. To make sure the attention is focused on today, I use gray for all other parts.

## Eddie

A prototype of a radial view to display the sister city connections.

## Edwin

I created this radial graph to show the different eras of boy bands throughout the years. Each line on the graph represents a single boy band, and the length of the line represents how high their peak song got on the Billboard charts. I intentionally shuffled the ordering to vary the coloring for an aesthetic look; it serves no purpose in helping the chart overall.

## Ela

I wanted to find out if there was any correlation between geographical orientation and working hours for European countries in 2018.
Interactive viz

## Elvira

I wanted to compare the amount as well as the categories of my expenses in each month. The data starts from July 2018; that's why January is not on top :)

## Franck

It is a remake of a dataviz featured in Visual Capitalist's article Animation: Global Population by Region From 1950 to 2100, which was inspired by geographer Simon Kuestenmacher. For this remake, the main idea was to make it more connected to our Earth (I found the original pie chart and bar chart lacking this feeling). So I used a disk. The growth of this disk encodes the increase in global population over the years. Within the disk, the area of each cell encodes the population of its corresponding region.

## Frans

I was short of time for creating a brand-new radial dataviz for this month's #swdchallenge, so I tweaked my 'Does size matter' viz slightly.

## Georgios

I find concentric circles fascinating and the first thing that came to mind was to make something that would look like the rings in the Apple Watch activity app.

## Hanna

During the #MakeoverMonday with game of thrones data I came up with an idea of creating radial bars - each bar representing a single episode in the series - that I could then place around the throne. In a way the bars would be an extension of the swords that the throne is made of. I'm not sure this qualifies for the challenge as it is not fully radial, but I thought this method fitted perfectly with the topic.

## Hesham

This is my first SWD challenge submission. The visualisation was made in Tableau and the data preparation was done in Alteryx. It allows users to interact with the visualisation by clicking on their Zodiac sign; doing so, they get to know the history of their sign, when they are most likely to see their constellation, and what their constellation looks like.
Interactive viz

## James

A radial stacked bar chart looking at English Cricket's Greatest batsmen - I love radial charts!

## Jared

The inspiration for this viz comes from the many hours I've spent playing Tekken 7 with my daughter during the school holidays. It turns out she's a bit of a natural!

## Jerome

My visualization illustrates the number of pickups throughout 2016, across all 365 days and the 24 hours within each day.

## Johanie

I made my first chord diagram with R for this challenge, using Statistics Canada data on irrigated area. It's not so easy to play with the aesthetics for this type of graph because it doesn't work like ggplot, but it's still possible to do something good. Code is on my blog.

## Kate

I was excited at first by this challenge, then after a few days felt uninspired. Lucky for me R.J. Andrews of Info We Trust’s weekly data viz inspiration email hit my inbox on July 3rd, and I found the spark. (If you haven’t signed up for these yet- highly recommend, they’ve all been delightful). R.J. highlighted a chart authored by Thomas Jefferson which displayed the fruit and vegetable availability of a local Washington D.C. farmers market from 1801-1808. I re-designed this into a radial viz and added modern seasonal availability data of the vegetables Jefferson recorded. I had fun learning about some of the veggies that have fallen out of “fashion” and playing around with 19th-century design elements. I write more about my lessons learned in my blog.

## Lance

I found this challenge particularly difficult. I am not sure what creating radial charts is like on other platforms, but I use Tableau and found creating this type of chart there a little convoluted! While I appreciate that my entry is not as aesthetically pleasing as pretty much all of the examples provided, as a first attempt I am relatively happy. Despite the difficulties, it was a great learning experience completing this challenge, and, well, that's what this process is all about!

## Leah

I believe time data plots well in a radial graph representing a 24-hour analog clock. I used Seattle collision data to create a high-level view of accident trends throughout the day. I experimented quite a bit with use of color, finally deciding to let the graph (and white space) speak for itself. This plot was created using R.

## Ligia

The chart was made in PowerPoint to demonstrate that data visualization techniques can be used in any tool, all the time. The chart shows in a simple way how unemployment has been impacting the Brazilian people.

## Lisa

For this month's challenge, I was fortunate to find quite a few helpful blog posts and video tutorials on building radial charts in Tableau. As I was getting happily lost in the learning process, time was running out, so I kept the challenge simple by creating a radial bar chart. I chose to tell a clear story about the inequality of life expectancy of women around the world. Although not a chart I typically use, I can now see how radial bar charts can work effectively to catch the eye and attention of an audience.

## Liz

I had done this type of visualization only once before, so it took me some time to find data that would fit. I decided to focus on international travel since I recently traveled to France and the Netherlands. I found this information from the US Travel Association and used Tableau to make the visualization.

## Lori

For this month's challenge, I tried to think of data that naturally occur in circles. I thought of a clock face, and then located some data about toddlers' sleep times. I don't really have fancy data viz technology, but I created this in Excel + PowerPoint. If someone had the technology, I think it would be awesome to be able to hover to see the specific sleep and wake times.

## Matthew

One of my favorite video games is FIFA. This visualization is a breakdown of my favorite team MN United. I have never used Sunburst charts, and am not a fan of Pie or Donut. This did take me out of my comfort zone.

## Michael

This is my first #SWDchallenge. I was crunched for time but wanted to give it a go. I had never produced a radial chart before, so a lot of time went into learning a new technique in Tableau. My storytelling suffered as a result, but I'm excited to continue improving my SWD skills.

## Neil

I wanted to combine visualising music with a radial chart - given the cyclical nature of Pachelbel's Canon there are so many ways I could do this. In the end I settled on this small-multiple version visualising each of the 28 sections - we see how the motifs change and follow each other in sequence while the basso profundo part never once changes.
Interactive version synced with music | Website

## Pris

Heartbreak can feel like an endless cycle -- wake up, feel, think too much, and repeat. I chose to rework an old side project mapping out how often the heartbroken consult their modern-day seer, Google. Using Google Trends data, I tried to recreate the somber time-loop experience with a radial bar chart. Through the accompanying line chart, bar chart, and text, I wanted to highlight the spikes in searches occurring throughout the day.

## Rahul

I built this chart a year ago. I first saw this chart type in Tableau by Adam Crahen (@acrahen), and Roberto Reif's (@robertoreif) article helped me create it.
Interactive viz | Data source | Website

## Robert

I decided to first look at nice circular graphs and which requirements they would pose for the data. I saw a very nice circular graph in the rawgraphs gallery, made by Frederik Ruys called The rise and flow of political parties. It was so nice that I immediately decided to go with that type of graph. I soon found out that rawgraphs does not offer this as a standard graph. They have a standard bump chart which is horizontal and you have to make it circular yourself. I think in the future rawgraphs will offer it as a standard graph type, it definitely should be. After that I looked through Gapminder for a dataset. As I was kind of in a hurry I decided to go with the first dataset that would meet the requirements posed by the chart, which was the fertility dataset.

## Romina

Marvel superhero stats and info: a visualization about the characters.
Website

## Samo

A visualization of all on-screen kills across eight seasons of Game of Thrones, where I chose a radial chart to communicate total kills per season and the breakdown by episode.

Website

## Simon B

For my radial viz I decided to encode countries and winners through a sunburst, then show when titles were won through radial circles, sized by titles per decade. My aim was to show when players and countries dominated the Wimbledon Championships.
Interactive viz | Website

## Simon R

I took inspiration from a blog by Bora Beran to have a go at a coxcomb chart showing average monthly minimum and maximum temperatures for the city of London. It was quite a steep learning curve, and I need to go back and drill into all the table calcs used to create it, but I'm pleased with the look.

## Vasa

A makeover of a segment of a visual story done in 2013, this time focusing on time spent online in our learning management system. It aims to show that e-learning helps students engage with the material even after class time.

## Vijaya

Kiva.org is an online crowdfunding platform that extends financial services to poor and financially excluded people around the world. The radial bar chart is an apt choice for visualizing Kiva's loan sanctions by the sector of the loan borrowers. This representation gives a bird's-eye view of the sectors with the most and fewest loan approvals. Data source credit: Kaggle

## Vinodh

A look at wildfires in the USA from 1980-2018. California is the most wildfire-prone state, and 90% of them are human-triggered.

## Yvan

I made this visualization a few months ago to track my yearly distance goal, by month and by week. Interactive visualization

## Zak

This viz represents daily temperature, precipitation, and severe weather events between June 1, 2017 and June 1, 2019. I downloaded data from the Brackett Creek SNOTEL station in the Bridger Mountains outside of Bozeman, MT. This visualization was created entirely in R.

Click ♥ if you've made it to the bottom—this helps us know that the time it takes to pull this together is worthwhile! Check out the #SWDchallenge page for more. Thanks for reading!

### Airport runway orientation reveals wind patterns

Airport runways are oriented in directions that correlate with wind direction in the area, which helps planes land and take off more easily. So when you map runways around the world, you also get wind patterns, which is what Figures did:

Winds circulate around the globe, forming patterns of gigantic proportions. These patterns become part of human culture and are reflected in our architecture. They are hidden designs, mapping the complexion of the earth, which we can uncover. By orienting on the direction of general winds, airports recreate wind patterns, forming a representation of a global wind map with steel and stone, thus making the invisible visible.


### Things I Learned From the SciPy 2019 Lightning Talks

This post summarizes the interesting aspects of the Day One of the SciPy 2019 lightning talks, a flash round of a dozen ~3 minute talks covering a wide variety of topics.

### Top Stories, Jul 15-21: The Death of Big Data and the Emergence of the Multi-Cloud Era; Bayesian deep learning and near-term quantum computers

Also: Dealing with categorical features in machine learning; Computer Vision for Beginners: Part 1; Big Data for Insurance; A Summary of DeepMind's Protein Folding Upset at CASP13; The Hackathon Guide for Aspiring Data Scientists

### Take your RStudio Hotkeys Online with a Text Expander

(This article was first published on George J. Mount, and kindly contributed to R-bloggers)

As a blogger and curriculum developer, I am often writing about R from a text editor instead of RStudio. I feared that working outside RStudio meant saying goodbye to the fantastic hotkeys it provides. Isn’t there a way, I wondered, to set up a keyboard shortcut on my computer so that, for example, Alt + - would always return  <- ?

I put out the call on LinkedIn (Let’s connect if we haven’t!), and fortunately my friend Mike Cantrell (a new data analyst himself) had the answer: a text expander.

I consider myself a productivity geek, but this is a new application to me. A favorite of lawyers and technical writers, text expanders allow you to create custom “hotkeys” of varying length and complexity. Turning tcby into The Country's Best Yogurt, for example, is a nice early 90’s example of what text expanders can do.
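PhraseExpress handles the keyboard hooking for you, but the core behavior of a text expander is simple enough to sketch. Below is a toy Python illustration (not how PhraseExpress works internally; the `;asn` and `;pipe` triggers are made up for this example):

```python
# Toy text expander: map abbreviation triggers to their expansions.
# The ";asn" and ";pipe" triggers are hypothetical, chosen just for this sketch.
EXPANSIONS = {
    "tcby": "The Country's Best Yogurt",
    ";asn": "<-",    # R assignment operator
    ";pipe": "%>%",  # magrittr pipe operator
}

def expand(text: str) -> str:
    """Replace every known abbreviation in `text` with its expansion."""
    for abbrev, phrase in EXPANSIONS.items():
        text = text.replace(abbrev, phrase)
    return text

print(expand("I love tcby"))                    # I love The Country's Best Yogurt
print(expand("result ;asn df ;pipe head(10)"))  # result <- df %>% head(10)
```

A real expander does this at the operating-system level, watching keystrokes in any application rather than transforming a string after the fact.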


But, readers, I know you’re more hungry for R productivity than frozen yogurt, so below I’ll walk through how to set up the assignment operator and pipe shortcuts instead.

### 1. Download PhraseExpress

A bit of research showed a rather competitive market for text expanders, and I landed on PhraseExpress because it’s free for personal use and works on a variety of platforms and operating systems.

### 2. Add a new phrase

The PhraseExpress interface is fairly straightforward, at least for the very basic task we are performing. On the main menu, click Phrases > New Phrase.

### 3. Describe and assign the shortcuts

By default, PhraseExpress will drop your Description attributes into Phrase content. That’s not the best option for our example: instead, complete the former with R assignment operator and the latter with <- . Be sure to include spaces before and after the assignment operator.

Now you can assign this phrase to the Alt + - hotkey in the bottom window of this screen:

One more while we’re here: Let’s make one for %>%, the “pipe” operator. Its RStudio shortcut is Ctrl + Shift + M.

### 4. Code away!

Adding these to PhraseExpress, we can now use these shortcuts anywhere on our device.

Where else to practice this but in the console of my very own DataCamp course?

### 5. Watch for conflicts

Keep in mind that these shortcuts will apply across all applications on your computer, so you may run into conflicts with any programs that use Alt + - or Ctrl + Shift + M as keyboard shortcuts.

Have you used text expanders before? If so, how? What advice would you give to new users?

### Ready to take your R journey? Check out my comprehensive course, R Explained for Excel Users.


### If you did not already know

Contextual Bilateral Loss (CoBi)
This paper shows that when applying machine learning to digital zoom for photography, it is beneficial to use real, RAW sensor data for training. Existing learning-based super-resolution methods do not use real sensor data, instead operating on RGB images. In practice, these approaches result in loss of detail and accuracy in their digitally zoomed output when zooming in on distant image regions. We also show that synthesizing sensor data by resampling high-resolution RGB images is an oversimplified approximation of real sensor data and noise, resulting in worse image quality. The key barrier to using real sensor data for training is that ground truth high-resolution imagery is missing. We show how to obtain the ground-truth data with optically zoomed images and contribute a dataset, SR-RAW, for real-world computational zoom. We use SR-RAW to train a deep network with a novel contextual bilateral loss (CoBi) that delivers critical robustness to mild misalignment in input-output image pairs. The trained network achieves state-of-the-art performance in 4X and 8X computational zoom. …
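The "bilateral" idea here -- matching on feature similarity while softly penalizing spatial displacement, so mildly misaligned input-output pairs still match -- can be sketched in a few lines of NumPy. This is a greatly simplified illustration of the concept, not the paper's CoBi loss (which operates on deep features of real RAW/RGB image pairs):

```python
import numpy as np

def bilateral_matching_loss(src_feats, src_xy, tgt_feats, tgt_xy, ws=0.1):
    """For each source feature, find the nearest target feature under a
    combined feature-space + spatial distance, then average those distances.
    Because spatial distance is only softly weighted (ws), a source patch can
    match a slightly shifted target patch -- tolerating mild misalignment."""
    total = 0.0
    for f, p in zip(src_feats, src_xy):
        d_feat = np.linalg.norm(tgt_feats - f, axis=1)  # feature distance
        d_spat = np.linalg.norm(tgt_xy - p, axis=1)     # spatial distance
        total += np.min(d_feat + ws * d_spat)           # best combined match
    return total / len(src_feats)

# Identical, perfectly aligned feature sets give zero loss.
feats = np.random.rand(8, 16)
xy = np.random.rand(8, 2)
print(bilateral_matching_loss(feats, xy, feats, xy))  # 0.0
```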

Uncertainty-Aware Feature Selection (UAFS)
Missing data are a concern in many real world data sets and imputation methods are often needed to estimate the values of missing data, but data sets with excessive missingness and high dimensionality challenge most approaches to imputation. Here we show that appropriate feature selection can be an effective preprocessing step for imputation, allowing for more accurate imputation and subsequent model predictions. The key feature of this preprocessing is that it incorporates uncertainty: by accounting for uncertainty due to missingness when selecting features we can reduce the degree of missingness while also limiting the number of uninformative features being used to make predictive models. We introduce a method to perform uncertainty-aware feature selection (UAFS), provide a theoretical motivation, and test UAFS on both real and synthetic problems, demonstrating that across a variety of data sets and levels of missingness we can improve the accuracy of imputations. Improved imputation due to UAFS also results in improved prediction accuracy when performing supervised learning using these imputed data sets. Our UAFS method is general and can be fruitfully coupled with a variety of imputation methods. …
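As a rough illustration of the idea -- scoring features by informativeness while penalizing missingness -- here is a minimal NumPy sketch. The scoring rule below is my own simplification for exposition, not the authors' UAFS method:

```python
import numpy as np

def uncertainty_aware_select(X, y, alpha=1.0, k=2):
    """Rank features by |correlation with y| on observed rows, penalized by
    each feature's missingness rate (a crude stand-in for 'uncertainty').
    X: (n, p) array with np.nan marking missing entries. Returns indices of
    the k best features."""
    scores = []
    for j in range(X.shape[1]):
        col = X[:, j]
        mask = ~np.isnan(col)
        miss = 1.0 - mask.mean()                 # fraction missing
        if mask.sum() < 2 or np.std(col[mask]) == 0:
            scores.append(-np.inf)               # unusable feature
            continue
        r = abs(np.corrcoef(col[mask], y[mask])[0, 1])  # informativeness
        scores.append(r - alpha * miss)          # penalize missingness
    return np.argsort(scores)[::-1][:k]

# Toy demo: feature 0 tracks y closely with no missingness;
# feature 1 is noise with half its entries missing.
rng = np.random.default_rng(0)
y = rng.normal(size=100)
X = np.column_stack([y + 0.01 * rng.normal(size=100), rng.normal(size=100)])
X[::2, 1] = np.nan
print(uncertainty_aware_select(X, y, k=1))  # feature 0 wins
```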

Deep Feature Fusion-Audio and Text Modal Fusion (DFF-ATMF)
Sentiment analysis research has developed rapidly in the last decade and has attracted widespread attention from academia and industry, most of it based on text. However, information in the real world usually comes in different modalities. In this paper, we consider the task of Multimodal Sentiment Analysis using audio and text modalities, and propose a novel fusion strategy, including Multi-Feature Fusion and Multi-Modality Fusion, to improve the accuracy of audio-text sentiment analysis. We call this the Deep Feature Fusion-Audio and Text Modal Fusion (DFF-ATMF) model; the features learned from it are complementary to each other and robust. Experiments with the CMU-MOSI corpus and the recently released CMU-MOSEI corpus for YouTube video sentiment analysis show the very competitive results of our proposed model. Surprisingly, our method also achieves state-of-the-art results on the IEMOCAP dataset, indicating that our proposed fusion strategy also generalizes well to Multimodal Emotion Recognition. …
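The simplest baseline for combining an audio branch and a text branch is late fusion by weighted averaging of their output probabilities. DFF-ATMF fuses learned features rather than outputs, so this sketch only illustrates the general multimodal-fusion idea, not the paper's model:

```python
import numpy as np

def late_fusion(audio_probs, text_probs, w_audio=0.5):
    """Combine per-class probabilities from an audio branch and a text
    branch by weighted averaging -- the simplest form of modality fusion.
    Returns a renormalized class distribution."""
    fused = w_audio * audio_probs + (1 - w_audio) * text_probs
    return fused / fused.sum()  # renormalize to a valid distribution

audio = np.array([0.7, 0.2, 0.1])  # e.g. positive / neutral / negative
text = np.array([0.3, 0.3, 0.4])
print(late_fusion(audio, text))    # fused distribution: [0.5, 0.25, 0.25]
```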

Distance Metric Learned Collaborative Representation Classifier (DML-CRC)
Any generic deep machine learning algorithm is essentially a function-fitting exercise, where the network tunes its weights and parameters to learn discriminatory features by minimizing some cost function. Though the network tries to learn the optimal feature space, it seldom tries to learn an optimal distance metric in the cost function, and hence misses out on an additional layer of abstraction. We present a simple, effective way of achieving this by learning a generic Mahalanobis distance in a collaborative loss function in an end-to-end fashion, with any standard convolutional network as the feature learner. The proposed method, DML-CRC, gives state-of-the-art performance on the benchmark fine-grained classification datasets CUB Birds, Oxford Flowers and Oxford-IIIT Pets using the VGG-19 deep network. The method is network-agnostic and can be used for any similar classification task. …
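The learned metric at the heart of this approach is a Mahalanobis distance, d(x, y) = sqrt((x-y)^T M (x-y)) with M parameterized as L^T L so that M stays positive semi-definite while L is learned. A minimal NumPy sketch of the distance itself (omitting the collaborative loss and the convolutional feature learner):

```python
import numpy as np

def mahalanobis(x, y, L):
    """Distance under M = L^T L: d(x, y) = sqrt((x-y)^T M (x-y)).
    Parameterizing M via L keeps it positive semi-definite, so L can be
    learned jointly with the features by gradient descent."""
    d = L @ (x - y)                # project the difference through L
    return float(np.sqrt(d @ d))   # Euclidean norm in the learned space

# With L = I this reduces to plain Euclidean distance.
x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(mahalanobis(x, y, np.eye(2)))  # 5.0
```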

### Seeking Reproducibility within Social Science: Search and Discovery

Julia Lane, NYU Professor, Economist and cofounder of the Coleridge Initiative, presented “Where’s the Data: A New Approach to Social Science Search & Discovery” at Rev. Lane described the approach that the Coleridge Initiative is taking to address the science reproducibility challenge. The approach is to provide remote access for government analysts and researchers to confidential data in a secure data facility and to build analytical capacity and collaborations through an Applied Data Analytics training program.  This article provides a distilled summary and a written transcript of Lane’s talk at Rev. Many thanks to Julia Lane for providing feedback on this post prior to publication.

# Session Summary

Science is facing a research reproducibility challenge that hampers data scientists’ and researchers’ ability to accelerate their work and provide insights that impact their organizations. Since Domino’s inception, we have tackled the reproducibility problem to support our customers via continued updates to the platform’s collaboration functionality, as well as by contributing to the public discourse on this blog and at industry events, including Rev. In the Rev session “Where’s the Data: A New Approach to Social Science Search & Discovery”, Julia Lane provided insights into how the Coleridge Initiative is addressing the reproducibility challenge: providing secure remote access to confidential data for government analysts and researchers, and building analytical capacity and collaborations through an Applied Data Analytics training program. The goal is to enable government agencies to make better evidence-based policy decisions with better-quality data. Lane, staying true to the ethos of reproducibility, covered how the approach could allow approved analysts to reuse and ingest insights to accelerate their work. Lane discussed the questions Coleridge sought to answer: how to improve the analytical rigor associated with working with linked data; how Coleridge built and tested a set of tools to identify which data have been used to address different research questions; and how the approach can inform new researchers as they access the secure environment with new projects. Lane closed the session by noting that the intention of the initiative “is to build evidence-based policy making whereby we get better knowledge, get better policy, get resources allocated better, and reduce the cost and burden of collecting information.”

A few highlights from the session include:

• how modern approaches can be used to obtain better-quality data, at lower cost, to help support evidence-based policy decisions
• the unique challenges of sharing confidential government microdata and the importance of access in generating high-quality inference
• a pragmatic evaluation of risk-utility tradeoffs, i.e., the risk resulting from greater data usage versus the risk of disclosure or reidentification
• how Coleridge used its training classes to build the capacity of government agency staff to address data quality challenges
• the importance of pairing researcher training with access in a secure environment

Additional insights are available in the written transcript.

# Transcript

Julia Lane:

I’m an economist by training and that is where my focus is going to be. Let me tell you the story. I’m at NYU, as you can tell by my strong New York accent, and I’ve spent most of my career working with federal statistical agencies, the agencies that bring you the Decennial Census, the unemployment rate, GDP, and so on. A major challenge that you face with using those kinds of data for decision making is data is very expensive to collect and the quality of the data is going down.

How many of you are familiar with the fact that the Decennial Census is going to be fielded next year? Okay. Roughly how many people do you think there are in the United States? 350, 360 million, we’ll find out more maybe next year. How much do you think it costs to collect…count that number of people? $360 million? Try north of that. $2 billion, do I hear two? Okay. Anyone want to go higher? Higher, more, more. $17 billion, okay, to count the number of people and ask them 10, maybe 11 questions. Right? The challenge that we have is that the quality of the data is going down. People aren’t responding, they’re giving bad responses, and so on. About three years ago, the Evidence-Based Policymaking Commission was established, which brought together experts from around the country to figure out how you can develop data to make decisions at better quality and lower cost. They came up with a set of recommendations that got passed into the Evidence-Based Policymaking Act this year. They asked NYU to build an environment to inform the deliberations of the Commission, and show how you can bring together data from multiple sources and make sense of them at the level of quality that you need to be able to allocate resources to make decisions. That’s why it’s called evidence-based policy. Of course, the challenge with bringing data together is that the data, particularly on human beings, is quite complicated. When it is generated from multiple different sources, it’s not just a matter of mashing them together. You need to understand how they were generated, what the issues are and so on. Together with that, we built training classes to teach government agency staff, researchers, and analysts who want to work with the data what the issues are and how to work with them. Here’s the challenge. When you’re working with complicated data and when it is confidential, because it is data on human beings, it’s very difficult to find out what work has been done on them before.
Every time someone starts working with the data, it’s a tabula rasa. You can’t figure out what’s going on. The challenge we had to solve is the one I’m going to talk to you about today: when you land on a dataset that is newly generated, that isn’t curated in any way, how do I figure out what’s in there and how do I figure out who else has worked with it? And if you think about it, this is an amazon.com problem, right? The reason Jeff Bezos made so much money is that he solved the problem of figuring out what was in books from information generated by the people who used them, rather than just from the way the books were produced. What we wanted to do was build an amazon.com for data, and that’s the story I’m going to talk about today, and give you some sense of the platform that we’re trying to build. What am I talking about with the new types of data? Back in the day, data used to be generated by someone filling out a survey form, right? A statistical agency collected it, curated it, documented it, sent it out. Or it could be administrative records, records generated from the administration of a government program, like tax data. Nowadays, we’re also looking at new types of data: data generated by sensors, generated by your card swipes, like retail trade data, Mastercard or IRI data, or your DNA. These are complicated datasets. They don’t come nicely curated, but they can add a lot of understanding that informs policy. That data needs to be shared. The challenge we face is that when it’s confidential data on human beings, there are, quite rightly, many prohibitions on sharing the knowledge. Everything is much more siloed than in the open data world that you’re used to dealing with.
For example, as this slide shows, the commissioners in the city of Baltimore will get together every time a child dies to share information about all the government programs that child has touched. It might be housing, education, welfare, foster care and so on. But the only time they share the knowledge is when the kid is dead. What we’re trying to do here is build a knowledge infrastructure that enables us to share the knowledge before children die, in ways that can improve policy. But the challenge is, and this is a risk-utility tradeoff, that the value of working with confidential data is that the more people and the more use it gets, the better the policy, but also the greater the risk of disclosure. What you have to do is try to manage that disclosure. If you can, you really can build better policy. New Zealand is a good example, as these slides, developed by the former prime minister Bill English, show. Like most countries or cities, you know, there are three big areas of expenditure: education, health, pensions. And what you want to do, if you want to allocate resources a little bit better, is to use the integrated data generated from multiple government programs a little bit better. What do I mean by integrated data, for example? Here’s a kid, and the age of the kid is along here; down the bottom here is the cost to the taxpayer, among other things. You look at how this kid hits Children, Youth, Family Services, Abuse, Foster Care, Education, Youth Justice, Income Support or Welfare, right? Kid gets born, by age about two and a half is showing up with Children, Youth, Family Services and notifications of abuse. You start seeing the kid here, more abuse, more visits from Family Services. Starts education, by about 9, 10, 11, spotty education, gets taken into care, then by 17 he hits Youth Justice, and then goes into income support. Pretty predictable if you put that information together.
I don’t need to belabor the issues associated with it here. That kind of information, if you put it together and understand it, can help allocate resources. This is getting a school certificate, an education qualification. If the kid gets it by age 18, he’s in pretty good shape for the future. If he doesn’t get it, it’s a pretty bad indicator. Based on the data, you can rank the likelihood of a kid achieving or not achieving this school certificate by age 18, and you can figure out which kids are at the highest risk of not getting the qualification. If you allocate resources away from, you know, kids like my kids, who don’t need interventions, who don’t need the kind of services these kids need, you can reallocate funding, tremendously reduce the cost to the taxpayer, and transform those children’s lives. It is this type of vision that led to the Evidence-Based Policymaking Act. The big thing is how you put the data together securely, in a clearinghouse, so that the kind of work I’m talking about can be implemented by government analysts and researchers in a secure way, with the risk of re-identification minimized. We built a clearinghouse, which we called the Administrative Data Research Facility. I don’t like the term clearinghouse for any number of reasons. It has to be program…mission specific, so I prefer the term facility. But the key thing was not just building the clearinghouse but also building a training program that worked with it. I’m not going to go into too much detail, but the basic idea is that you’re going to have a secure environment. And then, of course, a major challenge is that you have to have telemetry to figure out who’s accessing it. But you also need to have metadata around the data. You need to have a rich context, because if it’s just zeros and ones, I have no clue what it is.
And many of you who have worked with open data will have observed that open data can be a bit dodgy in terms of quality. Part of the challenge is the way in which the data was generated: it was just kind of… a vomit of data that was put together and summarized, and it doesn’t really have any cables going back to the microdata engine. What you really need is for the data users to be involved in the metadata documentation. In other words, again drawing the analogy with the statistical system, the way in which the agencies generate data for use is that human beings create metadata documentation very painstakingly, and you get a report on what all the variables mean, how they were generated and so on. Highly manual process. What we want to do is generate an automated way of finding out what information is in the data and what the quality of that information is. Just let me give you a flavor of what that is. In one of the classes we have data on ex-offenders. The programmatic question is, what’s the impact of access to jobs and neighborhood characteristics on the earnings and employment outcomes of ex-offenders, and their subsequent recidivism? Here’s where we’re in…why this is all on [inaudible 00:14:09]. They’re going back and forth. Wouldn’t it be great to have that tacit knowledge codified so that as people start working with the data, the metadata documentation is generated automatically, like amazon.com? Right? That’s the basic idea. You know, instead of you saying, “Where’s the data coming from, and how was it documented just based on the way it was produced?” you’ve got the community telling you something about what the data is about. Okay. Here’s my challenge. I want to figure out, when I land on a dataset, who else has worked with the data, on what topics and for what results. And then I want to generate a community that’s going to contribute knowledge. It’s kind of an amazon.com for data. Okay. 
How am I going to build a machine that’s going to do that? Remember the statistical agencies. I slammed them at the beginning, but, you know, these are great people. These are hardworking, wonderful human beings who have great motivation, but they’re like a pre-Industrial Revolution data factory. Now what we’re trying to do is build a modern data factory, a modern approach to automate the generation of the metadata. Essentially, what we’re going to try and figure out how to do is scope the question, pose it to the computer science community, the natural language processing and machine learning community, and say, “Can you figure out how you can learn from it, automate it, rinse and repeat?” And here’s the core insight. We’re interested in who has worked with this data before; we want to identify them and then figure out what they did with it. If you think about it, all of that knowledge is embedded in publications, either published work or working papers or government reports. It’s in a document somewhere. In that document, if it’s empirical, someone has said, “Here’s my question, here’s what I’m going to do with it, and here’s the section that describes the data.” What I want to do is tee the computer scientists up to help me figure out where the dataset is and where the semantic context is that’s going to point me to it. Okay. And then I’m going to get the community at large to tell me whether they’ve done it right or wrong, and then fix it from there. Essentially, one of the communities we’re working with is the U.S. Department of Agriculture. But anyway, with USDA, you’ll see one of the things they look at is NHANES, the national health and nutrition examination dataset. You’ll see here they have something that says analytical sample. They say something about the data. What we want is for the computer scientists to go figure out where that data is. That’s what we did. We ran a competition. We took a hand-curated corpus. 
Social science data repositories, public use ones, sit in different places across the country. One, at the University of Michigan, is ICPSR. There are three people whose job, every day, is to read papers and say which of the ICPSR datasets is in each paper, manually. Then they write it down and put it up to say what’s been done. We took that corpus and we ran a competition. We had 20 teams from around the world compete. Twelve of them submitted code; four very kind people here helped advertise it, thank you for that. Then we had four finalists. What amazed us was that the models could actually do this. Think about it. If I hold up a publication and it… Can you tell me what the dataset is that’s referenced in there? And the answer is, of course, no. The baseline is zero. The winning algorithm correctly identified the dataset that was being cited in the publication 54% of the time. And that’s amazing, right? Now there’s a lot of work to be done yet. I’ve skipped over all the bugs and problems and so on, but it’s still super encouraging, right? Because once I get that dataset-to-publication link, that gives me the rich context, that gives me the potential to find out everything else, because there’s a lot of work that’s done on publications by my colleagues at Digital Science, UberResearch. Over the past 10 years, they have linked publications to grants, policy documents, patents, clinical trials, and so on. Once I get that dyad, I’m off to the races, right? What that enables me to do is to figure out, for a particular publication, everything around it. That was my goal. Now there’s a lot of work that needs to be done around that. But, for example, again, this is the Dimensions website, and I’m not going to go live because I’m going to get yelled at, but I could show you live: you can type in the dataset, and you get lots of related information. 
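Stripped to its core, the competition task can be approximated by a naive baseline: scan a publication’s text for known dataset titles. A minimal sketch, purely illustrative and not the winning algorithm (the title list and the excerpt are invented):

```python
# Naive baseline for dataset-mention detection: exact title matching.
# The competition entries used far richer NLP and machine learning;
# this only illustrates the task of linking publications to datasets.

KNOWN_DATASETS = [
    "American Community Survey",
    "NHANES",
    "PSID",
]

def find_dataset_mentions(text):
    """Return the known dataset titles that appear verbatim in the text."""
    return [name for name in KNOWN_DATASETS if name in text]

excerpt = ("Our analytical sample is drawn from NHANES, restricted to "
           "adults surveyed between 2011 and 2014.")
print(find_dataset_mentions(excerpt))  # ['NHANES']
```

Note that exact matching is exactly where such a baseline breaks down: untitled sources like “LinkedIn data” or “retail scanner data” never match a title list, which is why semantic context matters.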
You find out who the researchers are, what the related topics are and so on. And that is going to give me the knowledge that I was looking for. It was a pretty buggy model, even though it was amazing. There’s a whole amount of work that needs to be done on it. Now what we want to do is go and get that dyad cleaned up. The biggest problem is, and you probably already figured this out, that the search we had was on titled datasets: things that were called American Community Survey or NHANES or PSID or something like that. But there are a lot of datasets that human beings work with where they’ll say, “Oh, we were working with LinkedIn data, we were working with Twitter data,” something that’s not labeled, or, with IRI retail scanner data, something that doesn’t actually have a title that you could go and find. We need much better knowledge from the semantic context. That means we need to develop a corpus, a tagged corpus, that the machine learning algorithms can be trained on. We’re working with Dimensions at Digital Science, and we’re also trying to get human-curated input in a number of different ways. One is working with publishers, where the [inaudible 00:22:01], when an author submits a publication and they say, “Can you give us some keywords,” we can tell them, “Give us some…tell us what datasets are in there.” Right? If they just tell us what datasets are in there, I’ve got a dyad right away. Right? And then I can train. When researchers are getting onboarded into a secure environment and they are asking about what datasets are available, you could get them to contribute their knowledge as well. In the classes we run, 300 government analysts have gone through who are subject matter experts; we could get them to tell us what they know about datasets that are available for the common good. If any of you are social scientists, please go to this; we’re asking our colleagues to just fill it in. 
It turns out, you know, if we get a thousand well-curated documents with public datasets, that’s going to be enough to seed the next iteration. This is where we go next. The Digital Science guys are the guys who brought us Altmetrics. It turns out that people really like that shiny little badge, the Altmetrics badge; they go to it and click on it. What we’re designing is that if you type in and look for a dataset now, and we’re working with the Deutsche Bundesbank as well, it will then pull up the related publications. Then the idea here is that for every publication that’s pulled up, you get the dataset context that’s in there, and it’s going to say how many experts, how many papers, how many code books, how many annotations there are associated with it. Then when you click on that, up pops more rich context: I can find all the other papers, experts, code books, annotations and related datasets. And then up here is a call for action, right? We don’t have to have everyone responding, but we’ll trial this out to see how well we do on that. And then of course, feed that into this approach. Then the last step, and this is work with Brian Granger and Fernando Perez, is to build it into Jupyter notebooks. One of the things that Fernando and Brian have been doing is trying to make the notebooks more collaborative, because right now a notebook is just a single computational narrative. And they’re also trying to make them able to work with confidential microdata, for all the reasons that I talked about. Here’s the basic idea. Currently, when you land on a dataset, if you’re lucky, all you get is the metadata that was generated by the way the dataset was produced. The analogy is Jeff Bezos again. When you look for a book, what do you find? The ISBN number, the author, the title, the publisher, right? That’s the metadata generated by the way the book was produced. What you really want is knowledge about the data itself. 
I don’t have to go into a bookstore to find out; I can just find out who else like me has used the information. And again, we’re building this into Jupyter notebooks, we being the Jupyter team in conjunction with our team, and the notion here is: remember the Slack communications that we had? Build that into annotations. Here’s Brian and Fernando just putting stuff in, but build that in so that the annotation, that tacit knowledge, gets codified and built into the graph model that underlies the data infrastructure. That’s the sweep of the story. We want to be able to build evidence-based policymaking whereby we get better knowledge, get better policy, get resources allocated better, and reduce the cost and the burden of collecting information. We started off with building a secure environment and building workforce capacity around it. We’re kind of at this stage right now. Where we want to head is to build a platform. If you want more information, we’re hiring here in scenic New York, at NYU. A lot of the information is here and also on our website. You may wonder why it’s called the Coleridge Initiative. How many of you have heard of Samuel Taylor Coleridge? Great. Okay. Very famous for “The Rime of the Ancient Mariner,” right? We were trying to figure out what to call this thing, and we thought, “Data Science for the Public Good,” [blech] “Evidence-Based Policy,” [blech]. Coleridge Initiative seemed obvious, right? “The Rime of the Ancient Mariner”: “Water, water everywhere, nor any drop to drink,” right? Here, it’s “Data, data everywhere, we have to stop and think.” That’s why it’s called the Coleridge Initiative. This transcript has been edited for readability. 
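The dataset-to-publication dyads the talk keeps returning to can be thought of as edges in a graph; once you have them, “who else has worked with this data” is a simple traversal. A minimal sketch under that framing (all paper titles, authors, and the `experts_for` helper are invented for illustration):

```python
# Toy graph of dataset-publication dyads, in the spirit of the talk.
# Every name below is made up; real systems would draw these edges
# from linked publication metadata.

dyads = [
    ("Paper A", "NHANES"),
    ("Paper B", "NHANES"),
    ("Paper C", "American Community Survey"),
]
authors = {"Paper A": ["Alice"], "Paper B": ["Bob"], "Paper C": ["Carol"]}

def experts_for(dataset):
    """Everyone who authored a publication linked to this dataset."""
    pubs = [pub for pub, ds in dyads if ds == dataset]
    return sorted({a for pub in pubs for a in authors[pub]})

print(experts_for("NHANES"))  # ['Alice', 'Bob']
```

The same traversal, run in the other direction, answers “which datasets has this researcher touched,” which is the rich context the dyad unlocks.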
Continue Reading… ### Magister Dixit “And that’s where the statistician needs to take it easy:
• Start with the results, so the audience has a clear view on the outcome
• Proceed to explain the analysis simply and with a minimum of statistical jargon
• Describe what an algorithm does, not the specifics of your killer algo
• Visualize the inputs (e.g.: a correlation matrix showing an ‘influence heat map’)
• Visualize the process (e.g.: a regression line on a chief predictor variable)
• Visualize the results (e.g.: a lift chart to show how much the analysis is improving results)
• Always, always tie each step back to the business challenge
• Always be open to questions and feedback.” Andrew Pease ( November 3, 2014 ) Continue Reading… ### What is a bad chart? In the recent issue of Madolyn Smith’s Conversations with Data newsletter hosted by DataJournalism.com, she discusses “bad charts,” featuring submissions from several dataviz bloggers, including myself. What is a “bad chart”? Based on this collection of curated “bad charts,” it is not easy to nail down “bad-ness.” The common theme is a mismatch between the message intended by the designer and the message received by the reader, a classic error of communication. How such a mismatch arises depends on the specific example. I am able to divide the “bad charts” into two groups: charts that are misinterpreted, and charts that are misleading. Charts that are misinterpreted: The Causes of Death entry, submitted by Alberto Cairo, is a “well-designed” chart that requires “reading the story where it is inserted and the numerous caveats.” So readers may misinterpret the chart if they do not also partake of the story at Our World in Data, which runs over 1,500 words not including the appendix. The map of Canada, submitted by Highsoft, highlights in green the provinces where the majority of residents are members of the First Nations. 
The “bad” is that readers may incorrectly “infer that a sizable part of the Canadian population is First Nations.” In these two examples, the graphic is considered adequate and yet the reader fails to glean the message intended by the designer. Charts that are misleading: Two fellow bloggers, Cole Knaflic and Jon Schwabish, offer the advice to start bars at zero (here’s my take on this rule). The “bad” is the distortion introduced when encoding the data into the visual elements. The Color-blindness pictogram, submitted by Severino Ribecca, commits a similar faux pas. To compare the rates among men and women, the pictograms should use the same baseline. In these examples, readers who correctly read the charts nonetheless leave with the wrong message. (We assume the designer does not intend to distort the data.) The readers misinterpret the data without misinterpreting the graphics. Using the Trifecta Checkup: In the Trifecta Checkup framework, these problems are second-level problems, represented by the green arrows linking up the three corners. (Click here to learn more about using the Trifecta Checkup.) The visual design of the Causes of Death chart is not under question, and the intended message of the author is clearly articulated in the text. Our concern is that the reader must go outside the graphic to learn the full message. This suggests a problem related to the syncing between the visual design and the message (the QV edge). By contrast, in the Color Blindness graphic, the data are not under question, nor is the use of pictograms. Our concern is how the data got turned into figurines. This suggests a problem related to the syncing between the data and the visual (the DV edge). *** When you complain about a misleading chart, or a chart being misinterpreted, what do you really mean? Is it a visual design problem? A data problem? Or is it a syncing problem between two components? 
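The distortion behind the start-bars-at-zero rule is easy to quantify: the apparent ratio between two bars is the ratio of their distances from the axis baseline, not of their values. A small numeric sketch (the values and the `visual_ratio` helper are invented for illustration):

```python
# How a truncated axis exaggerates the difference between two bars.
# A bar's on-screen height is proportional to (value - baseline).

def visual_ratio(a, b, baseline=0):
    """Apparent height ratio of bar b to bar a, given the axis baseline."""
    return (b - baseline) / (a - baseline)

a, b = 100, 105                         # a 5% real difference
print(visual_ratio(a, b))               # 1.05: bars look about 5% apart
print(visual_ratio(a, b, baseline=95))  # 2.0: bar b looks twice as tall
```

This is the sense in which a correctly read chart can still deliver a wrong message: the geometry encodes a twofold difference where the data holds a 5% one.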
Continue Reading… ### What’s new on arXiv The rapid growth of data in velocity, volume, value, variety, and veracity has enabled exciting new opportunities and presented big challenges for businesses of all types. Recently, there has been considerable interest in developing systems for processing continuous data streams with the increasing need for real-time analytics for decision support in business, healthcare, manufacturing, and security. The analytics of streaming data usually relies on the output of offline analytics on static or archived data. However, businesses and organizations like our industry partner Gnowit, strive to provide their customers with real time market information and continuously look for a unified analytics framework that can integrate both streaming and offline analytics in a seamless fashion to extract knowledge from large volumes of hybrid streaming data. We present our study on designing a multilevel streaming text data analytics framework by comparing leading edge scalable open-source, distributed, and in-memory technologies. We demonstrate the functionality of the framework for a use case of multilevel text analytics using deep learning for language understanding and sentiment analysis including data indexing and query processing. Our framework combines Spark streaming for real time text processing, the Long Short Term Memory (LSTM) deep learning model for higher level sentiment analysis, and other tools for SQL-based analytical processing to provide a scalable solution for multilevel streaming text analytics. Recently, collaborative robots have begun to train humans to achieve complex tasks, and the mutual information exchange between them can lead to successful robot-human collaborations. 
In this paper we demonstrate the application and effectiveness of a new approach called \textit{mutual reinforcement learning} (MRL), where both humans and autonomous agents act as reinforcement learners in a skill transfer scenario over continuous communication and feedback. An autonomous agent initially acts as an instructor who can teach a novice human participant complex skills using the MRL strategy. While teaching skills in a physical (block-building) ($n=34$) or simulated (Tetris) environment ($n=31$), the expert tries to identify appropriate reward channels preferred by each individual and adapts itself accordingly using an exploration-exploitation strategy. These reward channel preferences can identify important behaviors of the human participants, because they may well exercise the same behaviors in similar situations later. In this way, skill transfer takes place between an expert system and a novice human operator. We divided the subject population into three groups and observed the skill transfer phenomenon, analyzing it with Simpson’s psychometric model. 5-point Likert scales were also used to identify the cognitive models of the human participants. We obtained a shared cognitive model which not only improves human cognition but enhances the robot’s cognitive strategy to understand the mental model of its human partners while building a successful robot-human collaborative framework. Principles of cognitive economy would require that concepts about objects, properties and relations should be introduced only if they simplify the conceptualisation of a domain. Unexpectedly, classic logic conditionals, specifying structures holding within elements of a formal conceptualisation, do not always satisfy this crucial principle. The paper argues that this requirement is captured by \emph{supervenience}, hereby further identified as a property necessary for compression. 
The resulting theory suggests an alternative explanation of the empirical experiences observable in Wason’s selection tasks, associating human performance with conditionals on the ability of dealing with compression, rather than with logic necessity. Improving the accuracy and robustness of deep neural nets (DNNs) and adapting them to small training data are primary tasks in deep learning research. In this paper, we replace the output activation function of DNNs, typically the data-agnostic softmax function, with a graph Laplacian-based high dimensional interpolating function which, in the continuum limit, converges to the solution of a Laplace-Beltrami equation on a high dimensional manifold. Furthermore, we propose end-to-end training and testing algorithms for this new architecture. The proposed DNN with graph interpolating activation integrates the advantages of both deep learning and manifold learning. Compared to the conventional DNNs with the softmax function as output activation, the new framework demonstrates the following major advantages: First, it is better applicable to data-efficient learning in which we train high capacity DNNs without using a large number of training data. Second, it remarkably improves both natural accuracy on the clean images and robust accuracy on the adversarial images crafted by both white-box and black-box adversarial attacks. Third, it is a natural choice for semi-supervised learning. For reproducibility, the code is available at \url{https://…/DNN-DataDependentActivation}. Interpretable Machine Learning (IML) has become increasingly important in many applications, such as autonomous cars and medical diagnosis, where explanations are preferred to help people better understand how machine learning systems work and further enhance their trust towards systems. 
Particularly in robotics, explanations from IML are significantly helpful in providing reasons for those adverse and inscrutable actions, which could impair the safety and profit of the public. However, due to the diversified scenarios and subjective nature of explanations, we rarely have the ground truth for benchmark evaluation in IML on the quality of generated explanations. Having a sense of explanation quality not only matters for quantifying system boundaries, but also helps to realize the true benefits to human users in real-world applications. To benchmark evaluation in IML, in this paper, we rigorously define the problem of evaluating explanations, and systematically review the existing efforts. Specifically, we summarize three general aspects of explanation (i.e., predictability, fidelity and persuasibility) with formal definitions, and respectively review the representative methodologies for each of them under different tasks. Further, a unified evaluation framework is designed according to the hierarchical needs from developers and end-users, which could be easily adopted for different scenarios in practice. In the end, open problems are discussed, and several limitations of current evaluation techniques are raised for future explorations. Knowledge tracing is the task of modeling each student’s mastery of knowledge concepts (KCs) as (s)he engages with a sequence of learning activities. Each student’s knowledge is modeled by estimating the performance of the student on the learning activities. It is an important research area for providing a personalized learning platform to students. In recent years, methods based on Recurrent Neural Networks (RNN) such as Deep Knowledge Tracing (DKT) and Dynamic Key-Value Memory Network (DKVMN) outperformed all the traditional methods because of their ability to capture complex representation of human learning. 
However, these methods face the issue of not generalizing well while dealing with sparse data which is the case with real-world data as students interact with few KCs. In order to address this issue, we develop an approach that identifies the KCs from the student’s past activities that are \textit{relevant} to the given KC and predicts his/her mastery based on the relatively few KCs that it picked. Since predictions are made based on relatively few past activities, it handles the data sparsity problem better than the methods based on RNN. For identifying the relevance between the KCs, we propose a self-attention based approach, Self Attentive Knowledge Tracing (SAKT). Extensive experimentation on a variety of real-world dataset shows that our model outperforms the state-of-the-art models for knowledge tracing, improving AUC by 4.43% on average. Recommender systems are crucial to alleviate the information overload problem in online worlds. Most of the modern recommender systems capture users’ preference towards items via their interactions based on collaborative filtering techniques. In addition to the user-item interactions, social networks can also provide useful information to understand users’ preference as suggested by the social theories such as homophily and influence. Recently, deep neural networks have been utilized for social recommendations, which facilitate both the user-item interactions and the social network information. However, most of these models cannot take full advantage of the social network information. They only use information from direct neighbors, but distant neighbors can also provide helpful information. Meanwhile, most of these models treat neighbors’ information equally without considering the specific recommendations. However, for a specific recommendation case, the information relevant to the specific item would be helpful. 
Besides, most of these models do not explicitly capture the neighbor’s opinions to items for social recommendations, while different opinions could affect the user differently. In this paper, to address the aforementioned challenges, we propose DSCF, a Deep Social Collaborative Filtering framework, which can exploit the social relations with various aspects for recommender systems. Comprehensive experiments on two-real world datasets show the effectiveness of the proposed framework. Recently, neural networks trained as optimizers under the ‘learning to learn’ or meta-learning framework have been shown to be effective for a broad range of optimization tasks including derivative-free black-box function optimization. Recurrent neural networks (RNNs) trained to optimize a diverse set of synthetic non-convex differentiable functions via gradient descent have been effective at optimizing derivative-free black-box functions. In this work, we propose RNN-Opt: an approach for learning RNN-based optimizers for optimizing real-parameter single-objective continuous functions under limited budget constraints. Existing approaches utilize an observed improvement based meta-learning loss function for training such models. We propose training RNN-Opt by using synthetic non-convex functions with known (approximate) optimal values by directly using discounted regret as our meta-learning loss function. We hypothesize that a regret-based loss function mimics typical testing scenarios, and would therefore lead to better optimizers compared to optimizers trained only to propose queries that improve over previous queries. Further, RNN-Opt incorporates simple yet effective enhancements during training and inference procedures to deal with the following practical challenges: i) Unknown range of possible values for the black-box function to be optimized, and ii) Practical and domain-knowledge based constraints on the input parameters. 
We demonstrate the efficacy of RNN-Opt in comparison to existing methods on several synthetic as well as standard benchmark black-box functions along with an anonymized industrial constrained optimization problem. Deep learning techniques have become the method of choice for researchers working on algorithmic aspects of recommender systems. With the strongly increased interest in machine learning in general, it has, as a result, become difficult to keep track of what represents the state-of-the-art at the moment, e.g., for top-n recommendation tasks. At the same time, several recent publications point out problems in today’s research practice in applied machine learning, e.g., in terms of the reproducibility of the results or the choice of the baselines when proposing new models. In this work, we report the results of a systematic analysis of algorithmic proposals for top-n recommendation tasks. Specifically, we considered 18 algorithms that were presented at top-level research conferences in the last years. Only 7 of them could be reproduced with reasonable effort. For these methods, it however turned out that 6 of them can often be outperformed with comparably simple heuristic methods, e.g., based on nearest-neighbor or graph-based techniques. The remaining one clearly outperformed the baselines but did not consistently outperform a well-tuned non-neural linear ranking method. Overall, our work sheds light on a number of potential problems in today’s machine learning scholarship and calls for improved scientific practices in this area. Source code of our experiments and full results are available at: https://…/RecSys2019_DeepLearning_Evaluation. We propose a quantum data fitting algorithm for non-sparse matrices, which is based on the Quantum Singular Value Estimation (QSVE) subroutine and a novel efficient method for recovering the signs of eigenvalues. 
Our algorithm generalizes the quantum data fitting algorithm of Wiebe, Braun, and Lloyd for sparse and well-conditioned matrices by adding a regularization term to avoid the over-fitting problem, which is a very important problem in machine learning. As a result, the algorithm achieves a sparsity-independent runtime of $O(\kappa^2\sqrt{N}\mathrm{polylog}(N)/(\epsilon\log\kappa))$ for an $N\times N$ dimensional Hermitian matrix $\bm{F}$, where $\kappa$ denotes the condition number of $\bm{F}$ and $\epsilon$ is the precision parameter. This amounts to a polynomial speedup on the dimension of matrices when compared with the classical data fitting algorithms, and a strictly less than quadratic dependence on $\kappa$. Numerosity perception is foundational to mathematical learning, but its computational bases are strongly debated. Some investigators argue that humans are endowed with a specialized system supporting numerical representation; others argue that visual numerosity is estimated using continuous magnitudes, such as density or area, which usually co-vary with number. Here we reconcile these contrasting perspectives by testing deep networks on the same numerosity comparison task that was administered to humans, using a stimulus space that allows to measure the contribution of non-numerical features. Our model accurately simulated the psychophysics of numerosity perception and the associated developmental changes: discrimination was driven by numerosity information, but non-numerical features had a significant impact, especially early during development. Representational similarity analysis further highlighted that both numerosity and continuous magnitudes were spontaneously encoded even when no task had to be carried out, demonstrating that numerosity is a major, salient property of our visual environment. 
In a steady-state evolution, tournament selection traditionally uses the fitness function to select the parents, and negative selection chooses an individual to be replaced with an offspring. This contribution focuses on analyzing the behavior, in terms of performance, of different heuristics when used instead of the fitness function in tournament selection. The heuristics analyzed are related to measuring the similarity of the individuals in the semantic space. In addition, the analysis includes random selection and traditional tournament selection. These selection functions were implemented on our Semantic Genetic Programming system, namely EvoDAG, which is inspired by the geometric genetic operators and tested on 30 classification problems with a variable number of samples, variables, and classes. The result indicated that the combination of accuracy and the random selection, in the negative tournament, produces the best combination, and the difference in performances between this combination and the tournament selection is statistically significant. Furthermore, we compare EvoDAG’s performance using the selection heuristics against 18 classifiers that included traditional approaches as well as auto-machine-learning techniques. The results indicate that our proposal is competitive with state-of-art classifiers. Finally, it is worth to mention that EvoDAG is available as open source software. We introduce natural adversarial examples — real-world, unmodified, and naturally occurring examples that cause classifier accuracy to significantly degrade. We curate 7,500 natural adversarial examples and release them in an ImageNet classifier test set that we call ImageNet-A. This dataset serves as a new way to measure classifier robustness. Like l_p adversarial examples, ImageNet-A examples successfully transfer to unseen or black-box classifiers. For example, on ImageNet-A a DenseNet-121 obtains around 2% accuracy, an accuracy drop of approximately 90%. 
Recovering this accuracy is not simple because ImageNet-A examples exploit deep flaws in current classifiers, including their over-reliance on color, texture, and background cues. We observe that popular training techniques for improving robustness have little effect, but we show that some architectural changes can enhance robustness to natural adversarial examples. Future research is required to enable robust generalization to this hard ImageNet test set.

The presumed data owners’ right to explanations brought about by the General Data Protection Regulation in Europe has shed light on the social challenges of explainable artificial intelligence (XAI). In this paper, we present a case study with Deep Learning (DL) experts from a research and development laboratory focused on the delivery of industrial-strength AI technologies. Our aim was to investigate the social meaning (i.e. meaning to others) that DL experts assign to what they do, given a richly contextualized and familiar domain of application. Using qualitative research techniques to collect and analyze empirical data, our study has shown that participating DL experts did not spontaneously engage in considerations about the social meaning of the machine learning models that they build. Moreover, when explicitly stimulated to do so, these experts expressed expectations that, with real-world DL application, there will be available mediators to bridge the gap between technical meanings that drive DL work, and social meanings that AI technology users assign to it. We concluded that current research incentives and values guiding the participants’ scientific interests and conduct are at odds with those required to face some of the scientific challenges involved in advancing XAI, and thus responding to the alleged data owners’ right to explanations or similar societal demands emerging from current debates.
As a concrete contribution to mitigate what seems to be a more general problem, we propose three preliminary XAI Mediation Challenges with the potential to bring together technical and social meanings of DL applications, as well as to foster much needed interdisciplinary collaboration among AI and Social Sciences researchers. Continue Reading…

### Document worth reading: “Introduction to Multi-Armed Bandits”

Multi-armed bandits are a simple but very powerful framework for algorithms that make decisions over time under uncertainty. An enormous body of work has accumulated over the years, covered in several books and surveys. This book provides a more introductory, textbook-like treatment of the subject. Each chapter tackles a particular line of work, providing a self-contained, teachable technical introduction and a review of the more advanced results. The chapters are as follows: Stochastic bandits; Lower bounds; Bayesian Bandits and Thompson Sampling; Lipschitz Bandits; Full Feedback and Adversarial Costs; Adversarial Bandits; Linear Costs and Semi-bandits; Contextual Bandits; Bandits and Zero-Sum Games; Bandits with Knapsacks; Incentivized Exploration and Connections to Mechanism Design. Status of the manuscript: essentially complete (modulo some polishing), except for the last chapter, which the author plans to add over the next few months. Introduction to Multi-Armed Bandits Continue Reading…

### Excel Report Generation with Shiny

(This article was first published on Posts on Tychobra, and kindly contributed to R-bloggers) R is great for report generation. Shiny allows us to easily create web apps that generate a variety of reports with R. This post details a demo Shiny app that generates an Excel report, a PowerPoint report, and a PDF report: The full Shiny app source code is available here. Also, we included a more basic Shiny app that generates an Excel report at the end of this post.
Follow-up posts will include similar simple Shiny apps generating PowerPoint and PDF reports. Here are some screenshots of the Shiny app that generates the reports. You click the button in the left sidebar to select the report type (Excel, PowerPoint, or PDF), and then you click the report type button to generate the report. Simple as that! In this app, the generated reports rely on the user-selected “Valuation Date” Shiny input (in the left sidebar). Your report-generating Shiny app would, of course, include your data and the Shiny inputs necessary for your custom report.

### Excel Report

The generated Excel workbook has 3 sheets: The “Cover Page” sheet includes an image and some custom text and cell styling. The other 2 pages include tables with multi-part headers, totals rows, and some custom styling. Follow the link below to download a copy of the actual generated Excel file: https://github.com/Tychobra/shiny-insurance-examples/blob/master/basic-insurer-dashboard/example%20reports/claims-report-as-of-2019-06-20.xlsx The process of creating and customizing the Excel workbook is handled by the openxlsx R package. We like openxlsx because it does not require Java (several of the other R packages for working with Excel depend on Java), and it provides functions to highly customize the Excel workbook. The following is a simple Shiny app that generates an Excel workbook. You can copy and paste this simple app into your R console to run it. Enjoy!

library(shiny)
library(openxlsx)

# create some example data to download
my_table <- data.frame(
  Name = letters[1:4],
  Age = seq(20, 26, 2),
  Occupation = LETTERS[15:18],
  Income = c(50000, 20000, 30000, 45000)
)

# add a totals row
my_table <- rbind(
  my_table,
  data.frame(
    Name = "Total",
    Age = NA_integer_,
    Occupation = "",
    Income = sum(my_table$Income)
  )
)

# minimal Shiny UI
ui <- fluidPage(
  fluidRow(
    column(
      width = 12,
      align = "center",
      tableOutput("table_out"),
      br(),
      downloadButton("download_excel", "Download as Excel")
    )
  )
)

# minimal Shiny server
server <- function(input, output) {

  output$table_out <- renderTable({
    my_table
  })

  output$download_excel <- downloadHandler(
    filename = function() {
      "employee_data.xlsx"
    },
    content = function(file) {
      my_workbook <- createWorkbook()

      # add a worksheet to hold the data
      addWorksheet(
        wb = my_workbook,
        sheetName = "Employee Data"
      )

      setColWidths(
        my_workbook,
        1,
        cols = 1:4,
        widths = c(6, 6, 10, 10)
      )

      # write a two-line title
      writeData(
        my_workbook,
        sheet = 1,
        c(
          "Company Name",
          "Employee Data"
        ),
        startRow = 1,
        startCol = 1
      )

      # style the title
      addStyle(
        my_workbook,
        sheet = 1,
        style = createStyle(
          fontSize = 24,
          textDecoration = "bold"
        ),
        rows = 1:2,
        cols = 1
      )

      # write the data table below the title
      writeData(
        my_workbook,
        sheet = 1,
        my_table,
        startRow = 5,
        startCol = 1
      )

      # style the table header row
      addStyle(
        my_workbook,
        sheet = 1,
        style = createStyle(
          fgFill = "#1a5bc4",
          halign = "center",
          fontColour = "#ffffff"
        ),
        rows = 5,
        cols = 1:4,
        gridExpand = TRUE
      )

      # style the table body
      addStyle(
        my_workbook,
        sheet = 1,
        style = createStyle(
          fgFill = "#7dafff",
          numFmt = "comma"
        ),
        rows = 6:10,
        cols = 1:4,
        gridExpand = TRUE
      )

      saveWorkbook(my_workbook, file)
    }
  )
}

shinyApp(ui, server)

The above app generates this neat little Excel workbook:


### Is Congress rigged in favour of the rich?

A new study finds that partisanship matters more than the influence of the wealthy

## July 21, 2019

### Distilled News

(Please note that this post is for my own educational purposes.)
Many teams try to start an applied AI project by diving into algorithms and data before figuring out desired outputs and objectives. Unfortunately, that’s like raising a puppy in a New York City apartment for a few years, then being surprised that it can’t herd sheep for you.
Many companies have adopted a ‘data-driven’ approach for operational decision-making. Data can improve decisions, but it requires the right processor to get the most from it. Many people assume that processor is human. The term ‘data-driven’ even implies that data is curated by – and summarized for – people to process. But to fully leverage the value contained in data, companies need to bring artificial intelligence (AI) into their workflows and, sometimes, get us humans out of the way. We need to evolve from data-driven to AI-driven workflows. Distinguishing between ‘data-driven’ and ‘AI-driven’ isn’t just semantics. Each term reflects a different asset: the former focuses on data, the latter on processing ability. Data holds the insights that can enable better decisions; processing is the way to extract those insights and take actions. Humans and AI are both processors, with very different abilities. To understand how best to leverage each, it’s helpful to review our own biological evolution and how decision-making has evolved in industry. Just fifty to seventy-five years ago, human judgment was the central processor of business decision-making. Professionals relied on their highly tuned intuitions, developed from years of experience (and a relatively tiny bit of data) in their domain, to, say, pick the right creative for an ad campaign, determine the right inventory levels to stock, or approve the right financial investments. Experience and gut instinct were most of what was available to discern good from bad, high from low, and risky from safe.
Training deep neural networks to achieve the best performance is a challenging task. In this post, I will explore the most common problems and their solutions. These problems include taking too long to train, vanishing and exploding gradients, and initialization; together these are known as optimization problems. Another category of issues that arises while training the network is regularization problems, which I discussed in my previous post. If you haven’t already read it, you can read it by clicking the link below.
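The vanishing-gradient problem mentioned above can be illustrated in a few lines of pure Python (this is our own illustrative sketch, not the post's code): each sigmoid layer contributes a local derivative of at most 0.25, so the gradient that backpropagation multiplies together shrinks geometrically with depth.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, attained at x = 0

# Backpropagating through n sigmoid layers multiplies n local gradients
# together; even in the most favorable case (every pre-activation at 0)
# the product decays as 0.25 ** n.
for n in (1, 5, 10, 20):
    grad = 1.0
    for _ in range(n):
        grad *= sigmoid_grad(0.0)
    print(n, grad)
```

This is why deep sigmoid networks train so slowly without careful initialization, and why ReLU-style activations (whose derivative is 1 on the active side) help.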
I would like to begin by asking the following question: ‘Can we trust the model predictions just because the model performance is convincingly high on the test data?’ Many people might answer this question as ‘Yes’. But this is not always true. High model performance should not be considered an indicator to trust the model predictions, as the signals being picked up by the model can be random and might not make business sense.
I remember when I was having my first overseas internship at CERN as a summer student, most people were still talking about the discovery of Higgs boson upon confirming that it met the ‘five sigma’ threshold (which means having p-value of 0.0000003). Back then I knew nothing about p-value, hypothesis testing or even statistical significance. And you’re right. I went to google the word – p-value, and what I found on Wikipedia made me even more confused… In statistical hypothesis testing, the p-value or probability value is, for a given statistical model, the probability that, when the null hypothesis is true, the statistical summary (such as the absolute value of the sample mean difference between two compared groups) would be greater than or equal to the actual observed results.
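The Wikipedia definition becomes much more concrete with a small worked example (our own illustration, standard library only): the one-sided p-value for a coin-flip experiment is just the probability, under the null hypothesis of a fair coin, of seeing a result at least as extreme as the one observed.

```python
from math import comb

def binomial_p_value(n, k, p=0.5):
    """One-sided p-value: probability of observing k or more successes
    in n trials if the null hypothesis (success probability p) holds."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 60 heads in 100 flips of a supposedly fair coin:
pv = binomial_p_value(100, 60)
print(round(pv, 4))  # ≈ 0.028, well short of the 'five sigma' threshold
```

A p-value of about 0.03 would clear the conventional 0.05 threshold but is nowhere near the 0.0000003 demanded for the Higgs discovery.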
In this post, we will discuss how to implement different combinations of non-linear activation functions and weight initialization methods in python. Also, we will analyze how the choice of activation function and weight initialization method will have an effect on accuracy and the rate at which we reduce our loss in a deep neural network using a non-linearly separable toy data set. This is a follow-up post to my previous post on activation functions and weight initialization methods. Note: This article assumes that the reader has a basic understanding of Neural Network, weights, biases, and backpropagation. If you want to learn the basics of the feed-forward neural network, check out my previous article (Link at the end of this article).
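As a pure-Python sketch of the kind of combinations the post studies (the post's own code may differ; these are the standard Xavier/Glorot and He schemes paired with sigmoid and ReLU activations):

```python
import math
import random

def xavier_init(fan_in, fan_out, rng=random):
    """Glorot/Xavier uniform init: keeps activation variance stable
    for sigmoid/tanh layers."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

def he_init(fan_in, fan_out, rng=random):
    """He init: Gaussian scaled for ReLU activations."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

W = xavier_init(64, 32)
print(len(W), len(W[0]))  # 64 32
```

The usual pairing is Xavier with sigmoid/tanh and He with ReLU, precisely because each scheme matches the variance behavior of its activation.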
Computers are great at working with structured data like spreadsheets and database tables. But we humans usually communicate in words, not in tables. That’s unfortunate for computers.
I remember the first time I saw a computer: it was a Power Macintosh 5260 (with Monkey Island on it). I was around 5 years old and I looked at it as if it belonged to another universe. It did; I was not allowed to get anywhere close to it within a 5 mile radius, because it was my older brother’s! That did not stop me. I browsed it for hours. The possibilities of computers were infinite, and, fuelled by the inspiration of sci-fi worlds, the dream of talking machines – machines that can assist humans, think for themselves, and even have feelings – never stopped. I kept dreaming about the possibilities of the future.
1. Enter ‘Generation Kaggle’
2. Neural Networks are the cure to everything
3. Machine Learning is the Product
4. Confuse Causation with Correlation
5. Optimize the wrong metrics
The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source Unity plugin that enables games and simulations to serve as environments for training intelligent agents. Agents can be trained using reinforcement learning, imitation learning, neuroevolution, or other machine learning methods through a simple-to-use Python API. We also provide implementations (based on TensorFlow) of state-of-the-art algorithms to enable game developers and hobbyists to easily train intelligent agents for 2D, 3D and VR/AR games. These trained agents can be used for multiple purposes, including controlling NPC behavior (in a variety of settings such as multi-agent and adversarial), automated testing of game builds and evaluating different game design decisions pre-release. The ML-Agents toolkit is mutually beneficial for both game developers and AI researchers as it provides a central platform where advances in AI can be evaluated on Unity’s rich environments and then made accessible to the wider research and game developer communities.
Hello everyone. I’m so excited to be here at Compose, along with so many enthusiastic and some very advanced functional programmers. I live a dual life. By day, I teach computers how to think about code more deeply and, at night, I teach people how to think about code more deeply. So, this is the talk I’ve been really excited about for the last year; this is hands down the coolest thing I learned in the year of 2018. I was just reading this paper about programming language semantics and it was like, ‘Oh, these two things look completely different! Here’s how they’re the same, you do this.’ I was like: wait, what was that? What? It explains so many changes that I see people like myself already make. This got all of one slide in the web course I teach, but now I’ll get a chance to really explain why it’s so cool. You are all here to learn.
Are you fascinated by the amount of text data available on the internet? Are you looking for ways to work with this text data but aren’t sure where to begin? Machines, after all, recognize numbers, not the letters of our language. And that can be a tricky landscape to navigate in machine learning.
1. Tokenization using Python’s split() function
2. Tokenization using Regular Expressions (RegEx)
3. Tokenization using NLTK
4. Tokenization using the spaCy library
5. Tokenization using Keras
6. Tokenization using Gensim
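The first two approaches in the list above need nothing beyond Python's standard library (NLTK, spaCy, Keras, and Gensim require third-party installs); a minimal sketch, with an example sentence of our own:

```python
import re

text = "Machines, after all, recognize numbers, not letters."

# 1. Whitespace tokenization with str.split() -- punctuation sticks
#    to the adjacent word.
tokens_split = text.split()

# 2. Regular-expression tokenization -- \w+ keeps word characters
#    only, dropping punctuation.
tokens_re = re.findall(r"\w+", text)

print(tokens_split)
print(tokens_re)
```

The difference is already visible on this one sentence: `split()` yields `"Machines,"` with its comma attached, while the regex yields the clean token `"Machines"`. The library-based tokenizers in the rest of the list add language-aware rules on top (handling contractions, hyphens, and so on).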
1. Machine Learning Use Cases in Smartphones
• Voice Assistants
• Smartphone Cameras
• App Store and Play Store Recommendations
• Face Unlock – Smartphones
2. Machine Learning Use Cases in Transportation
• Dynamic Pricing in Travel
• Transportation and Commuting – Uber
3. Machine Learning Use Cases in Popular Web Services
• Email filtering
4. Machine Learning Use Cases in Sales and Marketing
• Recommendation Engines
• Personalized Marketing
• Customer Support Queries (and Chatbots)
6. Machine Learning Use Cases in Security
• Video Surveillance
7. Machine Learning Use Cases in the Financial Domain
• Catching Fraud in Banking
• Personalized Banking
8. Other Popular Machine Learning Use Cases
• Self-Driving Cars
Ed Lorenz was a genius at coming up with simple models that capture the essence of a problem in a much more complex system. His famous butterfly model from 1963 jump-started chaos research, followed by more sophisticated models to describe upscale error growth (1969) and the general circulation of the atmosphere (1984). In 1995, he created another chaotic model that shall be the topic of this blog post. Confusingly, even though the original paper appeared in 1995, most people refer to the model as the Lorenz 96 (L96) model, which we will also do here.
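The L96 model is defined by the coupled ODEs dx_i/dt = (x_{i+1} − x_{i−2}) x_{i−1} − x_i + F on a ring of N variables, with the standard chaotic setup being N = 40 and forcing F = 8. A self-contained sketch integrating it with classical RK4 (our own minimal implementation):

```python
def l96_tendency(x, F=8.0):
    """dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, cyclic indices."""
    n = len(x)
    return [(x[(i + 1) % n] - x[(i - 2) % n]) * x[(i - 1) % n] - x[i] + F
            for i in range(n)]

def rk4_step(x, dt=0.05, F=8.0):
    """One classical Runge-Kutta 4 step."""
    k1 = l96_tendency(x, F)
    k2 = l96_tendency([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)], F)
    k3 = l96_tendency([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)], F)
    k4 = l96_tendency([xi + dt * ki for xi, ki in zip(x, k3)], F)
    return [xi + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

# Start at the unstable equilibrium x_i = F and perturb one variable;
# chaos does the rest.
x = [8.0] * 40
x[0] += 0.01
for _ in range(200):  # integrate to t = 10
    x = rk4_step(x)
print(min(x), max(x))
```

Note that the uniform state x_i = F is an exact (unstable) equilibrium, which is why the tiny perturbation is needed to kick off the chaotic behavior.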
We’re not far from the day when artificial intelligence will provide us with a paintbrush for reality. As the foundations we’ve relied upon lose their integrity, many people find themselves afraid of what’s to come. But we’ve always lived in a world where our senses misrepresent reality. New technologies will help us get closer to the truth by showing us where we can’t find it. From a historical viewpoint, we’ve never successfully stopped the progression of any technology and owe the level of safety and security we enjoy to that ongoing progression. While normal accidents do occur and the downsides of progress likely won’t ever cease to exist, we make the problem worse when trying to fight the inevitable. Besides, reality has never been as clear and accurate as we want to believe. We fight against new technology because we believe it creates uncertainty when, more accurately, it only shines a light on the uncertainty that’s always existed and we’ve preferred to ignore.

### Let’s get it right

At the end of June, Motherboard reported on a new app called DeepNude, which promised – ‘with a single click’ – to transform a clothed photo of any woman into a convincing nude image using machine learning. In the weeks since this report, the app has been pulled by its creator and removed from GitHub, though open source copies have surfaced there in recent days. Most of the coverage of DeepNude has focused on the specific dangers posed by its technical advances. ‘DeepNude is an evolution of that technology that is easier to use and faster to create than deepfakes,’ wrote Samantha Cole in Motherboard’s initial report on the app. ‘DeepNude also dispenses with the idea that this technology can be used for anything other than claiming ownership over women’s bodies.’ With its promise of single-click undressing of any woman, it made it easier than ever to manufacture naked photos – and, by extension, to use those fake nudes to harass, extort, and publicly shame women everywhere. But even following the app’s removal, there’s a lingering problem with DeepNude that goes beyond its technical advances and ease of use. It’s something older and deeper, something far more intractable – and far harder to erase from the internet – than a piece of open source code.
We recount in this essay the decade-long story of Gram Vaani, a social enterprise with a vision to build appropriate ICTs (Information and Communication Technologies) for participatory media in rural and low-income settings, to bring about social development and community empowerment. Other social enterprises will relate to the learning gained and the strategic pivots that Gram Vaani had to undertake to survive and deliver on its mission, while searching for a robust financial sustainability model. While we believe the ideal model still remains elusive, we conclude this essay with an open question about the reason to differentiate between different kinds of enterprises – commercial or social, for-profit or not-for-profit – and argue that all enterprises should have an ethical underpinning to their work.
With a view towards understanding why undesirable outcomes often arise in ICT projects, we draw attention to three aspects in this essay. First, we present several examples to show that incorporating an ethical framework in the design of an ICT system is not sufficient in itself, and that ethics need to guide the deployment and ongoing management of the projects as well. We present a framework that brings together the objectives, design, and deployment management of ICT projects as being shaped by a common underlying ethical system. Second, we argue that power-based equality should be incorporated as a key underlying ethical value in ICT projects, to ensure that the project does not reinforce inequalities in power relationships between the actors directly or indirectly associated with the project. We present a method to model ICT projects to make legible its influence on the power relationships between various actors in the ecosystem. Third, we discuss that the ethical values underlying any ICT project ultimately need to be upheld by the project teams, where certain factors like political ideologies or dispersed teams may affect the rigour with which these ethical values are followed. These three aspects of having an ethical underpinning to the design and management of ICT projects, the need for having a power-based equality principle for ICT projects, and the importance of socialization of the project teams, needs increasing attention in today’s age of ICT platforms where millions and billions of users interact on the same platform but which are managed by only a few people.
Introduction: To improve current public health strategies in suicide prevention and mental health, governments, researchers and private companies increasingly use information and communication technologies, and more specifically Artificial Intelligence and Big Data. These technologies are promising but raise ethical challenges rarely covered by current legal systems. It is essential to better identify and prevent potential ethical risks. Objectives: The Canada Protocol – MHSP is a tool to guide and support professionals, users, and researchers using AI in mental health and suicide prevention. Methods: A checklist was constructed based upon ten international reports on AI and ethics and two guides on mental health and new technologies. 329 recommendations were identified, of which 43 were considered as applicable to Mental Health and AI. The checklist was validated using a two-round Delphi Consultation. Results: 16 experts participated in the first round of the Delphi Consultation and 8 participated in the second round. Of the original 43 items, 38 were retained. They concern five categories: ‘Description of the Autonomous Intelligent System’ (n=8), ‘Privacy and Transparency’ (n=8), ‘Security’ (n=6), ‘Health-Related Risks’ (n=8), ‘Biases’ (n=8). The checklist was considered relevant by most users, though it may need versions tailored to each category of target users.
Online participatory media platforms that enable one-to-many communication among users see a significant amount of user generated content and consequently face the problem of recommending a subset of this content to their users. We address the problem of recommending and ranking this content such that different viewpoints about a topic get exposure in a fair and diverse manner. We build our model in the context of a voice-based participatory media platform running in rural central India, for low-income and less-literate communities, that plays audio messages in a ranked list to users over a phone call and allows them to contribute their own messages. In this paper, we describe our model and evaluate it using call-logs from the platform, to compare the fairness and diversity performance of our model with the manual editorial processes currently being followed. Our models are generic and can be adapted and applied to other participatory media platforms as well.
The ethical implications and social impacts of artificial intelligence have become topics of compelling interest to industry, researchers in academia, and the public. However, current analyses of AI in a global context are biased toward perspectives held in the U.S., and limited by a lack of research, especially outside the U.S. and Western Europe. This article summarizes the key findings of a literature review of recent social science scholarship on the social impacts of AI and related technologies in five global regions. Our team of social science researchers reviewed more than 800 academic journal articles and monographs in over a dozen languages. Our review of the literature suggests that AI is likely to have markedly different social impacts depending on geographical setting. Likewise, perceptions and understandings of AI are likely to be profoundly shaped by local cultural and social context. Recent research in U.S. settings demonstrates that AI-driven technologies have a pattern of entrenching social divides and exacerbating social inequality, particularly among historically-marginalized groups. Our literature review indicates that this pattern exists on a global scale, and suggests that low- and middle-income countries may be more vulnerable to the negative social impacts of AI and less likely to benefit from the attendant gains. We call for rigorous ethnographic research to better understand the social impacts of AI around the world. Global, on-the-ground research is particularly critical to identify AI systems that may amplify social inequality in order to mitigate potential harms. Deeper understanding of the social impacts of AI in diverse social settings is a necessary precursor to the development, implementation, and monitoring of responsible and beneficial AI technologies, and forms the basis for meaningful regulation of these technologies.
Failure to account for human values in software (e.g., equality and fairness) can result in user dissatisfaction and negative socio-economic impact. Engineering these values in software, however, requires technical and methodological support throughout the development life cycle. This paper investigates to what extent software engineering (SE) research has considered human values. We investigate the prevalence of human values in recent (2015 – 2018) publications at some of the top-tier SE conferences and journals. We classify SE publications, based on their relevance to different values, against a widely used value structure adopted from social sciences. Our results show that: (a) only a small proportion of the publications directly consider values, classified as relevant publications; (b) for the majority of the values, very few or no relevant publications were found; and (c) the prevalence of the relevant publications was higher in SE conferences compared to SE journals. This paper shares these and other insights that motivate research on human values in software engineering.

### Finding out why

We develop tools for utilizing correspondence experiments to detect illegal discrimination by individual employers. Employers violate US employment law if their propensity to contact applicants depends on protected characteristics such as race or sex. We establish identification of higher moments of the causal effects of protected characteristics on callback rates as a function of the number of fictitious applications sent to each job ad. These moments are used to bound the fraction of jobs that illegally discriminate. Applying our results to three experimental datasets, we find evidence of significant employer heterogeneity in discriminatory behavior, with the standard deviation of gaps in job-specific callback probabilities across protected groups averaging roughly twice the mean gap. In a recent experiment manipulating racially distinctive names, we estimate that at least 85% of jobs that contact both of two white applications and neither of two black applications are engaged in illegal discrimination. To assess the tradeoff between type I and II errors presented by these patterns, we consider the performance of a series of decision rules for investigating suspicious callback behavior under a simple two-type model that rationalizes the experimental data. Though, in our preferred specification, only 17% of employers are estimated to discriminate on the basis of race, we find that an experiment sending 10 applications to each job would enable accurate detection of 7-10% of discriminators while falsely accusing fewer than 0.2% of non-discriminators. A minimax decision rule acknowledging partial identification of the joint distribution of callback rates yields higher error rates but more investigations than our baseline two-type model. Our results suggest illegal labor market discrimination can be reliably monitored with relatively small modifications to existing audit designs.
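The audit logic described above can be sketched with a toy Monte Carlo simulation. All numbers below (fraction of discriminating jobs, callback rates) are our own illustrative assumptions, not the paper's estimates; the point is only to show how sending two white and two black fictitious applications per job surfaces "suspicious" callback patterns.

```python
import random

random.seed(0)

def simulate(n_jobs=100_000, frac_discriminating=0.17,
             p_base=0.10, p_white_if_disc=0.20, p_black_if_disc=0.02):
    """Fraction of jobs contacting both white applicants and neither
    black applicant, under a hypothetical two-type employer model."""
    suspicious = 0
    for _ in range(n_jobs):
        disc = random.random() < frac_discriminating
        pw = p_white_if_disc if disc else p_base
        pb = p_black_if_disc if disc else p_base
        white_calls = sum(random.random() < pw for _ in range(2))
        black_calls = sum(random.random() < pb for _ in range(2))
        if white_calls == 2 and black_calls == 0:
            suspicious += 1
    return suspicious / n_jobs

print(simulate())
```

Even with no discrimination at all, some jobs contact both white applicants and neither black applicant by chance (probability p²(1−p)² per job), which is exactly why the paper needs decision rules that trade off type I and type II errors rather than treating every suspicious pattern as proof.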
We derive the functional form of mutual information (MI) from a set of design criteria and a principle of maximal sufficiency. The MI between two sets of propositions is a global quantifier of correlations and is implemented as a tool for ranking joint probability distributions with respect to said correlations. The derivation parallels the derivations of relative entropy with an emphasis on the behavior of independent variables. By constraining the functional $I$ according to special cases, we arrive at its general functional form and hence establish a clear meaning behind its definition. We also discuss the notion of sufficiency and offer a new definition which broadens its applicability.
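For discrete variables, the functional the paper derives reduces to the familiar sum I(X;Y) = Σ p(x,y) log[p(x,y)/(p(x)p(y))], which vanishes exactly when the variables are independent. A minimal standard-library sketch (our own example distributions):

```python
from math import log

def mutual_information(joint):
    """I(X;Y) in bits for a discrete joint distribution given as a
    dict {(x, y): p(x, y)}. Marginals are accumulated on the fly."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log(p / (px[x] * py[y]), 2)
               for (x, y), p in joint.items() if p > 0)

# Independent fair bits -> MI = 0
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# Perfectly correlated fair bits -> MI = 1 bit
corr = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(indep), mutual_information(corr))
```

The two printed values, 0 and 1 bit, are the extremes the paper's ranking interpretation rests on: MI orders joint distributions from uncorrelated to maximally correlated.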
Causal mediation approaches have been primarily developed for the goal of ‘explanation’, that is, to understand the pathways that lead from a cause to its effect. A related goal is to evaluate the impact of interventions on mediators, for example in epidemiological studies seeking to inform policies to improve outcomes for sick or disadvantaged populations by targeting intermediate processes. While there has been some methodological work on evaluating mediator interventions, no proposal explicitly defines the target estimands in terms of a ‘target trial’: the hypothetical randomized controlled trial that one might seek to emulate. In this paper, we define so-called interventional effects in terms of a target trial evaluating a number of population-level mediator interventions in the context of multiple interdependent mediators and real-world constraints of policy implementation such as limited resources, with extension to the evaluation of sequential interventions. We describe the assumptions required to identify these novel effects from observational data and a g-computation estimation method. This work was motivated by an investigation into alternative strategies for improving the psychosocial outcomes of adolescent self-harmers, based on data from the Victorian Adolescent Health Cohort Study. We use this example to show how our approach can be used to inform the prioritization of alternative courses of action. Our proposal opens up avenues for the definition and estimation of mediation effects that are policy-relevant, providing a valuable tool for building an evidence base on which to justify future time and financial investments in the development and evaluation of interventions.
How can we understand classification decisions made by deep neural nets? We propose answering this question with ideas from causal inference. We define the “Causal Concept Effect” (CaCE) as the causal effect that the presence or absence of a concept has on the prediction of a given deep neural net, and we use this measure as a means to understand what drives the network’s prediction and what does not. Many existing interpretability methods rely solely on correlations, resulting in potentially misleading explanations; we show how CaCE can avoid such mistakes. In high-risk domains such as medicine, knowing the root cause of a prediction is crucial: if we knew that the network’s prediction was caused by arbitrary concepts, such as the lighting conditions in an X-ray room, instead of medically meaningful concepts, this would prevent the disastrous deployment of such models. Estimating CaCE is difficult in situations where we cannot easily simulate the do-operator. As a simple solution, we propose learning a generative model, specifically a Variational AutoEncoder (VAE), on image pixels or on image embeddings extracted from the classifier, to measure VAE-CaCE. We show that VAE-CaCE correctly estimates the true causal effect, compared to other baselines, in controlled settings with synthetic and semi-natural high-dimensional images.
This tutorial covers and contrasts the two main methodologies in unbiased Learning to Rank (LTR): Counterfactual LTR and Online LTR. There has long been an interest in LTR from user interactions; however, this form of implicit feedback is very biased. In recent years, unbiased LTR methods have been introduced to remove the effect of different types of bias caused by user behavior in search. For instance, a well-addressed type of bias is position bias: the rank at which a document is displayed heavily affects the interactions it receives. Counterfactual LTR methods deal with such types of bias by learning from historical interactions while correcting for the effect of the explicitly modelled biases. Online LTR does not use an explicit user model; in contrast, it learns through an interactive process where randomized results are displayed to the user. Through randomization the effect of different types of bias can be removed from the learning process. Though both methodologies lead to unbiased LTR, their approaches differ considerably, as do their theoretical guarantees, empirical results, effects on the user experience during learning, and applicability. Consequently, for practitioners the choice between the two is very consequential. By providing an overview of both approaches and contrasting them, we aim to provide an essential guide to unbiased LTR so as to aid in understanding and choosing between methodologies.
In this essay I discuss potential outcome and graphical approaches to causality, and their relevance for empirical work in economics. I review some of the work on directed acyclic graphs, including the recent ‘The Book of Why,’ by Pearl and MacKenzie. I also discuss the potential outcome framework developed by Rubin and coauthors, building on work by Neyman. I then discuss the relative merits of these approaches for empirical work in economics, focusing on the questions each answers well, and why much of the work in economics is closer in spirit to the potential outcome framework.

### R Packages worth a look

Global Envelopes (GET)
Implementation of global envelopes with intrinsic graphical interpretation which can be used for graphical Monte Carlo and permutation tests where the …

Interactive Document for Working with Variance Analysis (VTShiny)
An interactive document on the topic of variance analysis using ‘rmarkdown’ and ‘shiny’ packages. Runtime examples are provided in the package function …

Multistage Allocation (R2BEAT)
Multivariate optimal allocation for different domains in one and two stages stratified sample design. R2BEAT extends the Neyman (1934) <doi:10.2307/ …

Collinearity Detection in a Multiple Linear Regression Model (multiColl)
The detection of worrying approximate collinearity in a multiple linear regression model is a problem addressed in all existing statistical packages. H …

### Improve GRNN by Weighting

In the post (https://statcompute.wordpress.com/2019/07/14/yet-another-r-package-for-general-regression-neural-network), several advantages of the General Regression Neural Network (GRNN) were discussed. However, as pointed out by Specht, a major weakness of GRNN is the high computational cost required to generate predicted values for a new input matrix, due to its unique network structure in which the number of neurons equals the number of training samples.

For practical purposes, however, there is no need to assign a neuron to each training sample, given the data duplication in real-world model development samples. Instead, a weighting scheme can be employed to reflect the frequency count of each unique training sample. A major benefit of the weight assignment is improved efficiency in calculating predicted values, with the gain depending on the extent of data duplication. More attractively, the weighting opens up the possibility of using clustering or binning techniques to preprocess the training data so as to overcome the aforementioned weakness to a large degree.
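To make the equivalence concrete, here is a minimal Python sketch of the GRNN (kernel-regression) estimator with an optional per-sample weight; the function name and the toy data are made up for illustration and are not the GRnnet API:

```python
import numpy as np

def grnn_predict(x, X, y, sigma=0.5, w=None):
    """GRNN (Nadaraya-Watson kernel regression) prediction at a query x.
    w is an optional per-sample weight, e.g. the frequency count of a
    unique training sample. (Hypothetical helper, not the GRnnet API.)"""
    if w is None:
        w = np.ones(len(X))
    d2 = np.sum((X - x) ** 2, axis=1)        # squared distance to each neuron
    k = w * np.exp(-d2 / (2 * sigma ** 2))   # weighted Gaussian kernel
    return np.sum(k * y) / np.sum(k)

# A training set with duplicates vs. its aggregated, weighted version
X  = np.array([[0.0], [0.0], [1.0], [1.0], [1.0], [2.0]])
y  = np.array([1.0, 1.0, 3.0, 3.0, 3.0, 5.0])
Xu = np.array([[0.0], [1.0], [2.0]])
yu = np.array([1.0, 3.0, 5.0])
wu = np.array([2.0, 3.0, 1.0])               # frequency of each unique sample

p_full     = grnn_predict(np.array([0.7]), X, y)
p_weighted = grnn_predict(np.array([0.7]), Xu, yu, w=wu)
# The two predictions agree because the kernel sums factor through the counts
```

Since the kernel sums factor through the duplicate counts, the weighted network with one neuron per unique sample reproduces the full network's predictions exactly.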

Below is a demonstration showing the efficiency gain by using the weighting scheme in GRNN.

1. First, I constructed a sample data set with duplicates by doubling the size of the original Boston dataset, and trained a GRNN named “N1” on the constructed data.
2. Second, I generated another sample data set by aggregating the constructed data into unique samples and calculating the weight of each unique data point from its frequency, and trained another GRNN named “N2” on the aggregated data.

As shown in the output, the predicted vectors from “N1” and “N2” are identical. However, the computing time is cut roughly in half by applying the weighting. All R functions used in the example can be found at https://github.com/statcompute/GRnnet/blob/master/code/grnnet.R.

For people interested in a SAS implementation of GRNN, two SAS macros are also available at https://github.com/statcompute/GRnnet/blob/master/code/grnn_learn.SAS and https://github.com/statcompute/GRnnet/blob/master/code/grnn_pred.SAS.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Reversi in R – Part 1: Bare Bones

(This article was first published on Statistics et al., and kindly contributed to R-bloggers)

In this post, I showcase a bare-bones point-and-click implementation of the classic board game Reversi (also called Othello*) in the R programming language. R is typically used for more serious, statistical endeavors, but it works reasonably well for more playful projects. Building a classic game like this is an excellent high-school-level introduction to programming, as well as a good basis for building and testing game AI.

If you want to skip ahead and just play Reversi in R right away, download this .R file:
open it in either base R or RStudio, set the working directory to the folder you downloaded it to, and run the file with source("Reversi Functions.r"). This will print some basic play instructions for you.
The starting configuration shown here is just one of many possible configurations that the code can handle. The program allows for many variants in play, including different board dimensions, ‘walls’ that block capture, and three or more players. These will be shown in a later post.

When running, the active player can click on any of the legal spaces, marked with red circles, to place a stone and capture any enemy pieces sandwiched between the newly placed piece and already-placed pieces.
The board after the first couple of moves by “B”lack and “W”hite is shown in the next two figures.

The game continues until neither player has a legal move, which usually, but not always, occurs when the board is filled or when only one player’s pieces remain.
Before we look at the user-defined functions for this game, let’s review a few important functions in base R.

paste(x, collapse="")  # Takes a vector of strings and collapses it into a single string, with nothing between the original strings.

strsplit(x, "")[[1]] # Takes a single string and splits it into a vector of one-character strings.

plot.new(); plot.window(xlim=..., ylim=...) # Clears the existing plot, if any, and creates a new one with the given x and y limits.

locator(1) # Takes the first mouse click on a plot and extracts its x and y coordinates.
The paste and strsplit functions are useful for passing information about the stones around in a convenient way. String manipulation is slow, so it’s not the most computationally efficient approach, but it’s also not the biggest task. If I had to run thousands of games quickly, say, for machine learning purposes, I would use a more sophisticated method here.
The locator function is what makes it possible to play this game with a mouse instead of a keyboard, which makes the whole thing much more enjoyable to play and test.
The game itself is broken down into as many user-defined functions as is convenient; in other words, the programming is modular. Modularity makes it easier to modify the program and prevents errors that can arise from having to change the same code in multiple locations. Furthermore, with good function names, the whole process is easier for a human to read.
We start with setup.board, which establishes the board dimensions, starting pieces, number of players, and any walls or gaps. For this demo, let’s stick with the classic Othello setup.
setup.board = function(style = "Basic Othello")
{
  if(style == "Basic Othello")
  {
    board = matrix(".", nrow=8, ncol=8)
    board[4,4] = "W"
    board[5,5] = "W"
    board[4,5] = "B"
    board[5,4] = "B"
  }
  return(board)
}
A piece can be placed anywhere that a ‘sandwich’ can be made. To determine this, we first need a function to ‘look’ in a given direction from a given position on a board. For example, looking from b1 in the southeast direction on this board…

…should produce a ‘look’ of “B,W,W,space,space,space”. The look.to function takes in a board, position, and direction, and returns the state of the board spaces from that location outwards until it hits the edge of the board. The direction dictates the values of xstep and ystep, which in turn dictate which spaces of the board are examined, and in which order. The look.around function calls look.to from a given position for all eight directions, one at a time.
look.to = function(board, position, direction)
{
  if(direction == "N"){  xstep = 0;   ystep = -1}
  if(direction == "NE"){ xstep = 1;   ystep = -1}
  if(direction == "E"){  xstep = 1;   ystep = 0}
  if(direction == "SE"){ xstep = 1;   ystep = 1}
  if(direction == "S"){  xstep = 0;   ystep = 1}
  if(direction == "SW"){ xstep = -1;  ystep = 1}
  if(direction == "W"){  xstep = -1;  ystep = 0}
  if(direction == "NW"){ xstep = -1;  ystep = -1}
  ### Start looking from the space adjacent to the given position
  this_x = position[2] + xstep
  this_y = position[1] + ystep
  stones = c(); xlist = c(); ylist = c()
  ## Do this until we look to the edge of the board
  while(this_x > 0 & this_x <= ncol(board) &
        this_y > 0 & this_y <= nrow(board))
  {
    ### Record the stone (or space) and xy coords at the observed location
    stones = c(stones, board[this_y,this_x])
    xlist = c(xlist, this_x)
    ylist = c(ylist, this_y)
    ### Iterate the observed location based on the selected direction
    this_x = this_x + xstep
    this_y = this_y + ystep
  }
  return(list(stones=stones, x=xlist, y=ylist))
}
The legal.look function takes a given sequence of stones from the look.to function and determines if a given player (a single character of a string) can make a sandwich in that direction. It returns the number of sandwiched enemy stones. Note that the code considers any character other than the player’s own piece, a blank space ( . ), or a wall ( # ) to be an enemy stone, allowing this code to work with 3+ players.
The legal.directions function calls legal.look for each direction and returns the list of directions in which a capture can be made by the given player by placing at the given position. The which.legal function calls legal.directions for each empty position on the board to determine which, if any, spaces a given player may place on the board.
legal.look = function(player, look)
{
  Nenemies = 0
  enemy_chain = TRUE
  while(Nenemies < length(look) & enemy_chain)
  {
    ### Examine a space; if it's anything except
    ### the current player's piece, a space, or a wall, it's an enemy piece
    examined_piece = look[Nenemies + 1]
    if(examined_piece %in% c(player, ".", " ", "#"))
    { ## If it's not an enemy, stop looking
      enemy_chain = FALSE
    }
    else
    { ## If it is an enemy, iterate and keep looking
      Nenemies = Nenemies + 1
    }
  }

  ### If there are enemy pieces all the way to the edge of the board,
  ### return 'no capture'.
  if(Nenemies == length(look)){return(0)}
  ## There must be an allied piece immediately after the enemy pieces;
  ## if so, return the number of pieces that can be captured
  examined_piece = look[Nenemies + 1]
  if(examined_piece == player)
  {
    return(Nenemies)
  }
  ### Otherwise return 'no capture'
  return(0)
}
The plot.game function takes a board state, active player, and possibly the matrix of legal moves, and draws this information as a plot.
plot.game = function(board,player,legal_board=NULL,showlegal=TRUE)
The play.move function checks if a move at the position is legal by a given player on the given board. If it is, it updates the board by placing a stone for player at position and makes all the appropriate captures. It returns the new board state. It uses look.to and legal.look to determine which stones on the board to change.
play.move = function(board, this_player, position)
The play.game function is the main function that runs the whole game.
play.game = function(board=NA)
It takes in mouse clicks with locator and converts them into positions on the board (note the inversion of the y-axis).

mouseclick = locator(1)
input_x = round(mouseclick$x)
input_y = round(mouseclick$y)
input_y = nrow(board) - input_y + 1
Before using that click, it first checks that it maps to a space on the board, and that the space is a legal move by the given player. (If there is no legal move, it will accept any click as a ‘pass’ and move to the next player.)
while(all(board == new_board) & any(legal_board == TRUE))
if(input_x > 0 & input_x <= Nx & input_y > 0 & input_y <= Ny)
legal_board = which.legal(board, current_player)
if(legal_board[input_y,input_x])
If the move is legal, it calls play.move to update the board and cycles to the next player.
new_board = play.move(board,this_player=current_player,position=c(input_y,input_x))
current_player = player_list[ 1 + (player_idx %% Nplayers)]
It continues to do this until no player has a legal move, or there are no spaces left on the board. After which it returns the board state as a matrix as well as the score of each player.
while(any(board == ".") & players_skipped < length(player_list))
print(table(board))
return(board)
Finally, you can use setup.board to create a non-standard board and use it in play.game. For example, a 4×10 board can be created and used with the “Othello Wide” style.
board = setup.board(“Othello Wide”)
play.game(board)

* There are slight differences between the commercial version, Othello, and the public-domain game Reversi. Also, the name Othello is trademarked by Mattel.


### Evolving Networks

Finding neural network topologies is a problem with a rich history in evolutionary computing, or neuroevolution. This post will revisit some of the key ideas and outgoing research paths. Code related to this post is found here: [code link].

## NEAT

In their 2002 paper, Kenneth Stanley & Risto Miikkulainen proposed the foundational algorithm NeuroEvolution of Augmenting Topologies (NEAT). I’ll focus on this algorithm as a starting point; for earlier developments please see Section 5.2 of this great review, Schaffer’s 1992 review, and Yao’s 1999 review. The NEAT paper introduces the ideas clearly and there are other great NEAT overviews, so to change it up I will try to present the algorithm with generic notation, which is perhaps useful for thinking about how to modify the algorithm or apply it to a new problem setting.

I’ve also made an implementation [code link] contained in a single python file; you might find it useful to see the entire algorithm in one place, or as a comparison if you also implement NEAT as an exercise. For a more robust implementation, see NEAT-Python (which the code is based on) and its extension PyTorch-NEAT.

#### Problem

NEAT addresses the problem of finding a computation graph $G = (V, E)$. Each node $v\in V$ has a bias, activation, and aggregation function, written $(b, a, \text{agg})$, and each edge $e\in E$ has a source and destination, a weight, and may be active or inactive, written  $(u, v, w, \text{active})$.

Searching through the space of these graphs amounts to searching through a space of neural networks. NEAT conducts this search using a few generic neuroevolution concepts, which I’ll focus on below, and often implements them with design decisions that can be relaxed or modified for different problems.

#### High-Level Method

NEAT iteratively produces a set of candidates $P=\{G_1,\ldots,G_N\}$, using a candidate partitioning $S=\{S_1,\ldots,S_M\}$ where $S_j \subseteq P$ and a given function $f:\mathcal{G}\rightarrow\mathbb{R}$ which measures a candidate’s quality. The candidates, partitions, and quality function are known as ‘population’, ‘species’, and ‘fitness’, respectively.

The candidate set (“population”, rectangle) contains partitions (“species”, circles), each containing candidate graphs (diamonds).

Each NEAT iteration returns a new candidate set and new partitioning, denoted as $E(P^{(i)}, S^{(i)}, f)\rightarrow P^{(i+1)},S^{(i+1)}$. Intuitively E is an ‘evolution step’ that produces a new ‘generation’. NEAT’s goal is to eventually output a ‘good’ candidate set $P^{(i)}$. Typically good means that the best candidate has quality exceeding a goal threshold, $\max_j f(G^{(i)}_j) > \tau$. We then use this high-performing neural network on a task.
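The outer loop above can be sketched in a few lines of Python; the `evolve_step` argument and the toy `step` below are stand-ins for NEAT's actual evolution step (they are not the NEAT-Python API):

```python
import random

def evolve(population, fitness, evolve_step, tau, max_iters=100):
    """Repeat evolution steps until the best candidate's fitness exceeds tau."""
    for i in range(max_iters):
        best = max(population, key=fitness)
        if fitness(best) > tau:
            return best, i
        population = evolve_step(population, fitness)
    return max(population, key=fitness), max_iters

# Toy stand-in for an evolution step: candidates are numbers; the step keeps
# the fitter half and appends Gaussian mutations of the survivors.
def step(pop, f):
    top = sorted(pop, key=f, reverse=True)[: len(pop) // 2]
    return top + [x + random.gauss(0, 0.5) for x in top]

random.seed(0)
best, iters = evolve([random.uniform(-10, 10) for _ in range(20)],
                     lambda x: -abs(x - 3.0), step, tau=-0.05)
```

Everything NEAT-specific (mutation, crossover, fitness ranking, partitioning) lives inside `evolve_step`; the sections below fill in those pieces.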

### Evolution Steps

Each evolution step $E(P^{(i)}, S^{(i)}, f)$ produces a new population using four ideas: mutation, crossover, fitness ranking, and partitioning.

Mutation $m(G)\rightarrow G$ randomly perturbs a candidate graph. In NEAT, mutations consist of adding or deleting a node, adding or deleting an edge, or perturbing a node or edge property (such as an edge’s weight or a node’s activation). Each mutation type occurs with a pre-specified probability and involves a random perturbation; for instance, an add-edge mutation randomly chooses an edge location, and weight perturbations add Gaussian noise. One can design other mutations, such as resetting a weight to a new value.

An add-node mutation followed by an add-edge mutation. The add-node mutation splits an existing edge into two edges.
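A weight-perturbation plus add-edge mutation might look like the following minimal sketch; the genome format (`{'nodes': ..., 'edges': ...}`) is made up for illustration, whereas NEAT's actual genomes also carry node/edge properties and innovation IDs:

```python
import random

def mutate(genome, p_weight=0.8, p_add_edge=0.1, sigma=0.1):
    """Return a perturbed copy of a candidate graph. The genome format
    ({'nodes': [...], 'edges': {(u, v): weight}}) is hypothetical."""
    g = {"nodes": list(genome["nodes"]), "edges": dict(genome["edges"])}
    if random.random() < p_weight:       # perturb all weights with Gaussian noise
        for e in g["edges"]:
            g["edges"][e] += random.gauss(0, sigma)
    if random.random() < p_add_edge:     # occasionally add a random new edge
        u, v = random.sample(g["nodes"], 2)
        g["edges"].setdefault((u, v), random.gauss(0, 1))
    return g

random.seed(1)
g0 = {"nodes": [0, 1, 2], "edges": {(0, 2): 0.5, (1, 2): -0.3}}
g1 = mutate(g0)
```

Returning a copy rather than mutating in place keeps the parent available for crossover and for carrying elites into the next generation.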

Crossover $c(G_i,G_j)\rightarrow G$ produces a new candidate by swapping properties of two existing candidates. In NEAT, roughly speaking if $G_i$ and $G_j$ have a matching node $v_i, v_j$, then $G$ receives one of them randomly (similarly for edges). $G$ simply inherits non-matching nodes or edges. The notion of ‘matching’ is tricky due to isomorphic graph structures, so NEAT assigns an ID to each new node and edge, then uses these IDs for comparison (see 2.2 and 3.2 of the NEAT paper for details).  In part due to the added complexity, some papers leave out crossover completely.

NEAT crossover mechanism (diagram from the NEAT paper)

Fitness Ranking follows its name, first ranking candidates according to fitness, $(G_{1'},\ldots,G_{N'})$ where $i'>j'$ means $f(G_{i'})>f(G_{j'})$. Only the top (e.g. 20%) candidates are used for crossover and mutation. This locally biases the search towards candidates with high relative fitness.

Partitioning, or speciation, groups candidates according to a distance function $d(G_i,G_j)$. One use of the partitions is to promote diversity in the solution space by modifying each candidate’s fitness. To do so, NEAT defines a distance function and adjusts each candidate’s fitness based on its partition size. Each partition is guaranteed a certain number of candidates in the next generation based on the adjusted fitnesses.

Intuitively, a small partition contains graphs with relatively unique characteristics which might ultimately be useful in a final solution, even if they do not yield immediate fitness. To avoid erasing these characteristics from the search during fitness ranking, the small partition candidates receive guaranteed spots in the next phase.

Novel structures (diamonds, triangles) may ultimately yield a performance gain after further development, despite initially having lower fitness (light green) compared to common, developed structures with high fitness (dark green).

We can write this step as $f_{\text{partition}}(P,f_1,\ldots,f_N,S)\rightarrow (f_1',\ldots,f_N', S')$. We might alternatively view this step as just fitness re-ranking, $f_{\text{re-rank}}(P,f_1,\ldots,f_N)\rightarrow (f_1',\ldots,f_N')$, without requiring actual partitions, though without partitions it may be tricky to achieve the exact ‘guaranteed spots’ behavior.
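The fitness adjustment itself can be sketched very compactly; the sketch below assumes NEAT's explicit-fitness-sharing rule of dividing each candidate's fitness by its partition's size:

```python
from collections import Counter

def adjusted_fitness(fitnesses, partition):
    """Divide each candidate's fitness by its partition's size (NEAT's
    explicit fitness sharing); partition maps candidate index -> species id."""
    sizes = Counter(partition.values())
    return {i: f / sizes[partition[i]] for i, f in fitnesses.items()}

# Three candidates share species 'a'; one novel candidate is alone in 'b'
fit  = {0: 9.0, 1: 9.0, 2: 9.0, 3: 4.0}
spec = {0: "a", 1: "a", 2: "a", 3: "b"}
adj = adjusted_fitness(fit, spec)
```

After sharing, the lone candidate in species 'b' (adjusted fitness 4.0) outranks the members of the crowded species (9.0 / 3 = 3.0 each), which is exactly the diversity-preserving effect described above.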

The partitions $S'$ could also be useful in problems requiring a collection of solutions rather than a single optimal solution. For instance, rather than just selecting the highest performing candidate, we might consider the best candidate in each partition as the final output of NEAT, thus producing a collection of networks, each maximizing fitness in a different way than the others (assuming a partitioning scheme that promotes diverse solutions).

### Example Results

Let’s use the implementation [code link] to solve an xor problem and the Cartpole and Lunar-Lander gym environments.

To solve xor, NEAT finds a network with a single hidden node:

An xor network, including input (green), hidden (blue), and output (red) nodes. Labels show edge weights and node activations.

CartPole-v0 is easy to solve (even random search is sufficient), and NEAT finds a simple network without hidden units (for fun we’ll also construct an artificially complicated solution in the Variations section below):

A CartPole-v0 network.

LunarLander-v2 is more difficult, and NEAT finds a network with non-trivial structure:

A LunarLander-v2 network.

On the xor environment, NEAT creates around 10 partitions, on Cartpole just 1, and on LunarLander it tends to create 2-3 partitions. On these simple environments NEAT also performs similarly without crossover.

#### Variations

As mentioned before, we may want NEAT to produce a diverse set of solutions rather than a single solution. To demonstrate this intuition manually, suppose I want NEAT to find a network that uses sigmoid activations, and one that uses tanh. To do so, I increased the activation parameter in the node distance function (the $d(\cdot,\cdot)$ used in partitioning), then chose the highest-scoring network from each partition. On Cartpole, the partitions now naturally separate into sigmoid and tanh networks:

While Cartpole is evidently simple enough for a network with no hidden layers, perhaps we want to follow a trend of using large networks even for easy problems. We can modify the fitness function to ‘reject’ networks without a certain number of connections, and NEAT will yield more complicated solutions:

A more complicated way to play Cartpole.

In particular, I added -1000 to the fitness when the network had fewer than k connections, starting with k=5 and incrementing k each time a candidate achieved max fitness at the current k (stopping at k=20).

## Discussion & Extensions

Vanilla NEAT attempts to find both a network structure and the corresponding weights from scratch. This approach is very flexible and involves minimal assumptions, but could limit NEAT to problems requiring small networks. However, the key idea can still be applied or modified in creative ways.

### Minimal Assumptions

NEAT represents an extreme on the spectrum of learned versus hand-crafted architectural biases, by placing few assumptions on graph structure or learning algorithm. At a very speculative level, such flexibility may be useful for networks with backward or long-range connections that may be difficult to hand design, or as part of a learning process which involves removing or adding connections rather than optimizing weights of a fixed architecture.

A more concrete example is the recent Weight Agnostic Neural Networks paper (Gaier & Ha 2019), where the authors aimed to find a model for a task by finding good network structures, rather than finding good weights for a fixed network structure; they use a single shared weight value in each network and evaluate fitness on multiple rollouts, with a randomly selected weight value for each rollout. In this case, a NEAT variant allowed finding exotic network structures from scratch, without requiring prior knowledge such as hand-designed layer types.

As a rough approximation, I modified the NEAT implementation so that each network only has a single shared weight value, and included more activation functions (sin, cos, arctan, abs, floor). Each run of evaluation sets the network’s shared weight to a randomly sampled value ($\mathcal{U}([-2,2])$ excluding $[-0.1,0.1]$), and the network’s overall fitness is the average fitness over 10 runs. On XOR, NEAT finds a network with similar structure as before:

XOR network with a random shared weight value

This was just an initial experiment to give intuition, so check out the WANN paper for a good way of doing this for non-trivial tasks.
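The shared-weight evaluation described above can be sketched as follows; `net_fn` and `episode_fn` are hypothetical stand-ins for a network forward pass and a fitness rollout:

```python
import random

def shared_weight_fitness(net_fn, episode_fn, n_runs=10):
    """Average fitness over n_runs, each run using a single shared weight
    sampled uniformly from [-2, 2] excluding [-0.1, 0.1], as described above.
    net_fn(x, w) and episode_fn(f) are hypothetical stand-ins."""
    total = 0.0
    for _ in range(n_runs):
        w = random.uniform(0.1, 2.0) * random.choice([-1.0, 1.0])
        total += episode_fn(lambda x: net_fn(x, w))
    return total / n_runs

random.seed(42)
# Toy example: a one-weight "network" and a fitness that prefers |w| near 1
toy_net = lambda x, w: w * x
toy_episode = lambda f: -abs(abs(f(1.0)) - 1.0)
score = shared_weight_fitness(toy_net, toy_episode)
```

Averaging over sampled weights rewards structures that work regardless of the particular weight value, which is the point of the weight-agnostic setup.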

### Scalability

One could also consider improving NEAT’s scalability. A high level strategy is to reduce the search space by restricting the search to topologies, searching at a higher abstraction level, or introducing hierarchy.

An example is DeepNEAT (Miikkulainen et al 2017), which evolves graph structures using NEAT, but with nodes representing layers rather than single neurons, and edges specifying layer connectivity. Weight and bias values are learned with back-propagation. The authors further extend DeepNEAT to CoDeepNEAT, which represents graphs with a two level hierarchy defined by a blueprint specifying connectivity of modules. Separate blueprint and module populations are evolved, with the full graph (module + blueprint) assembled for fitness evaluation.

Blueprint and Module populations. Each node in a Blueprint (hexagon) is a Module.

This view is quite general, allowing learning the internal structure of reusable modules as well as how they are composed. In the experiments the authors begin with modules involving known components such as convolutional layers or LSTM cells and evolve only specific parts (e.g. connections between LSTM layers), but one might imagine searching for completely novel, reusable modules.

### Indirect Encodings

NEAT essentially writes down a description, or direct encoding, of every node and edge and their properties, then evolves these descriptions. The description size grows as the network grows, making the search space prohibitively large.

An alternative is to use a function to describe a network. For instance, we can evaluate a function $f:\mathbb{R}^4\rightarrow \mathbb{R}$ at pairs of points from a $V\times V$ grid to obtain a weighted adjacency matrix. This function is an example of an indirect encoding of the graph. Assuming the description of $f$ is small, we can describe very large networks by evaluating a suitable $f$ using a large grid or coordinate pattern. A neural network with a variety of activations that is evaluated in this manner is called a compositional pattern producing network (CPPN) [see, also].
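That evaluation step can be sketched in a few lines; the closed-form `cppn` below is a stand-in for an evolved CPPN:

```python
import numpy as np

def weights_from_cppn(f, n):
    """Evaluate an encoding function f(x1, y1, x2, y2) at all pairs of points
    on an n x n grid, yielding an n^2 x n^2 weight (adjacency) matrix."""
    coords = [(x, y) for x in np.linspace(-1, 1, n) for y in np.linspace(-1, 1, n)]
    return np.array([[f(x1, y1, x2, y2) for (x2, y2) in coords]
                     for (x1, y1) in coords])

# Stand-in for an evolved CPPN: a small closed-form function of the coordinates
cppn = lambda x1, y1, x2, y2: np.sin(3 * (x1 - x2)) * np.exp(-(y1 - y2) ** 2)
W = weights_from_cppn(cppn, 3)   # a 9 x 9 weight matrix from a tiny description
```

Note that the description size of `cppn` is fixed while the generated matrix grows with `n`, which is exactly what makes the encoding "indirect".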

HyperNEAT (Stanley et al. 2009) uses this idea to find network weights by evolving an indirect encoding function. HyperNEAT uses NEAT to evolve a (small) CPPN to act as $f$, then evaluates $f$ at coordinates from a hyper-cube, resulting in weights of a (larger) network used for fitness evaluation.

Several works have adopted or extended ideas from HyperNEAT for a deep learning setting. Fernando et al. 2016 proposed the Differentiable Pattern Producing Network (DPPN) which evolves the structure of a weight-generating CPPN $f$ while using back-propagation for its weights. The authors evolve a 200 parameter $f$ that generates weights for a fully connected auto-encoder with ~150,000 weights, though it is for a small-scale MNIST image de-noising task. Interestingly the weight generating function $f$ learns to produce convolution-esque filters embedded in the fully connected network.

From [Fernando et al 2016]

HyperNetworks (Ha et al 2016) further scales HyperNEAT’s notion of indirect encodings to more complex tasks by learning a weight generation function with end-to-end training, including an extension that can generate time-varying weights for recurrent networks:

From [Ha et al. 2016]

### Wrapping Up

In this post we revisited a core technique for generating neural network topologies and briefly traced some of its outgoing research paths. We took a step back from the constraints of pre-defined-layer architectures and searched through a space of very general (albeit small-scale) topologies. It was interesting to see how this generality has been refined towards larger-scale tasks, and also revisited in recent work. We briefly saw how fitness re-ranking and partitioning can be used to yield a set of distinct solutions, which connects to other concepts that I may discuss further in future posts.

[Code]

### RPushbullet 0.3.2

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new release 0.3.2 of the RPushbullet package is now on CRAN. RPushbullet interfaces with the neat Pushbullet service for inter-device messaging, communication, and more. It lets you easily send alerts like the one to the left to your browser, phone, tablet, … – or all at once.

This is the first new release in almost 2 1/2 years, and it once again benefits greatly from contributed pull requests by Colin (twice!) and Chan-Yub – see below for details.

#### Changes in version 0.3.2 (2019-07-21)

• The Travis setup was robustified with respect to the token needed to run tests (Dirk in #48)

• The configuration file is now readable only by the user (Colin Gillespie in #50)

• At startup initialization is now more consistent (Colin Gillespie in #53 fixing #52)

• A new function to fetch prior posts was added (Chanyub Park in #54). 

Courtesy of CRANberries, there is also a diffstat report for this release. More details about the package are at the RPushbullet webpage and the RPushbullet GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


### What’s published in the journal isn’t what the researchers actually did.

David Allison points us to these two letters:

Alternating Assignment was Incorrectly Labeled as Randomization, by Bridget Hannon, J. Michael Oakes, and David Allison, in the Journal of Alzheimer’s Disease.

Change in study randomization allocation needs to be included in statistical analysis: comment on ‘Randomized controlled trial of weight loss versus usual care on telomere length in women with breast cancer: the lifestyle, exercise, and nutrition (LEAN) study,’ by Stephanie Dickinson, Lilian Golzarri-Arroyo, Andrew Brown, Bryan McComb, Chanaka Kahathuduwa, and David Allison, in Breast Cancer Research and Treatment.

It can be surprisingly difficult for researchers to simply say exactly what they did. Part of this might be a desire to get credit for design features such as random assignment that were too difficult to actually implement; part of it could be sloppiness/laziness; but part of it could just be that, when you write, it’s so easy to drift into conventional patterns. Designs are supposed to be random assignment, so you label them as random assignment, even if they’re not. The above examples are nothing like pizzagate, but it’s part of the larger problem that the scientific literature can’t be trusted. It’s not just that you can’t trust the conclusions; it’s also that papers make claims that can’t possibly be supported by the data in them, and that papers don’t state what the researchers actually did.

As always, I’m not saying these researchers are bad people. Honesty and transparency are not enough. If you’re a scientist, and you write up your study, and you don’t describe it accurately, we—the scientific community, the public, the consumers of your work—are screwed, even if you’re a wonderful, honorable person. You’ve introduced buggy software into the world, and the published corrections, if any, are likely to never catch up.

P.S. Hannon, Oakes, and Allison explain why it matters that the design described as a “randomized controlled trial” wasn’t actually that:

By sequentially enrolling participants using alternating assignment, the researchers and enrolling physicians in this study were able to know to which group the next participant would be assigned, and there is no allocation concealment. . . .

The allocation method employed by Ito et al. allows the research team to determine in which group a participant would be assigned, and thus could (unintentionally) manipulate the enrollment. . . .

Alternating assignment, or similarly using patient chart numbers, days of the week, date of birth, etc., are nonrandom methods of group allocation, and should not be used in place of randomly assigning participants . . .

There are a number of disciplines (i.e., public health, community interventions, etc.) which commonly employ nonrandomized intervention evaluation studies, and these can be conducted with rigor. It is crucial for researchers conducting these nonrandomized trials to report procedures accurately.

### What’s new on arXiv

We derive the functional form of mutual information (MI) from a set of design criteria and a principle of maximal sufficiency. The (MI) between two sets of propositions is a global quantifier of correlations and is implemented as a tool for ranking joint probability distributions with respect to said correlations. The derivation parallels the derivations of relative entropy with an emphasis on the behavior of independent variables. By constraining the functional $I$ according to special cases, we arrive at its general functional form and hence establish a clear meaning behind its definition. We also discuss the notion of sufficiency and offer a new definition which broadens its applicability.
In this paper, we present the Metamorphic Testing of an in-use deep learning based forecasting application. The application looks at the past data of system characteristics (e.g. memory allocation) to predict outages in the future. We focus on two statistical / machine learning based components – a) detection of correlation between system characteristics and b) estimating the future value of a system characteristic using an LSTM (a deep learning architecture). In total, 19 Metamorphic Relations have been developed and we provide proofs & algorithms where applicable. We evaluated our method through two settings. In the first, we executed the relations on the actual application and uncovered 8 issues not known before. Second, we generated hypothetical bugs, through Mutation Testing, on a reference implementation of the LSTM based forecaster and found that 65.9% of the bugs were caught through the relations.
Deep neural networks have achieved tremendous success in various fields including medical image segmentation. However, they have long been criticized for being a black-box, in that interpretation, understanding and correcting architectures is difficult as there is no general theory for deep neural network design. Previously, precision learning was proposed to fuse deep architectures and traditional approaches. Deep networks constructed in this way benefit from the original known operator, have fewer parameters, and improved interpretability. However, they do not yield state-of-the-art performance in all applications. In this paper, we propose to analyze deep networks using known operators, by adopting a divide-and-conquer strategy to replace network components, whilst retaining its performance. The task of retinal vessel segmentation is investigated for this purpose. We start with a high-performance U-Net and show by step-by-step conversion that we are able to divide the network into modules of known operators. The results indicate that a combination of a trainable guided filter and a trainable version of the Frangi filter yields a performance at the level of U-Net (AUC 0.974 vs. 0.972) with a tremendous reduction in parameters (111,536 vs. 9,575). In addition, the trained layers can be mapped back into their original algorithmic interpretation and analyzed using standard tools of signal processing.
Mixtures-of-Experts (MoE) are conditional mixture models that have shown their performance in modeling heterogeneity in data in many statistical learning approaches for prediction, including regression and classification, as well as for clustering. Their estimation in high-dimensional problems is still however challenging. We consider the problem of parameter estimation and feature selection in MoE models with different generalized linear experts models, and propose a regularized maximum likelihood estimation that efficiently encourages sparse solutions for heterogeneous data with high-dimensional predictors. The developed proximal-Newton EM algorithm includes proximal Newton-type procedures to update the model parameter by monotonically maximizing the objective function and allows to perform efficient estimation and feature selection. An experimental study shows the good performance of the algorithms in terms of recovering the actual sparse solutions, parameter estimation, and clustering of heterogeneous regression data, compared to the main state-of-the art competitors.
One of the questions that arises when designing models that learn to solve multiple tasks simultaneously is how much of the available training budget should be devoted to each individual task. We refer to any formalized approach to addressing this problem (learned or otherwise) as a task selection policy. In this work we provide an empirical evaluation of the performance of some common task selection policies in a synthetic bandit-style setting, as well as on the GLUE benchmark for natural language understanding. We connect task selection policy learning to existing work on automated curriculum learning and off-policy evaluation, and suggest a method based on counterfactual estimation that leads to improved model performance in our experimental settings.
We present new techniques for automatically constructing probabilistic programs for data analysis, interpretation, and prediction. These techniques work with probabilistic domain-specific data modeling languages that capture key properties of a broad class of data generating processes, using Bayesian inference to synthesize probabilistic programs in these modeling languages given observed data. We provide a precise formulation of Bayesian synthesis for automatic data modeling that identifies sufficient conditions for the resulting synthesis procedure to be sound. We also derive a general class of synthesis algorithms for domain-specific languages specified by probabilistic context-free grammars and establish the soundness of our approach for these languages. We apply the techniques to automatically synthesize probabilistic programs for time series data and multivariate tabular data. We show how to analyze the structure of the synthesized programs to compute, for key qualitative properties of interest, the probability that the underlying data generating process exhibits each of these properties. Second, we translate probabilistic programs in the domain-specific language into probabilistic programs in Venture, a general-purpose probabilistic programming system. The translated Venture programs are then executed to obtain predictions of new time series data and new multivariate data records. Experimental results show that our techniques can accurately infer qualitative structure in multiple real-world data sets and outperform standard data analysis methods in forecasting and predicting new data.
Adversarial examples are of wide concern due to their impact on the reliability of contemporary machine learning systems. Effective adversarial examples are mostly found via white-box attacks. However, in some cases they can be transferred across models, thus enabling them to attack black-box models. In this work we evaluate the transferability of three adversarial attacks – the Fast Gradient Sign Method, the Basic Iterative Method, and the Carlini & Wagner method, across two classes of models – the VGG class (using VGG16, VGG19 and an ensemble of VGG16 and VGG19), and the Inception class (Inception V3, Xception, Inception Resnet V2, and an ensemble of the three). We also outline the problems with the assessment of transferability in the current body of research and attempt to amend them by picking specific ‘strong’ parameters for the attacks, and by using an L-Infinity clipping technique and the SSIM metric for the final evaluation of the attack transferability.
In this paper, we develop and explore deep anomaly detection techniques based on the capsule network (CapsNet) for image data. Being able to encode the intrinsic spatial relationships between parts and a whole, CapsNet has been applied as both a classifier and a deep autoencoder. This inspires us to design prediction-probability-based and reconstruction-error-based normality score functions for evaluating the ‘outlierness’ of unseen images. Our results on three datasets demonstrate that the prediction-probability-based method performs consistently well, while the reconstruction-error-based approach is relatively sensitive to the similarity between labeled and unlabeled images. Furthermore, both of the CapsNet-based methods outperform the principled benchmark methods in many cases.
We propose a new test for inequalities that is simple and uniformly valid. The test compares the likelihood ratio statistic to a chi-squared critical value, where the degrees of freedom is the rank of the active inequalities. This test requires no tuning parameters or simulations, and therefore is computationally fast, even with many inequalities. Further, it does not require an estimate of the number of binding or close-to-binding inequalities. To show that this test is uniformly valid, we establish a new bound on the probability of translations of cones under the multivariate normal distribution that may be of independent interest. The leading application of our test is inference in moment inequality models. We also consider testing affine inequalities in the multivariate normal model and testing nonlinear inequalities in general asymptotically normal models.
We propose a new batch mode active learning algorithm designed for neural networks and large query batch sizes. The method, Discriminative Active Learning (DAL), poses active learning as a binary classification task, attempting to choose examples to label in such a way as to make the labeled set and the unlabeled pool indistinguishable. Experimenting on image classification tasks, we empirically show our method to be on par with state of the art methods in medium and large query batch sizes, while being simple to implement and also extend to other domains besides classification tasks. Our experiments also show that none of the state of the art methods of today are clearly better than uncertainty sampling when the batch size is relatively large, negating some of the reported results in the recent literature.
We can define a neural network that can learn to recognize objects in less than 100 lines of code. However, after training, it is characterized by millions of weights that contain the knowledge about many object types across visual scenes. Such networks are thus dramatically easier to understand in terms of the code that makes them than the resulting properties, such as tuning or connections. In analogy, we conjecture that rules for development and learning in brains may be far easier to understand than their resulting properties. The analogy suggests that neuroscience would benefit from a focus on learning and development.
In this paper, we consider sequential online prediction (SOP) for streaming data in the presence of outliers and change points. We propose an INstant TEmporal structure Learning (INTEL) algorithm to address this problem. Our INTEL algorithm is developed based on a full consideration of the duality between online prediction and anomaly detection. We first employ a mixture of weighted GP models (WGPs) to cover the expected possible temporal structures of the data. Then, on the basis of the rich modeling capacity of this WGP mixture, we develop an efficient technique to instantly learn (capture) the temporal structure of the data that follows a regime shift. This instant learning is achieved only by adjusting one hyper-parameter value of the mixture model. A weighted generalization of the product of experts (POE) model is used for fusing predictions yielded from multiple GP models. An outlier is declared once a real observation seriously deviates from the fused prediction. If a certain number of outliers are consecutively declared, then a change point is declared. Extensive experiments are performed using a diverse set of real datasets. Results show that the proposed algorithm is significantly better than benchmark methods for SOP in the presence of outliers and change points.
Parameterized state space models in the form of recurrent networks are often used in machine learning to learn from data streams exhibiting temporal dependencies. To break the black box nature of such models it is important to understand the dynamical features of the input driving time series that are formed in the state space. We propose a framework for rigorous analysis of such state representations in vanishing memory state space models such as echo state networks (ESN). In particular, we consider the state space a temporal feature space and the readout mapping from the state space a kernel machine operating in that feature space. We show that: (1) The usual ESN strategy of randomly generating input-to-state, as well as state coupling leads to shallow memory time series representations, corresponding to cross-correlation operator with fast exponentially decaying coefficients; (2) Imposing symmetry on dynamic coupling yields a constrained dynamic kernel matching the input time series with straightforward exponentially decaying motifs or exponentially decaying motifs of the highest frequency; (3) Simple cycle high-dimensional reservoir topology specified only through two free parameters can implement deep memory dynamic kernels with a rich variety of matching motifs. We quantify richness of feature representations imposed by dynamic kernels and demonstrate that for dynamic kernel associated with cycle reservoir topology, the kernel richness undergoes a phase transition close to the edge of stability.
Recent research has introduced ideas from concept drift into process mining to enable the analysis of changes in business processes over time. This stream of research, however, has not yet addressed the challenges of drift categorization, drilling-down, and quantification. In this paper, we propose a novel technique for managing process drifts, called Visual Drift Detection (VDD), which fulfills these requirements. The technique starts by clustering declarative process constraints discovered from recorded logs of executed business processes based on their similarity and then applies change point detection on the identified clusters to detect drifts. VDD complements these features with detailed visualizations and explanations of drifts. Our evaluation, both on synthetic and real-world logs, demonstrates all the aforementioned capabilities of the technique.
The Quick, Draw! Dataset is a Google dataset with a collection of 50 million drawings, divided into 345 categories, collected from the users of the game Quick, Draw!. In contrast with most of the existing image datasets, in the Quick, Draw! Dataset, drawings are stored as time series of pencil positions instead of a bitmap matrix composed of pixels. This aspect makes it the largest doodle dataset available at this time. The Quick, Draw! Dataset is presented as a great opportunity for researchers to develop and study machine learning techniques. Due to the size of this dataset and the nature of its source, there is little information about the quality of the drawings it contains. In this paper, a statistical analysis of three of the classes contained in the Quick, Draw! Dataset is presented: mountain, book and whale. The goal is to give the reader a first impression of the data collected in this dataset. For the analysis of the quality of the drawings, a classification neural network was trained to obtain a classification score. Using this classification score and the parameters provided by the dataset, a statistical analysis of the quality and nature of the drawings contained in this dataset is provided.
We offer a graphical interpretation of unfairness in a dataset as the presence of an unfair causal path in the causal Bayesian network representing the data-generation mechanism. We use this viewpoint to revisit the recent debate surrounding the COMPAS pretrial risk assessment tool and, more generally, to point out that fairness evaluation on a model requires careful considerations on the patterns of unfairness underlying the training data. We show that causal Bayesian networks provide us with a powerful tool to measure unfairness in a dataset and to design fair models in complex unfairness scenarios.
Data providers such as government statistical agencies perform a balancing act: maximising information published to inform decision-making and research, while simultaneously protecting privacy. The emergence of identified administrative datasets with the potential for sharing (and thus linking) offers huge potential benefits but significant additional risks. This article introduces the principles and methods of linking data across different sources and points in time, focusing on potential areas of risk. We then consider confidentiality risk, focusing in particular on the ‘intruder’ problem central to the area, and looking at both risks from data producer outputs and from the release of micro-data for further analysis. Finally, we briefly consider potential solutions to micro-data release, both the statistical solutions considered in other contributed articles and non-statistical solutions.
A recent trend in IR has been the usage of neural networks to learn retrieval models for text based adhoc search. While various approaches and architectures have yielded significantly better performance than traditional retrieval models such as BM25, it is still difficult to understand exactly why a document is relevant to a query. In the ML community several approaches for explaining decisions made by deep neural networks have been proposed — including DeepSHAP which modifies the DeepLift algorithm to estimate the relative importance (shapley values) of input features for a given decision by comparing the activations in the network for a given image against the activations caused by a reference input. In image classification, the reference input tends to be a plain black image. While DeepSHAP has been well studied for image classification tasks, it remains to be seen how we can adapt it to explain the output of Neural Retrieval Models (NRMs). In particular, what is a good ‘black’ image in the context of IR? In this paper we explored various reference input document construction techniques. Additionally, we compared the explanations generated by DeepSHAP to LIME (a model agnostic approach) and found that the explanations differ considerably. Our study raises concerns regarding the robustness and accuracy of explanations produced for NRMs. With this paper we aim to shed light on interesting problems surrounding interpretability in NRMs and highlight areas of future work.
The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on today’s multicore platforms with up to 100 cores is difficult due to the need to manage conflicting updates on the result vector. Coloring approaches can be used to solve this problem without data duplication, but existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE), a novel coloring algorithm and open-source library implementation, which eliminates the shortcomings of previous coloring methods in terms of hardware efficiency and parallelization overhead. We describe the level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV, and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel MKL on two modern multicore processors. RACE outperforms all other approaches substantially and behaves in accordance with the Roofline model. Outliers are discussed and analyzed in detail. While we focus on SymmSpMV in this paper, our algorithm and software is applicable to any sparse matrix operation with data dependencies that can be resolved by distance-k coloring.
We explore a general framework in Markov chain Monte Carlo (MCMC) sampling where sequential proposals are tried as a candidate for the next state of the Markov chain. This sequential-proposal framework can be applied to various existing MCMC methods, including Metropolis-Hastings algorithms using random proposals and methods that use deterministic proposals such as Hamiltonian Monte Carlo or the bouncy particle sampler. Sequential-proposal MCMC methods construct the same Markov chains as those constructed by the delayed rejection method under certain circumstances. We demonstrate that applications of the sequential-proposal framework to Hamiltonian Monte Carlo (HMC) methods can lead to improved numerical efficiency compared to standard HMC methods and the No-U-Turn sampler. Finally, we show that the sequential-proposal bouncy particle sampler enables the constructed Markov chain to pass through regions of low target density and thus facilitates better mixing of the chain when the target density is multimodal.
In this paper, we introduce a new performance metric in the framework of status updates that we will refer to as the Age of Incorrect Information (AoII). This new metric deals with the shortcomings of both the Age of Information (AoI) and the conventional error penalty functions as it neatly extends the notion of fresh updates to that of fresh ‘informative’ updates. The word informative in this context refers to updates that bring new and correct information to the monitor side. After properly motivating the new metric, and with the aim of minimizing its average, we formulate a Markov Decision Process (MDP) in a transmitter-receiver pair scenario where packets are sent over an unreliable channel. We show that a simple ‘always update’ policy minimizes the aforementioned average penalty along with the average age and prediction error. We then tackle the general, and more realistic case, where the transmitter cannot surpass a certain power budget. The problem is formulated as a Constrained Markov Decision Process (CMDP) for which we provide a Lagrangian approach to solve. After characterizing the optimal transmission policy of the Lagrangian problem, we provide a rigorous mathematical proof to showcase that a mixture of two Lagrange policies is optimal for the CMDP in question. Equipped with this, we provide a low complexity algorithm that finds the optimal operating point of the constrained scenario. Lastly, simulation results are laid out to showcase the performance of the proposed policy and highlight the differences with the AoI framework.
Neural networks using transformer-based architectures have recently demonstrated great power and flexibility in modeling sequences of many types. One of the core components of transformer networks is the attention layer, which allows contextual information to be exchanged among sequence elements. While many of the prevalent network structures thus far have utilized full attention — which operates on all pairs of sequence elements — the quadratic scaling of this attention mechanism significantly constrains the size of models that can be trained. In this work, we present an attention model that has only linear requirements in memory and computation time. We show that, despite the simpler attention model, networks using this attention mechanism can attain comparable performance to full attention networks on language modeling tasks.

### Big News: Porting vtreat to Python

We at Win-Vector LLC have some big news.

We are finally porting a streamlined version of our R vtreat variable preparation package to Python.

vtreat is a great system for preparing messy data for supervised machine learning.

The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their limit. In particular we have found the .fit_transform() pattern is a great way to express building up a cross-frame to avoid nested model bias (in this case .fit_transform() != .fit().transform()). There is a bit of difference in how object oriented APIs compose versus how functional APIs compose. We are making an effort to research how to make this an advantage, and not a liability.
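To make the cross-frame idea concrete, here is a minimal sketch (not the actual vtreat API — the class name and everything in it are invented for illustration) of a target-mean encoder whose `.fit_transform()` returns out-of-fold encodings, so that `.fit_transform(X, y)` deliberately differs from `.fit(X, y).transform(X)`:

```python
# Illustrative sketch, not the vtreat implementation: a target-mean encoder
# where .fit_transform() produces a "cross-frame" of out-of-fold encodings.
class CrossFrameMeanEncoder:
    def __init__(self, n_folds=3):
        self.n_folds = n_folds
        self.level_means_ = {}
        self.global_mean_ = 0.0

    def fit(self, xs, ys):
        # In-sample fit: per-level means of the outcome over all rows.
        self.global_mean_ = sum(ys) / len(ys)
        by_level = {}
        for x, y in zip(xs, ys):
            by_level.setdefault(x, []).append(y)
        self.level_means_ = {k: sum(v) / len(v) for k, v in by_level.items()}
        return self

    def transform(self, xs):
        return [self.level_means_.get(x, self.global_mean_) for x in xs]

    def fit_transform(self, xs, ys):
        # Each row is encoded with means computed *without* its own fold,
        # which is what avoids the nested model bias.
        out = [None] * len(xs)
        for fold in range(self.n_folds):
            train = [i for i in range(len(xs)) if i % self.n_folds != fold]
            test = [i for i in range(len(xs)) if i % self.n_folds == fold]
            enc = CrossFrameMeanEncoder().fit([xs[i] for i in train],
                                              [ys[i] for i in train])
            for i, v in zip(test, enc.transform([xs[i] for i in test])):
                out[i] = v
        self.fit(xs, ys)  # final in-sample encoder for future .transform()
        return out


xs = ['a', 'a', 'b', 'b', 'a', 'b']
ys = [1, 0, 1, 1, 0, 0]
enc = CrossFrameMeanEncoder(n_folds=3)
print(enc.fit_transform(xs, ys))  # out-of-fold ("cross-frame") encodings
print(enc.transform(xs))          # in-sample encodings differ from the above
```

The out-of-fold values keep each row's own outcome from leaking into its encoding, which is why `.fit_transform() != .fit().transform()` is a feature rather than a bug in this pattern.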

The new repository is here and the first example regression is here. Next up is classification (likely natively multinomial this time). After that a few validation suites to prove the two vtreat systems work similarly. And then we have some exciting new capabilities.

The first application is going to be a shortening and streamlining of our current 4 day data science in Python course (while allowing more concrete examples!).

This also means data scientists who use both R and Python will have a few more tools that present similarly in each language.

Here is a non-trivial classification example.

### R Packages worth a look

Send Formatted Messages, Images and Objects to Microsoft ‘Teams’ (teamr)
Package of wrapper functions using R6 class to send requests to Microsoft ‘Teams’ <

Normalized Power Prior Bayesian Analysis (NPP)
Posterior sampling in several commonly used distributions using normalized power prior as described in Duan, Ye and Smith (2006) <doi:10.1002/env.75 …

Diagonally Dominant Principal Component Analysis (ddpca)
Consider the problem of decomposing a large covariance matrix into a low rank matrix plus a diagonally dominant matrix. This problem is called Diagonal …

Wrappers for ‘GDAL’ Utilities Executables (gdalUtilities)
R’s ‘sf’ package ships with self-contained ‘GDAL’ executables, including a bare bones interface to several of the ‘GDAL’-related utility programs colle …

### Document worth reading: “Are GANs Created Equal A Large-Scale Study”

Generative adversarial networks (GAN) are a powerful subclass of generative models. Despite a very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted large-scale empirical study on state-of-the-art models and evaluation measures. We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. This suggests that improvements can arise from a higher computational budget and tuning more than fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several data sets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures. Finally, we did not find evidence that any of the tested algorithms consistently outperforms the original one. Are GANs Created Equal A Large-Scale Study

### Program Evaluation: Difference-in-differences in R

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)


Regression analysis is one of the most in-demand machine learning methods in 2019. One family of regression-based approaches for measuring effects and evaluating policy programs is difference-in-differences. This method is well suited for benchmarking and finding improvements for optimization in organizations. It can, therefore, be used to design organizations so that they generate more value for employees and customers. In this article, you will learn how to do difference-in-differences analysis in R.

## Methodology

The difference in differences (DiD) method is a statistical technique or quasi-experimental design method, and it is used primarily in the social sciences and econometrics. In social science, it is sometimes called a “controlled before-and-after” study.

The DiD method involves comparing results from two groups, with data from each group being recorded over two time periods. One group (the control group) is not exposed to any treatment or intervention whatsoever; the other (treatment group) is exposed to a treatment or intervention before or during one of the two time periods. The same observations are made in both groups over each time period.

The data is analyzed by first calculating the difference in first and second time periods, and then subtracting the average gain (or difference) in the control group from the average gain (or difference) in the treatment group.
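As a concrete numeric illustration of that calculation (all numbers here are made up), suppose the treatment group's mean outcome moves from 10 to 18 while the control group's moves from 9 to 13:

```python
# Hypothetical group means before and after the intervention.
treat_before, treat_after = 10.0, 18.0
control_before, control_after = 9.0, 13.0

# Step 1: the within-group gain over the two time periods.
gain_treat = treat_after - treat_before        # 8.0
gain_control = control_after - control_before  # 4.0

# Step 2: subtract the control group's gain from the treatment group's gain.
did_estimate = gain_treat - gain_control
print(did_estimate)  # 4.0
```

The control group's gain of 4 stands in for what would have happened to the treatment group without the intervention, leaving an estimated treatment effect of 4.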

## Difference-in-Difference Estimator

To construct the DiD estimator, let us consider the main assumption of DiD:
counterfactual levels for the treated and non-treated groups can be different, but their variation over time is similar.

Let us explain this with statistics:
$$E(Y_0(t_1) - Y_0(t_0) \mid D=1) = E(Y_0(t_1) - Y_0(t_0) \mid D=0)$$
Here $E(Y_0(t_1) \mid D=1)$ is the unobserved counterfactual: the untreated outcome of the treated group in the second period.

In the absence of treatment, the change in the treated group's outcome would have been the same as the change in the non-treated group's outcome, i.e. changes in the economy or life cycle, etc. (unrelated to treatment) affect the two groups in a similar way.

This relaxes the stronger conditional independence assumptions:
$$E(Y_0 \mid X, D=0) = E(Y_0 \mid X, D=1)$$
$$E(Y_1 \mid X, D=0) = E(Y_1 \mid X, D=1)$$

Selectivity bias is allowed even conditional on X, but only through an individual fixed effect (i.e. time constant).

Let us again consider the main assumption of DID:
$$E(Y_0(t_1) - Y_0(t_0) \mid D=1) = E(Y_0(t_1) - Y_0(t_0) \mid D=0)$$

It is possible to show how this assumption is used to generate a “control group” that can be substituted in for the missing counterfactual:

$$TTE = E(Y_1(t_1) - Y_0(t_1) \mid D=1)$$
$$= E(Y_1(t_1) - Y_0(t_0) + Y_0(t_0) - Y_0(t_1) \mid D=1)$$
$$= E(Y_1(t_1) - Y_0(t_0) \mid D=1) - E(Y_0(t_1) - Y_0(t_0) \mid D=1) \quad \text{(2nd term unobserved)}$$
$$= E(Y_1(t_1) - Y_0(t_0) \mid D=1) - E(Y_0(t_1) - Y_0(t_0) \mid D=0) \quad \text{(2nd term observed)}$$

Formally, this can be written in a regression framework with individual fixed effects and time fixed effects:
$$Y_{it} = a_t + b\,D_{it} + m_i + u_{it}$$
where $a_t$ is the time effect and $m_i$ the individual effect.

Then we can write the model in first differences:
$$Y_{i1} - Y_{i0} = [a_1 - a_0] + b\,[D_{i1} - D_{i0}] + [u_{i1} - u_{i0}]$$

And the fixed-effects estimator reduces to:
$$b = E[Y_{i1} - Y_{i0} \mid D=1] - E[Y_{i1} - Y_{i0} \mid D=0]$$

The sample analogue of this is the DiD estimator; the identifying assumption of the model is:
$$E[u_{i1} - u_{i0} \mid D=1] = E[u_{i1} - u_{i0} \mid D=0]$$
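The identification argument above can be checked numerically. The following sketch (simulated data, not from any real study) builds two periods with a group-specific fixed effect (so levels differ), a common time effect, and a true treatment effect of 0.5, then verifies that the difference of within-group changes recovers the effect while a naive post-period comparison does not:

```r
set.seed(1)
n  <- 10000
d  <- rbinom(n, 1, 0.5)            # D: treatment group indicator
m  <- 2 * d + rnorm(n)             # individual fixed effect; differs by group (selection)
y0 <- m + rnorm(n)                 # outcome in t0: nobody treated yet
y1 <- m + 1 + 0.5 * d + rnorm(n)   # outcome in t1: common time effect 1, true effect 0.5

# DiD: difference of within-group changes; the fixed effect m and the
# common time effect both cancel out
did <- (mean(y1[d == 1]) - mean(y0[d == 1])) -
       (mean(y1[d == 0]) - mean(y0[d == 0]))

# Naive post-period comparison: contaminated by the selection in levels
naive <- mean(y1[d == 1]) - mean(y1[d == 0])

round(c(did = did, naive = naive), 2)
```

Here `did` lands near the true effect 0.5, while `naive` absorbs the group difference in levels and is badly biased.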

## Methodological Assumptions for Difference in Differences

When doing a DiD analysis we assume that the composition of the groups being studied is stable over the time period of interest. We also assume there are no spillover effects, that the amount of treatment or intervention given is not determined by the outcome, and that both groups have parallel trends in their outcome; i.e., if no treatment were given, the difference between the two groups would stay constant over time.

The most important assumption in DiD is the parallel trends assumption. It is, therefore, necessary to be cautious about studies that do not graphically show these trends. If there is no convincing graph that shows the parallel trends in the pre-treatment outcomes for the treatment and control groups, be cautious. If the parallel trends assumption holds and we can credibly rule out any other time-variant changes that may confound the treatment, then DiD is a trustworthy method.

The difference in difference method is intuitive and fairly flexible; it will show a causal effect from observational data if the basic assumptions are met. Since it focuses on change, rather than the absolute levels, the groups being compared can start at different levels. Another key strong point to the DiD method is that it accounts for change due to factors other than the treatment or intervention being studied.

First we will load the dataset in R and do descriptive statistics:

library(haven)   # read_dta(), to read the Stata file
library(dplyr)
library(skimr)

data_file = '../_data/did_data.dta'
if (!file.exists(data_file)) {
  # The download URL was lost from this post; `data_url` stands in for the
  # original (elided) source of the EITC data.
  download.file(data_url, destfile = data_file)
}
dataf = read_dta(data_file)  # this load step was missing from the post

skim(dataf, work, year, children) %>%
  skimr::kable(digits = 0)
Skim summary statistics
n obs: 13746
n variables: 11

Variable type: numeric

variable    missing    complete      n       mean       sd      p0     p25     p50     p75     p100      hist
----------  ---------  ----------  -------  ---------  ------  ------  ------  ------  ------  ------  ----------
children       0        13746      13746     1.19      1.38     0       0       1       2       9      ▇▂▁▁▁▁▁▁
work           0        13746      13746     0.51      0.5      0       0       1       1       1      ▇▁▁▁▁▁▁▇
year           0        13746      13746    1993.35    1.7     1991    1992    1993    1995    1996    ▇▇▁▇▇▁▆▆



Now let us construct dichotomous (indicator) variables for (a) before-and-after the EITC takes effect in 1994 and (b) the treatment group (1 or more children). We’ll keep these as logicals:

dataf = dataf %>%
  mutate(post93 = year >= 1994, anykids = children >= 1)


## Difference-in-difference regression in R

The regression equation is:

$$work = \beta_0 + \delta_0\,post93 + \beta_1\,anykids + \delta_1\,(anykids \times post93) + \varepsilon$$

The difference-in-differences coefficient is $\delta_1$, which indicates how the effect of having kids changed after the EITC went into effect. Let’s take a second to plot this:

ggplot(dataf, aes(post93, work, color = anykids)) + geom_jitter() + theme_minimal()


The code above gives us the following plot:

Let us also plot the variable means:

ggplot(dataf, aes(year, work, color = anykids)) +
stat_summary(geom = 'line') +
geom_vline(xintercept = 1994) +
theme_minimal()

No summary function supplied, defaulting to mean_se()



The code above gives us the following plot:

The parallel-trends assumption looks reasonable here, even though the trends themselves are not linear.
Let us now run the regression, a linear probability model in which the DiD coefficient is the coefficient $\delta_1$ on the interaction $anykids \times post93$:

model = lm(work ~ anykids*post93, data = dataf)
summary(model)

Call:
lm(formula = work ~ anykids * post93, data = dataf)

Residuals:
Min      1Q  Median      3Q     Max
-0.5755 -0.4908  0.4245  0.5092  0.5540

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)             0.575460   0.008845  65.060  < 2e-16 ***
anykidsTRUE            -0.129498   0.011676 -11.091  < 2e-16 ***
post93TRUE             -0.002074   0.012931  -0.160  0.87261
anykidsTRUE:post93TRUE  0.046873   0.017158   2.732  0.00631 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4967 on 13742 degrees of freedom
Multiple R-squared:  0.0126,	Adjusted R-squared:  0.01238
F-statistic: 58.45 on 3 and 13742 DF,  p-value: < 2.2e-16



The model shows that the EITC increased employment by about 4.7 percentage points among women with at least one child (the interaction coefficient, 0.0469). In the second plot, the blue line rises from about 45% before 1994 to about 50% afterward, consistent with this estimate.
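As a sanity check, the interaction coefficient can be recovered by hand from the four cell means implied by the regression output above: the DiD is the pre/post change for mothers minus the pre/post change for women without children.

```r
# Coefficients copied verbatim from the summary printed above
b0 <- 0.575460; b_kids <- -0.129498; b_post <- -0.002074; b_did <- 0.046873

pre_nokids  <- b0                              # no kids, before 1994
pre_kids    <- b0 + b_kids                     # kids, before 1994
post_nokids <- b0 + b_post                     # no kids, 1994 onward
post_kids   <- b0 + b_kids + b_post + b_did    # kids, 1994 onward

did <- (post_kids - pre_kids) - (post_nokids - pre_nokids)
round(did, 6)  # recovers the interaction coefficient: 0.046873
```

Algebraically everything except $\delta_1$ cancels, which is exactly the "difference of differences" logic from the derivation earlier in the post.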

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

## July 20, 2019

### My useR! 2019 Highlights & Experience: Shiny, R Community, {packages}, and more!

(This article was first published on R by R(yo), and kindly contributed to R-bloggers)

The useR! Conference was held in Toulouse, France and for me this
was my second useR! after my first in Brisbane last year. This time
around I wanted to write about my experiences and some highlights
similar to my post on the RStudio::Conference 2019 & Tidyverse Dev
Day

earlier this year. This blog post will be divided into 4 sections: Programming, Shiny, {Packages}, and Touring Toulouse.

You can find slides and videos (in a week or so) in:

As usual there were many talks that I didn’t get to go to, as there are around 3~5 parallel tracks across different rooms, each featuring talks on a certain aspect of R such as Shiny, modelling, data handling, DevOps, or education. I will link to the presentations below when they become available from R Consortium’s channel.

Let’s begin!

# Programming

## Enhancements to data tidying: Hadley Wickham

Acknowledging the difficulty of spread() and gather(), you might have heard of the creation of the pivot_wider() and pivot_longer() functions in recent months. You really should take a look at the work-in-progress vignette for a comprehensive understanding of the new functions, but the talk also featured some live-coding by Hadley (code available online) and some cool spread/gather animations from Charco Hui’s master’s thesis.

For more material you might be interested in Hiroaki Yutani’s tidyr 1.0.0 presentation from June’s Tokyo.R meetup. It’s mainly in Japanese but there is lots of code and there are explanatory graphics that may aid you in visualizing how the new functions work. You can also read a short English summary of the talk here.
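As a quick illustration of the new interface (a made-up toy table, not an example from the talk): pivot_longer() stacks a set of columns into name/value pairs, and pivot_wider() inverts it.

```r
library(tidyr)

# A small "wide" table: one column per year (hypothetical data)
wide <- tibble::tibble(
  country = c("FR", "JP"),
  `2018`  = c(1.9, 0.8),
  `2019`  = c(1.3, 0.7)
)

# Wide -> long: column names become a `year` column, cells become `growth`
long <- pivot_longer(wide, cols = c(`2018`, `2019`),
                     names_to = "year", values_to = "growth")

# Long -> wide: the inverse operation
wide2 <- pivot_wider(long, names_from = year, values_from = growth)
```

Compared to gather()/spread(), the argument names (`names_to`, `values_to`, `names_from`, `values_from`) say explicitly which direction the data is moving.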

## n() cool dplyr things: Romain Francois

Taking the tidy data principles into account but for grouped data,
Romain Francois talked about the new group_*() functions in the
{dplyr} package.

While in previous versions of {dplyr} working in a tidy manner with groups was done with group_by() then dplyr::do(), the latter function has been deprecated and largely replaced by the {purrr} family of functions. In this context the group_map(), group_modify(), and group_walk() functions iterate like the {purrr} functions, but over groups. You can apply a function to each group inline via a lambda, ~ (as below), or you can specify a function directly without the lambda.

group_split() operates similarly to base::split() but splits by groups, the output being a list of sliced groups. The group_keys() function returns the exact grouping structure of the data you used group_by() on, allowing you to check that the structure is right before you start applying functions to your data. group_data() and group_rows() give you different kinds of information about your grouped data, as can be seen below.

To shorten the group_by() %>% summarize() workflow you could instead use the summarize_at() function: select specific columns with vars(), supply actions via a lambda, ~, and specify multiple functions with list().
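A minimal sketch of these group_*() helpers on the built-in mtcars data (my own example, not code from the talk):

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

group_keys(by_cyl)           # the grouping structure: one row per cyl value (4, 6, 8)
length(group_split(by_cyl))  # like base::split(): a list of 3 sliced groups

# group_map() iterates over the groups, purrr-style; ~ is the lambda shorthand
sizes <- by_cyl %>% group_map(~ nrow(.x))

# summarise_at(): pick columns with vars(), supply several functions with list()
mtcars %>%
  group_by(cyl) %>%
  summarise_at(vars(mpg, hp), list(mean = mean, sd = sd))
```

The last call produces one row per group with columns mpg_mean, hp_mean, mpg_sd, and hp_sd, replacing several separate summarize() lines.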

Romain also talked about the {dance} package which is mainly used to experiment and test out possible new {dplyr} functions by leveraging the
relatively new {vctrs} and {rlang} packages’ features. The package has a theme of using famous dance moves as the function names!

## Reusing tidyverse code – Lionel Henry

Lionel Henry talked about programming with {tidyverse} functions. As an introduction he went over data masking in {dplyr} and how it is optimized for interactive coding and single-use pipes (%>%). The use of non-standard evaluation (NSE) makes analyses easy, as you can focus on the data rather than the data structure. However, we hit a stumbling block when we want to create custom functions that program with {dplyr}: the difference between computing interactively in the workspace versus computing inside reusable functions.
This is where tidy evaluation comes into play, via {rlang}, for flexible and robust programming in the tidyverse. However, {rlang} confused a lot of people due to the strange new syntax it introduced, such as !!, !!!, and enquo(). It also introduced new concepts such as quasiquotation and quosures that made it hard to learn, especially for those without a programming background. Acknowledging this obstacle, the {{ }} operator (read as “curly-curly”) was introduced to make creating tidyeval functions easier. It was inspired by the {glue} package and is a shortcut for !!enquo(var).
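A tiny example of {{ }} in action (my own sketch, not code from the talk): forwarding bare column names into group_by() and summarise() with no explicit enquo()/!! in sight:

```r
library(dplyr)

# {{ var }} forwards the caller's bare column name into the data mask
group_mean <- function(data, group_var, value_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean = mean({{ value_var }}))
}

group_mean(mtcars, cyl, mpg)  # one mean mpg per cyl group
```

The same function written pre-{{ }} would need `group_by(!!enquo(group_var))`, which is exactly the boilerplate the new operator hides.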

# Shiny

## Keynote #2: Shiny apps and Reproducibility – Joe Cheng

Compared to an R script or R Markdown document, reproducibility suffers in Shiny apps as the outputs are transient and not archivable. RStudio’s Joe Cheng talked about how reproducible analysis with Shiny is inconvenient, as re-enacting the user’s interaction steps is necessary. He made the case for a simple way to recover reproducible code from an app, with examples from various industries such as:

• ex. Drug research/pharma validation (workflow)
• ex. Teaching: statistical concepts and code snippets
• ex. Gadgets/add-ins: building ggplots, regex, and SQL queries then
insert the code into source/console editor

The different possible outputs we might want from a Shiny app are:

• To download a ZIP with source code & data, other supporting files,
and the actual rendered result

From there Joe talks about how there are a number of options available
such as :

1. Copy-paste: Have a Shiny app and RMD report
• Pros: Copy-pasted code is high fidelity and easy to understand
• Cons: Two copies must be kept in sync and method will not work for
more dynamic apps
2. Lexical analysis: automatically generate scripts from app source
code (static analysis and heuristics)

• Pros: Easy to add to app
• Cons: Not ALL apps can be translated automatically
• Generated code may not be camera ready as it may contain lots of
code relating to the Shiny app’s structure
3. Programmatic: Meta-programming techniques to write code for dual
purposes (execute interactive and export static)

• Pros: Flexible
• Cons: High learning curve and significant effort needed to adapt
old Shiny apps

In light of the various pros and cons of the above options Joe with the
help of Carson Sievert created the…

### {shinymeta} package

There are four main steps to follow when using {shinymeta}:

1. Identify the domain logic inside the code and separate it from
Shiny’s reactive structure

• Activate meta mode with withMetaMode() or expandChain()
• Use metaReactive() to create a reactive() that returns a code
expression
• Other functions to return code include metaObserve(),
metaRender(), etc.
• You can also wrap the code you want with metaExpr() inside a
function

2. Within the domain logic you identified, identify references to
reactive values and expressions that need to be replaced with static
values and static code

• De-reference reactive values with !!
• Replace reactive values with the actual values

3. At run time, choose which pieces of domain logic to expose to
the user

• expandChain(): turns !! code into variable and introduces code
snippet above the function
• The chain of variable declarations grow upwards as you sequentially
expand the meta-objects

4. Present the code to the user!
• Use outputCodeButton() to add a button for a specific output
• Use displayCodeModal() to display underlying code
• Use downloadButton() to allow people to click and download a R
script or RMD report
• Use buildScriptBundle() or buildRmdBundle() to generate .zip
bundles dynamically

Some of the limitations and future directions Joe, Carson, and the rest
of the Shiny team acknowledge are that:

• The formatting of the code can be improved (white
space not preserved)
• Future compatibility with Shiny async
• So far {shinymeta} only covers reproducing “snapshots” of the app
state
• More work and thinking needs to be done to reproduce a “notebook”
style record of the how/why/what of the multiple iterations of
interactive usage that was needed to get to a certain result and
output

There’s a lot to take in (this was probably the toughest talk for me to explain in this post…), so besides watching the keynote talk yourself you can also take a look at the {shinymeta} package website.

## {golem}: Shiny apps in production – Vincent Guyader

Vincent Guyader, from another French R organization, ThinkR, talked about the new {golem} package, which provides a nice framework for building robust, production-ready Shiny applications.

One of the key principles in R is that when you are repeatedly writing or using the same code or functions you should write a package, and this is no different for Shiny apps. The reasons Vincent stated were:

• Easy dependency, version, documentation management
• Easy installation and deployment

With the package infrastructure, you need to have the ui.R and
server.R (app_ui.R and app_server.R respectively in {golem}) in
the R directory and all you need to run your app is the run_app()
function.

{golem} also has functions that make it easy to deploy your app via RStudio Connect, ShinyProxy, Shiny Server, Heroku, etc.

For styling your app with customized JavaScript and CSS files you can use the add_js_file() and add_css_file() functions; you can do the same for modules with add_module(). As {golem} is a package, you have all the great attributes of an R package available to you, such as unit testing, documentation, and continuous integration/deployment!

## Our journey with Shiny: Some packages to enhance your applications – Victor Perrier & Fanny Meyer

Victor Perrier and Fanny Meyer from dreamRs talked about
the various Shiny packages that can extend the functionality of your
Shiny applications!

The first and probably the most well-known of this group is the {shinyWidgets} package, which gives you a variety of cool custom widgets, built via JavaScript and CSS, that you can add to your Shiny app.

Next, wondering how exactly users interacted with their Shiny apps and whether they used the included widgets, the dreamRs team created the {shinylogs} package. This package records any and all inputs that are changed, as well as the outputs and errors. This is done by storing the JavaScript objects via the localForage JavaScript library. With this in place Shiny developers can see the number of connections per day, the user agent family, the most viewed tabs, etc.

The {shinybusy} package gives the user feedback when a server operation is running, such as a spinning circle, a moving bar, or even any kind of gif you choose!

Last but not least is the {shinymanager} package, which allows you to administer and manage who can access your application and protects the source code of your app until authentication succeeds!

dreamRs is also the organization that created the {esquisse} package, which lets you interactively make ggplot2 graphs with an RStudio addin!

# Packages

## Summary of developments in R’s data.table package – Arun Srinivasan

I’ve been curious about data.table so I decided to go to this talk
to learn more from Arun Srinivasan, one of the authors of the package. Starting off
with some trivia, I finally learned that the reason for the seal on the
hex sticker is because seals make an “aR! aR! aR!” sound according to
{data.table} creator Matt Dowle, which I thought was pretty great!

Compared to a year ago there has been a lot of change and progress in
{data.table}:

A key principle of {data.table} is that there are no dependencies or
imports in the package!

The general form of using {data.table} is as follows:

Arun also showed us some examples:

At the end he also talked about the new optimization and functionalities
in the package.

• for ‘i’: auto-indexing and parallel subsets (columns processed in
parallel)
• for ‘j’: using GForce
• for ‘by’: parallelization of radix ordering
• new functionality: froll(), coalesce(), and nafill()
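A small sketch of the general DT[i, j, by] form and the new helpers (my own example; note the exported names are frollmean()/fcoalesce()/nafill()):

```r
library(data.table)

dt <- as.data.table(mtcars)

# General form: DT[i, j, by] -- filter rows (i), compute (j), grouped (by)
dt[cyl > 4, .(mean_mpg = mean(mpg)), by = cyl]

# New functionality mentioned in the talk
nafill(c(1, NA, 3), type = "locf")   # carry the last observation forward: 1 1 3
fcoalesce(c(NA, 2), c(9, 9))         # first non-NA, element-wise: 9 2
frollmean(1:5, n = 2)                # rolling mean of width 2: NA 1.5 2.5 3.5 4.5
```

All three helpers are vectorized in C and parallelized internally, in keeping with the package's no-dependency, performance-first philosophy.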

At the end of the talk Arun thanked the 69 people (among them Michael Chirico, Philippe Chataignon, Jan Gorecki, etc.) who have contributed a lot to what {data.table} is today!

## {polite} – Dmytro Perepolkin

The {polite} package is one I’ve been using for over a year now (you might’ve seen me use it in my soccer or TV data viz) and I was delighted to hear that the creator was giving a lightning talk on it! Dmytro began with a few do’s and don’ts concerning user agents and being explicit about them:

Secondly, you should always check the website’s robots.txt, a file that stipulates various conditions for scraping activity. This can be done via Peter Meissner’s {robotstxt} package or by checking the output of polite::bow("theWebsiteYouAreScraping.com") (the polite::bow() function is what establishes the {polite} session)!

After getting permission you also need to limit the rate at which you scrape; you don’t want to overload the servers of the website you are using, so no parallelization! This can be done with the {ratelimitr} package or purrr::slowly(), while the {polite} package automatically delays by 5 seconds when you run polite::scrape().

After scraping, you should definitely cache your responses with {memoise}, which is what is used inside the polite::scrape() function. Also, wrap your scraper function with something like purrr::safely() so it returns a list of two components: a “result” for successes and an “error” object for failures.
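For example (a minimal sketch, not from the talk), purrr::safely() wraps any function so that failures become data instead of crashes:

```r
library(purrr)

safe_log <- safely(log)

ok   <- safe_log(10)     # $result holds log(10), $error is NULL
fail <- safe_log("oops") # $result is NULL, $error holds the condition object
```

In a scraper loop this means one bad page yields an error entry you can inspect later, rather than aborting the whole run.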

You can also read his blog post on the talk
here which explains a
bit more about the polite::use_manners() function that allows you to
include {polite} scrapers into your own R packages.

## {goodpractice}: good package development – Hannah Frick

Hannah Frick from Mango Solutions talked about {goodpractice}, a package that gives you advice about good practices for building an R package. Running goodpractice::gp() performs static code analysis on your package and can run around 200 of the available checks.

A cool thing is that you can customize the different checks it runs: set your own standards beforehand and run the checks based on those standards with the make_check() and make_prep() functions.
It’s a great package that I’ve used before at work and for my own
packages so definitely try it out!

# R Community

## The development of {datos} package for the R4DS Spanish translation – Riva Quiroga

Riva Quiroga talked about the community project of translating the “R for Data Science” book and R data sets into Spanish. This came about from the fact that learning R (or any programming language) can be tough for a non-English speaker, as it means you have to not only learn the programming but also figure out what the documentation and use cases in English even mean. To address this language gap the R4DS Spanish translation community project, “R para Ciencia de Datos”, was born on GitHub! Through GitHub and Slack the organization sought to translate both the book and the various data sets available in base R, for example turning “diamonds” into “diamantes”. However, they found that simply trying to rename() everything was not sustainable, so they had to find an alternative. This alternative ended up being the {datalang} package.

This package (created by RStudio’s Edgar Ruiz) uses a YAML spec file containing translations, in the language you want, for the variable names, value names, help files, etc. After creating the spec file you just pass it as an argument to datalang::translate_data()/translate_folder() and you’ll have a translated data set! The R para Ciencia de Datos Twitter account also hosts a Spanish version of #TidyTuesday called #DatosDeMiercoles, so check it out!

Another thought I had after this presentation was that maybe this might
be a good idea for Japanese?

## R Consortium Working Groups – Joseph Rickert

RStudio’s Joe Rickert talked about R Consortium’s Working Groups, an initiative to foster innovation among individuals and companies. Any individual or group can apply to create a working group to explore what R and other technologies can do in a certain field of interest. Throughout the talk Joe gave examples of successful working groups such as:

As advice for potential working groups Joe said that one should pick a
project with a very wide scope which can benefit from collaboration
between members and which can benefit a large portion of the R
community.

## Keynote: #3 ‘AI for Good’ in the R and Python ecosystems – Julien Cornebise

In the last keynote of the conference Julien Cornebise talked about using technology for good, drawing on lots of examples of both good and bad projects throughout his life.

Here are some quotes I was able to jot down:

On using technology for good:

“Technology is not a solution, it is an accelerator; essentially you just have a better optimizer, you’re just fitting better to the incentives we have around us as a society.”

On the motivation of getting involved in #DataForGood projects:

“Are you here to solve the problem or are you here for a really cool application of your fantastic new theory and algorithm?”

On “hackathon syndrome” of many solutions to #DataForGood problems:

“Github is a big cemetery of really good ideas … where do we find software engineers, where do we find the designers, how do we go from the solution to the project to a real product that can be used by many many people?”

Some of the projects he talked about were:

• Decode Darfur: Identifying remote burnt/destroyed villages in the Darfur region to provide credible evidence that they had been attacked by the Sudanese government and allies.
• Troll Patrol: Quantitative analysis of online abuse and violence against UK and US women politicians and journalists.
This is definitely a talk I would recommend everybody to watch and you
can do so from here!

# Tour Toulouse!

As I was only heading home on the following Monday, I had the entire
weekend to explore Toulouse! I was staying near the Capitole and as
Toulouse is pretty walkable I didn’t have to use public transportation
at all during my stay. I think I just about walked every street in the
city center! Unfortunately, the Musée des Augustins was closed but I was
able to visit most of the other sites! Below are some pictures:

Sunday was also Bastille Day so there were some fireworks on display as
well. All in all I had a great time in Toulouse!

# Conclusion

This was my second useR! Conference and I enjoyed it quite a lot, not to
mention I got to do some sightseeing which I wasn’t able to do much of
in Brisbane last year. I met a lot of people that I follow on Twitter
and I’ve had people come up to me who recognized me from all the data
viz/blog posts I do (a first for me) which was really cool (and it helps
as I’m very nervous about approaching people especially since they are
usually surrounded by other people and I don’t want to interrupt their
conversation and… “Oh no it’s time for the next session!”, etc.)!

During a post-conference dinner I had with a dozen or so random
R users that were still in Toulouse (including Diane Cook, Will Chase, Saras Windecker, Irene Steves, Alimi Eyitayo among others – and some that I didn’t even get to talk to because our group was so big) we all talked about how important the
community is. With how open everything is in regards to the talks
being recorded and the materials being put online you don’t necessarily
have to come all the way to the conference to be able to learn the material.
However, the important component of these conferences is being able to talk to the people and engaging with the community which
is something I’ve really felt to be a part of since I started R and
going to conferences in the past 2 years or so. I think nearly every one of the people I sat with at the table at dinner that night came from a different country and worked in a completely different area, which made for some real eye-opening discussion about how R is used worldwide and across industries. I
also learned about cultural differences in tech, especially women in tech
in Nigeria from Alimi Eyitayo (who
also gave a talk on Scaling useR Communities with Engagement and
Retention
Models
at the
conference).

There were still a ton of great talks I missed so I’m excited to watch the rest on YouTube. I think I will be at RStudio::Conference next year in San Francisco so hopefully I’ll see some of you there!


### Distilled News

Measuring forecast accuracy (or error) is not an easy task as there is no one-size-fits-all indicator. Only experimentation will show you what Key Performance Indicator (KPI) is best for you. As you will see, each indicator will avoid some pitfalls but will be prone to others.
As a data science engineer, it’s imperative that the sample data set you pick from the population data is reliable, clean, and well tested for its usability in machine learning model building. So how do you do that? Well, we have multiple statistical techniques, such as descriptive statistics, with which we measure the data’s central value and how it is spread around the mean/median: is it normally distributed, or is there a skew in the data spread? Please refer to my previous article on the same topic for more clarity.
HDSR is an open access journal of the Harvard Data Science Initiative published by the MIT Press.
It has been another intense year in the world of data, full of excitement but also complexity. As more of the world gets online, the ‘datafication’ of everything continues to accelerate. This mega-trend keeps gathering steam, powered by the intersection of separate advances in infrastructure, cloud computing, artificial intelligence, open source and the overall digitalization of our economies and lives.
Part I of the 2019 Data & AI Landscape covered issues around the societal impact of data and AI, and included the landscape chart itself. In this Part II, we’re going to dive into some of the main industry trends in data and AI. The data and AI ecosystem continues to be one of the most exciting areas of technology. Not only does it have its own explosive momentum, but it also powers and accelerates innovation in many other areas (consumer applications, gaming, transportation, etc). As such, its overall impact is immense, and goes much beyond the technical discussions below. Of course, no meaningful trend unfolds over the course of just one year, and many of the following trends have been years in the making. We’ll focus the discussion on trends that we have seen particularly accelerating in 2019, or gaining rapid prominence in industry conversations. We will loosely follow the order of the landscape, from left to right: infrastructure, analytics and applications.
After 10 years of ImageNet, AI researchers are digging into the details of test sets and some are asking just how much knowledge has really been created with machine learning.
A new paper shows how natural-language processing can accelerate scientific discovery.
The context: Natural-language processing has seen major advancements in recent years, thanks to the development of unsupervised machine-learning techniques that are really good at capturing the relationships between words. They count how often and how closely words are used in relation to one another, and map those relationships in a three-dimensional vector space. The patterns can then be used to predict basic analogies like ‘man is to king as woman is to queen,’ or to construct sentences and power things like autocomplete and other predictive text systems.
New application: A group of researchers have now used this technique to munch through 3.3 million scientific abstracts published between 1922 and 2018 in journals that would likely contain materials science research. The resulting word relationships captured fundamental knowledge within the field, including the structure of the periodic table and the way chemicals’ structures relate to their properties. The paper was published in Nature last week.
Implement some of the core OOP principles in a machine learning context by building your own Scikit-learn-like estimator, and making it better.
Let’s understand how to do an approach for multiclass classification for text data in Python through identify the type of news based on headlines and short descriptions.
Software is usually designed as a choose-your-own-adventure affair. To complete tasks, users move through an application by making a series of choices based on available options. This can include choosing an item from a menu, choosing the appropriate tool from a toolbar, or selecting a piece of content from a list. The user is always free to decide for themselves, but the design and presentation of these options has the power to greatly influence the choices they make.
Ever wished someone would just tell you what the point of statistics is and what the jargon means in plain English? Let me try to grant that wish for you! I’ll zoom through all the biggest ideas in statistics in 8 minutes! Or just 1 minute, if you stick to the large font bits.
ONNX is an open format to represent deep learning models. With ONNX, AI developers can more easily move models between state-of-the-art tools and choose the combination that is best for them. ONNX is developed and supported by a community of partners.
BERT (Bidirectional Encoder Representations from Transformers) is a recent paper published by researchers at Google AI Language. It has caused a stir in the Machine Learning community by presenting state-of-the-art results in a wide variety of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others. BERT’s key technical innovation is applying the bidirectional training of Transformer, a popular attention model, to language modelling. This is in contrast to previous efforts which looked at a text sequence either from left to right or combined left-to-right and right-to-left training. The paper’s results show that a language model which is bidirectionally trained can have a deeper sense of language context and flow than single-direction language models. In the paper, the researchers detail a novel technique named Masked LM (MLM) which allows bidirectional training in models in which it was previously impossible.
At the end of June, Motherboard reported on a new app called DeepNude, which promised – ‘with a single click’ – to transform a clothed photo of any woman into a convincing nude image using machine learning. In the weeks since this report, the app has been pulled by its creator and removed from GitHub, though open source copies have surfaced there in recent days. Most of the coverage of DeepNude has focused on the specific dangers posed by its technical advances. ‘DeepNude is an evolution of that technology that is easier to use and faster to create than deepfakes,’ wrote Samantha Cole in Motherboard’s initial report on the app. ‘DeepNude also dispenses with the idea that this technology can be used for anything other than claiming ownership over women’s bodies.’ With its promise of single-click undressing of any woman, it made it easier than ever to manufacture naked photos – and, by extension, to use those fake nudes to harass, extort, and publicly shame women everywhere. But even following the app’s removal, there’s a lingering problem with DeepNude that goes beyond its technical advances and ease of use. It’s something older and deeper, something far more intractable – and far harder to erase from the internet – than a piece of open source code.

### Finding out why

I remember the first time I saw a computer: a Power Macintosh 5260 (with Monkey Island on it). I was around 5 years old and I looked at it as if it belonged to another universe. It did; I was not allowed anywhere near it, because it was my older brother’s! That did not stop me. I browsed it for hours. The possibilities of computers seemed infinite, and, fuelled by the inspiration of sci-fi worlds, the dream of talking machines, machines that could assist humans, think for themselves and even have feelings, never stopped. I kept dreaming about the possibilities of the future.
One of the questions that arises when designing models that learn to solve multiple tasks simultaneously is how much of the available training budget should be devoted to each individual task. We refer to any formalized approach to addressing this problem (learned or otherwise) as a task selection policy. In this work we provide an empirical evaluation of the performance of some common task selection policies in a synthetic bandit-style setting, as well as on the GLUE benchmark for natural language understanding. We connect task selection policy learning to existing work on automated curriculum learning and off-policy evaluation, and suggest a method based on counterfactual estimation that leads to improved model performance in our experimental settings.
The use of machine learning systems to support decision making in healthcare raises questions as to what extent these systems may introduce or exacerbate disparities in care for historically underrepresented and mistreated groups, due to biases implicitly embedded in observational data in electronic health records. To address this problem in the context of clinical risk prediction models, we develop an augmented counterfactual fairness criteria to extend the group fairness criteria of equalized odds to an individual level. We do so by requiring that the same prediction be made for a patient, and a counterfactual patient resulting from changing a sensitive attribute, if the factual and counterfactual outcomes do not differ. We investigate the extent to which the augmented counterfactual fairness criteria may be applied to develop fair models for prolonged inpatient length of stay and mortality with observational electronic health records data. As the fairness criteria is ill-defined without knowledge of the data generating process, we use a variational autoencoder to perform counterfactual inference in the context of an assumed causal graph. While our technique provides a means to trade off maintenance of fairness with reduction in predictive performance in the context of a learned generative model, further work is needed to assess the generality of this approach.
Learning to Rank (LTR) from user interactions is challenging as user feedback often contains high levels of bias and noise. At the moment, two methodologies for dealing with bias prevail in the field of LTR: counterfactual methods that learn from historical data and model user behavior to deal with biases; and online methods that perform interventions to deal with bias but use no explicit user models. For practitioners the decision between either methodology is very important because of its direct impact on end users. Nevertheless, there has never been a direct comparison between these two approaches to unbiased LTR. In this study we provide the first benchmarking of both counterfactual and online LTR methods under different experimental conditions. Our results show that the choice between the methodologies is consequential and depends on the presence of selection bias, and the degree of position bias and interaction noise. In settings with little bias or noise counterfactual methods can obtain the highest ranking performance; however, in other circumstances their optimization can be detrimental to the user experience. Conversely, online methods are very robust to bias and noise but require control over the displayed rankings. Our findings confirm and contradict existing expectations on the impact of model-based and intervention-based methods in LTR, and allow practitioners to make an informed decision between the two methodologies.
We offer a graphical interpretation of unfairness in a dataset as the presence of an unfair causal path in the causal Bayesian network representing the data-generation mechanism. We use this viewpoint to revisit the recent debate surrounding the COMPAS pretrial risk assessment tool and, more generally, to point out that fairness evaluation on a model requires careful considerations on the patterns of unfairness underlying the training data. We show that causal Bayesian networks provide us with a powerful tool to measure unfairness in a dataset and to design fair models in complex unfairness scenarios.
Marginal structural models (MSM) with inverse probability weighting (IPW) are used to estimate causal effects of time-varying treatments, but can result in erratic finite-sample performance when there is low overlap in covariate distributions across different treatment patterns. Modifications to IPW which target the average treatment effect (ATE) estimand either introduce bias or rely on unverifiable parametric assumptions and extrapolation. This paper extends an alternate estimand, the average treatment effect on the overlap population (ATO), which is estimated on a sub-population with a reasonable probability of receiving alternate treatment patterns, to time-varying treatment settings. To estimate the ATO within a MSM framework, this paper extends a stochastic pruning method based on the posterior predictive treatment assignment (PPTA) as well as a weighting analogue to the time-varying treatment setting. Simulations demonstrate the performance of these extensions compared against IPW and stabilized weighting with regard to bias, efficiency and coverage. Finally, an analysis using these methods is performed on Medicare beneficiaries residing across 18,480 zip codes in the U.S. to evaluate the effect of coal-fired power plant emissions exposure on ischemic heart disease hospitalization, accounting for seasonal patterns that lead to change in treatment over time.

### Magister Dixit

“There will be no AI worthy of the name without causal inference.” Miguel A. Hernán, John Hsu, Brian Healy (July 12, 2018)

### If you did not already know

Bonsai
Extreme multi-label classification refers to supervised multi-label learning involving hundreds of thousands or even millions of labels. In this paper, we develop a shallow tree-based algorithm, called Bonsai, which promotes diversity of the label space and easily scales to millions of labels. Bonsai relaxes the two main constraints of the recently proposed tree-based algorithm Parabel, which partitions labels at each tree node into exactly two child nodes and imposes label balanced-ness between these nodes. Instead, Bonsai encourages diversity in the partitioning process by (i) allowing a much larger fan-out at each node, and (ii) maintaining the diversity of the label set further by enabling potentially imbalanced partitioning. By allowing such flexibility, it achieves the best of both worlds – fast training of tree-based methods, and prediction accuracy better than Parabel and on par with one-vs-rest methods. As a result, Bonsai outperforms state-of-the-art one-vs-rest methods such as DiSMEC in terms of prediction accuracy, while being orders of magnitude faster to train. The code for Bonsai is available at https://…/bonsai. …

HybridNet
In this paper, we introduce a new model for leveraging unlabeled data to improve generalization performances of image classifiers: a two-branch encoder-decoder architecture called HybridNet. The first branch receives supervision signal and is dedicated to the extraction of invariant class-related representations. The second branch is fully unsupervised and dedicated to model information discarded by the first branch to reconstruct input data. To further support the expected behavior of our model, we propose an original training objective. It favors stability in the discriminative branch and complementarity between the learned representations in the two branches. HybridNet is able to outperform state-of-the-art results on CIFAR-10, SVHN and STL-10 in various semi-supervised settings. In addition, visualizations and ablation studies validate our contributions and the behavior of the model on both CIFAR-10 and STL-10 datasets. …

Sure Thing Principle (STP)
In 1954, Jim Savage introduced the Sure Thing Principle to demonstrate that preferences among actions could constitute an axiomatic basis for a Bayesian foundation of statistical inference. Here, we trace the history of the principle, discuss some of its nuances, and evaluate its significance in the light of modern understanding of causal reasoning. The sure-thing principle (STP) was introduced by L.T. Savage using the following story: ‘A businessman contemplates buying a certain piece of property. He considers the outcome of the next presidential election relevant. So, to clarify the matter to himself, he asks whether he would buy if he knew that the Democratic candidate were going to win, and decides that he would. Similarly, he considers whether he would buy if he knew that the Republican candidate were going to win, and again finds that he would. Seeing that he would buy in either event, he decides that he should buy, even though he does not know which event obtains, or will obtain, as we would ordinarily say.’ …

BayesNAS
One-Shot Neural Architecture Search (NAS) is a promising method to significantly reduce search time without any separate training. It can be treated as a Network Compression problem on the architecture parameters from an over-parameterized network. However, there are two issues associated with most one-shot NAS methods. First, dependencies between a node and its predecessors and successors are often disregarded, which results in improper treatment of zero operations. Second, architecture parameters pruning based on their magnitude is questionable. In this paper, we employ the classic Bayesian learning approach to alleviate these two issues by modeling architecture parameters using hierarchical automatic relevance determination (HARD) priors. Unlike other NAS methods, we train the over-parameterized network for only one epoch then update the architecture. Impressively, this enabled us to find the architecture in both proxy and proxyless tasks on CIFAR-10 within only 0.2 GPU days using a single GPU. As a byproduct, our approach can be transferred directly to compress convolutional neural networks by enforcing structural sparsity which achieves extremely sparse networks without accuracy deterioration. …

### Gendered languages and women’s workforce participation rates

Rajesh Venkatachalapathy writes:

I recently came across a world bank document claiming that gendered languages reduce women’s labor force participation rates. It is summarized in the following press release: Gendered Languages May Play a Role in Limiting Women’s Opportunities, New Research Finds.

This sounds a lot like the piranha problem, if there is any effect at all.

I [Venkatachalapathy] am disturbed by claims of large effects in their study. Their work seems to rely conceptually on the Sapir-Whorf hypothesis in linguistics, which is also quite controversial on its own. I am curious to know what your take is on this report.

He continues:

The cognitive science behind Sapir-Whorf, and the related field of embodied cognition in general, is quite controversial; it appeals to many people, yet has very weak evidence (see, for example, the recent book by McWhorter). This paper seems to magnify this into a strong claim about macroeconomic labor market demographic indicators. I cannot avoid comparisons with Pinker’s hypothesis in his most recent book that enlightenment thought, and the secular humanistic principles derived from it, have been among the primary drivers of the civilizing process, of the Norbert Elias kind or the Pinker kind.

I am not claiming that such macro-level claims can never be justified. For example, I just began reading your academic colleague, economist Suresh Naidu’s, recent paper on how democratization in countries causes economic growth. From the looks of it, they seem to have worked hard at establishing their main hypothesis. Maybe their [Naidu or his collaborators’] approach might provide us with additional insight on whether the causal claims of the paper on gendered language and workforce participation are reasonable and defensible with existing data, and with their [the paper’s] data analysis approach. I just find it difficult to imagine how a psychologically weak effect can suddenly become magnified when scaled to the level of large-scale societies.

After having trained hard to be skeptical of all causal claims over the years, I see what I feel is an epidemic of causal claims popping up in the literature and I find it hard to believe them all, especially given the fact that progress in philosophical causality and causal inference has been only incremental.

My response: I agree that such claims from observational data in cross-country and cross-cultural comparisons can be artifactual, and languages are correlated with all sorts of things. I don’t know enough about the topic to say more.

### Science and Technology links (July 20th 2019)

1. Researchers solve the Rubik’s cube puzzle using machine learning (deep learning).
2. There has been a rise in the popularity of “deep learning” following some major breakthroughs in tasks like image recognition. Yet, at least as far as recommender systems are concerned, there are reasons to be skeptical of the good results being reported:

In this work, we report the results of a systematic analysis of algorithmic proposals for top-n recommendation tasks. Specifically, we considered 18 algorithms that were presented at top-level research conferences in the last years. Only 7 of them could be reproduced with reasonable effort. For these methods, it however turned out that 6 of them can often be outperformed with comparably simple heuristic methods, e.g., based on nearest-neighbor or graph-based techniques. The remaining one clearly outperformed the baselines but did not consistently outperform a well-tuned non-neural linear ranking method. Overall, our work sheds light on a number of potential problems in today’s machine learning scholarship and calls for improved scientific practices in this area.

3. Your blood contains about four grams of glucose/sugar (it is a tiny amount).
4. Our brain oscillates at a rate of about 40 Hz (40 times per second). Some researchers may have found the cells responsible for coordinating these waves.
5. Are we suffering from record-setting heat waves? A recent American government report concludes that we are not:

(…) the warmest daily temperature of the year increased in some parts of the West over the past century, but there were decreases in almost all locations east of the Rocky Mountains. In fact, all eastern regions experienced a net decrease (…), most notably the Midwest (about 2.2°F) and the Southeast (roughly 1.5°F). (…) As with warm daily temperatures, heat wave magnitude reached a maximum in the 1930s.

The same report observes that cold extremes have become less common, however.

### GEDCOM Reader for the R Language: Analysing Family History

(This article was first published on R Language – The Lucid Manager, and kindly contributed to R-bloggers)

Understanding who you are is strongly related to understanding your family history. Discovering ancestors is now a popular hobby, as many archives are available on the internet. The GEDCOM format provides a standardised way to store information about ancestors. This article shows how to develop a GEDCOM reader using the R language.

## The GEDCOM format

The GEDCOM format is not an ideal way to store information, but it has become the de-facto standard for family history. This format includes metadata and two sets of data. The file contains a list of the individuals, and it lists the families to which they belong.

The basic principle is that each line has a level, indicated by the first digit. At level zero, we find metadata and the individuals and their family. At level one, we see the various types of data, such as births, deaths and marriages. The deeper levels provide the data for these events.
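For illustration, a hypothetical fragment (names and dates invented) shows this level structure:

```
0 @I1@ INDI
1 NAME John /Smith/
1 SEX M
1 BIRT
2 DATE 12 JUN 1880
2 PLAC Toulouse, France
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 CHIL @I3@
```

Level 0 opens a record (an individual such as `@I1@`, or a family such as `@F1@`), level 1 names an event or attribute, and level 2 holds the details of that event.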

Heiner Eichmann maintains a website that explains the format and provides some examples of files to help you understand the syntax.

The GEDCOM format is not only old in the way it stores data, but it is also limited in the types of human relationships. These files only store genetic relationships between people and assume that these relationships are marriages between a wife and a husband. Human relationships are, however, a lot more complicated than the genetic relationships between children and their parents, grandparents and ancestors.

These issues aside, all genealogical software can export a file to GEDCOM. The next section shows how to create a basic GEDCOM reader using the stringr, tibble and dplyr packages from the Tidyverse.

The read.gedcom() function takes a GEDCOM file as input and delivers a data frame (tibble) with basic information:

• ID
• Full name
• Gender
• Birthdate and place
• Father
• Mother
• Death date and place

This code covers only basic information; it can easily be expanded to include further fields by adding lines to the while-loops and including the fields in the data frame.

The first lines read the file and set up the data frame. The extract() function extracts an individual’s ID from a line in the file. The for loop runs through each line of the GEDCOM file. When the start of a new individual is detected, the GEDCOM reader collects the relevant information.

Births and christenings are considered equal to simplify the data. In older data, we often only know one or the other. The function looks for the start of a family. It extracts the husband and wife and assigns these as parents to each of the children. The last section cleans the data and returns the result.

## Read GEDCOM file

## The Devil is in the Data
## lucidmanager.org/data-science
## Dr Peter Prevos

require(stringr)
require(tibble)
require(dplyr)

read.gedcom <- function(file) {
  ## Read the file into a character vector (one element per line)
  gedcom <- str_trim(readLines(file))
  idv <- sum(grepl("^0.*INDI$", gedcom))
  fam <- sum(grepl("^0.*FAM$", gedcom))
  cat(paste("Individuals: ", idv, "\n"))
  cat(paste("Families: ", fam, "\n"))
  family <- tibble(id = NA,
                   Full_Name = NA,
                   Gender = NA,
                   Birth_Date = NA,
                   Birth_Place = NA,
                   Father_id = NA,
                   Mother_id = NA,
                   Death_Date = NA,
                   Death_Place = NA)
  ## Extract data
  extract <- function(line, type) {
    str_trim(str_sub(line, str_locate(line, type)[2] + 1))
  }
  id <- 0
  for (l in 1:length(gedcom)) {
    if (str_detect(gedcom[l], "^0") & str_detect(gedcom[l], "INDI$")) {
      id <- id + 1
      family[id, "id"] <- unlist(str_split(gedcom[l], "@"))[2]
      l <- l + 1
      while (!str_detect(gedcom[l], "^0")) {
        if (grepl("NAME", gedcom[l]))
          family[id, "Full_Name"] <- extract(gedcom[l], "NAME")
        if (grepl("SEX", gedcom[l]))
          family[id, "Gender"] <- extract(gedcom[l], "SEX")
        l <- l + 1
        if (grepl("BIRT|CHR", gedcom[l])) {
          l <- l + 1
          while (!str_detect(gedcom[l], "^1")) {
            if (grepl("DATE", gedcom[l]))
              family[id, "Birth_Date"] <- extract(gedcom[l], "DATE")
            if (grepl("PLAC", gedcom[l]))
              family[id, "Birth_Place"] <- extract(gedcom[l], "PLAC")
            l <- l + 1
          }
        }
        if (grepl("DEAT|BURI", gedcom[l])) {
          l <- l + 1
          while (!str_detect(gedcom[l], "^1")) {
            if (grepl("DATE", gedcom[l]))
              family[id, "Death_Date"] <- extract(gedcom[l], "DATE")
            if (grepl("PLAC", gedcom[l]))
              family[id, "Death_Place"] <- extract(gedcom[l], "PLAC")
            l <- l + 1
          }
        }
      }
    }
    if (str_detect(gedcom[l], "^0") & str_detect(gedcom[l], "FAM")) {
      l <- l + 1
      while (!str_detect(gedcom[l], "^0")) {
        if (grepl("HUSB", gedcom[l]))
          husband <- unlist(str_split(gedcom[l], "@"))[2]
        if (grepl("WIFE", gedcom[l]))
          wife <- unlist(str_split(gedcom[l], "@"))[2]
        if (grepl("CHIL", gedcom[l])) {
          child <- which(family$id == unlist(str_split(gedcom[l], "@"))[2])
          family[child, "Father_id"] <- husband
          family[child, "Mother_id"] <- wife
        }
        l <- l + 1
      }
    }
  }
  family %>%
    mutate(Full_Name = gsub("/", "", str_trim(Full_Name)),
           Birth_Date = as.Date(Birth_Date, format = "%d %b %Y"),
           Death_Date = as.Date(Death_Date, format = "%d %b %Y")) %>%
    return()
}

## Analysing the data

There are many websites with GEDCOM files of family histories of famous and not so famous people. The Famous GEDCOMs website has a few useful examples to test the GEDCOM reader.

Once the data is in a data frame, you can analyse it any way you please. The code below downloads a file with the presidents of the US, with their ancestors and descendants. The alive() function filters the people who are alive at a certain date. For people without a birth date, it assumes a maximum age of 100 years.

The histogram shows the distribution of ages at time of death of all the people in the presidents file.

These are just some random examples of how to analyse family history data with this GEDCOM reader. The next article will explain how to plot a population pyramid using this data. A future article will discuss how to visualise the structure of family history.

## Basic family history statistics
library(tidyverse)
library(lubridate)

filter(presidents, grepl("Jefferson", Full_Name))

mutate(presidents, Year = year(Birth_Date)) %>%
ggplot(aes(Year)) +
geom_histogram(binwidth = 10, fill = "#6A6A9D", col = "white") +
labs(title = "Birth years in the presidents file")
ggsave("../../Genealogy/years.png")

alive <- function(population, census_date) {
  max_date <- census_date + 100 * 365.25
  filter(population, (is.na(Birth_Date) & (Death_Date <= max_date &
                                             Death_Date >= census_date)) |
           (Birth_Date <= census_date & Death_Date >= census_date)) %>%
    arrange(Birth_Date) %>%
    mutate(Age = as.numeric(census_date - Birth_Date) / 365.25) %>%
    return()
}

alive(presidents, as.Date("1840-03-07"))

The post GEDCOM Reader for the R Language: Analysing Family History appeared first on The Lucid Manager.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Impressions from useR! 2019

(This article was first published on Mirai Solutions, and kindly contributed to R-bloggers)

This year, the greater R community gathering useR! took place in sunny Toulouse in July, bringing together over 1000 practitioners from both academia and industry.

The event spanned five days, including:

• a tidyverse day
• one full day of workshops
• 6 keynotes and a few sponsor talks
• contributed talks and lightning talks over 6 parallel sessions
• a poster session
• a rich social program.

Here are a few impressions from Francesca, Nikki and Riccardo who attended the conference on behalf of Mirai Solutions.

#### Satellite Events

On Monday, ahead of the main conference, multiple satellite activities were available for the R aficionados already in town.

Forwards organized a series of meetings between Forwards friends, diversity scholarship awardees and the “newbies”.

A tidyverse developer day also took place at the conference venue, where attendees could work on tidyverse-related issues, learn about coding best practices and documentation, as well as being exposed to git and the development workflow.

#### Tutorial Day

An extremely rich offering of tutorials, with two blocks of nine parallel sessions, made it very difficult to make up our minds and choose where to go. Topics ranged from development workflows to visualization, Shiny, machine learning, data analysis and modeling. This showed once again how diverse the R language is!

The day closed with a welcome apero hosted by the Mayor of Toulouse in the City Hall and a hilarious translation from French to nerd-English by Romain François.

#### First Conference Day

What better start than with Music? The useR! 2019 organizers set the mood right by hosting the performance of Moultaqa Salam (“Meet and peace”), a group that perfectly melds the different souls and sounds of South-West Europe and North Africa.

Julia Stewart Lowndes, aka @juliesquid, opened with an inspiring talk about open source and collaboration. R can be The Force to do better science in less time, and Twitter can help build the community where sharing and communication can take place.

The famous wizard Hadley Wickham gave a talk about the latest enhancements of data tidying, proving that he does not just live on the internet, but also attends conferences!

Next, a unicorn made it to the stage to show us the many little tricks hidden in dplyr. He had a very nice little helper, who reminded us all of the importance of supporting working parents with the difficulties they may face.

Miraier Riccardo gave a talk on Shiny app deployment and integration into a custom website gallery, where he presented the challenges and steps for embedding and integrating a Shiny app into an existing website. As an example he used SmaRP, a Shiny app designed to guide people working in Switzerland towards a strategic retirement plan, previously presented at eRum in 2018.

Shiny has been a prominent topic with two sessions covering aspects ranging from Shiny development workflows with package golem to scalable, enterprise-level applications.

Julie Josse closed the day’s work with an academic-flavoured presentation about her research on dealing with missing values. However, the fun continued with the conference dinner, taking place at the amazing Cité de l’Espace, where we had a lovely time gazing at the interesting exhibitions and tasting nice French specialties!

#### Second Conference Day

Joe Cheng’s opening keynote premiered the brand new shinymeta, an R package tackling the issue of reproducibility with Shiny by providing tools for capturing logic in a Shiny app and exposing it as code that can be run outside of Shiny.

In the lightning talks session on “Workflow and Development”, Miraier Nikki presented CompareWith, a Meld-based R package, which provides user-friendly RStudio addins to perform diff and merge tasks. The slides from her talk can be found here.

The two “Programming” sessions were a great source of updates and novelty. Of particular interest were the introduction of the new embracing operator to simplify creating tidy eval functions, and Davis Vaughan’s talk about the new package rray that makes array calculations much easier. Colin Gillespie’s talk about Security and R was also one of the highlights.

The very interesting session on “Community and Conferences” gave a nice overview of how R is spreading in Africa and Latin America. It also featured a panel on the truth about satRdays, which offered a set of useful tips for organizing a (successful) event.

Closing the day was a poster session with a nice apero. It was extremely fun to first see the presenters advertising their posters in a 30-second time slot, some giving directions to the location of their poster, some having to step off stage while still talking. Overall, a great occasion to network and get a glimpse of various interesting projects.

#### Third Conference Day

The last day of useR! 2019 featured a last round of lightning and regular talks. Many notable presentations were in the “Performance” session, which included interesting updates about the packages data.table and future. A promising new package, pak, for managing package installation and dependencies was introduced by Gábor Csárdi, whereas Jim Hester showed how the package vroom can boost importing large data-sets.

The closing keynote by Julien Cornebise about ‘AI for Good’ in the R and Python ecosystems conveyed an extremely powerful message on how to use data science for good, e.g. to measure violence and abuse against women on Twitter or find burned villages in Darfur via satellite images. Checking out his talk is highly recommended.

Finally, Heather Turner received an award for her central role in the R community, and we would like to add our voice in praising her commitment.

#### Closing Remarks

If you could not attend a talk or would like to listen to one again, video recordings of the presentations are (or will be) available online on the R Consortium YouTube channel, while slides of the talks are downloadable from the useR! 2019 website.

Overall it was a very enjoyable and insightful event and a great venue in Toulouse. We look forward to useR! 2020 in St. Louis! Before that, and closer for Europeans, we’ll be sure to attend eRum2020, which Mirai Solutions is also actively supporting and which takes place in Milan, May 27th-30th.


### Distilled News

I am an intelligent machine. Humans call me artificially intelligent. I have been built to help de-clutter the deluge of information humans are producing. I decided to see myself (AI) through the information humans have put out on the internet. I can read internet content. Here I connected to the cord Reddit has provided. Humans call it an API. I kept reading over a period of time and obtained 7000 posts and comments that humans had written about me. My brain has multiple stacks of neural networks. Humans call them deep learning networks. I can identify names, classify what I read and understand the topics written about. I am trained in English grammar and can distinguish what humans call nouns, verbs, and adjectives. I can pull all of these together and make sense of what matters. I have developed ‘attention’ and can figure out what matters in the deluge. First I wanted to understand the overall conversations humans are having about me. This is what I saw.
As Machine Learning and AI become more and more popular, an increasing number of organizations are adopting this new technology. Predictive modeling is helping processes become more efficient, but it also allows users to gain benefits. One can predict how much you are likely to earn based on your professional skills and experience. The output could simply be a number, but users typically want to know why that value is given! In this article, I will demonstrate some methods for creating explainable predictions and guide you into opening these black-box models.
In Machine Learning it is normal to deal with Anomaly Detection tasks. Data scientists are frequently engaged in problems where they have to show, explain and predict anomalies. I also wrote a post about Anomaly Detection with Time Series, where I studied the behaviour of an internal system and provided anomaly forecasts for the future. In this post I try to solve a different challenge, changing the domain of interest: swapping from Time Series to Images. Given an image, we want to achieve a dual purpose: predict the presence of anomalies and locate them, giving a colourful representation of the results.
Neural networks are often described as universal function approximators. Given an appropriate architecture, these algorithms can learn almost any representation. Consequently, many interesting tasks have been implemented using Neural Networks – Image classification, Question Answering, Generative modeling, Robotics and many more. In this tutorial, we implement a popular task in Natural Language Processing called Language modeling. Language modeling deals with a special class of Neural Network trying to learn a natural language so as to generate it. We implement this model using a popular deep learning library called Pytorch.
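The tutorial builds its model in PyTorch; as a framework-free sketch of the core idea behind language modeling (learning which token tends to follow which, then sampling from those conditional probabilities), here is a hypothetical character-level bigram model in plain Python:

```python
from collections import defaultdict
import random

def train_bigram(text):
    """Count character bigrams and normalise the counts into
    conditional probabilities P(next_char | current_char)."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def generate(model, start, length, seed=0):
    """Sample a continuation one character at a time from the model."""
    rng = random.Random(seed)
    out = start
    for _ in range(length):
        nxt = model.get(out[-1])
        if not nxt:  # no continuation was ever observed for this character
            break
        chars, probs = zip(*nxt.items())
        out += rng.choices(chars, weights=probs)[0]
    return out

model = train_bigram("abab")
```

A neural language model does the same job with a learned, smoothed representation instead of raw counts, which is what lets it generalise beyond bigrams it has literally seen.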
In the last article, we walked through how to model an environment in a reinforcement learning setting and how to leverage the model to accelerate the learning process. In this article, I would like to take the topic further and introduce two more algorithms, Dyna-Q+ and Priority Sweeping, both based on the Dyna-Q method that we learnt in the last article. (If you find some of the game settings confusing, please check my last article.)
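As a rough, hypothetical illustration of the underlying Dyna-Q idea (not the article's own code or game): a tabular agent on a toy chain environment that mixes real Q-learning updates with extra planning updates replayed from a learned model.

```python
import random

def dyna_q(n_states=5, episodes=50, planning_steps=10,
           alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Dyna-Q on a toy chain: action 0 moves left, action 1
    moves right; reaching the right end yields reward 1 and ends
    the episode."""
    rng = random.Random(seed)
    goal = n_states - 1
    Q = {(s, a): 0.0 for s in range(n_states) for a in (0, 1)}
    model = {}  # learned model: (state, action) -> (reward, next_state)

    def step(s, a):
        s2 = max(s - 1, 0) if a == 0 else min(s + 1, goal)
        return (1.0 if s2 == goal else 0.0), s2

    for _ in range(episodes):
        s = 0
        while s != goal:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.choice((0, 1))
            else:
                a = max((0, 1), key=lambda act: Q[(s, act)])
            r, s2 = step(s, a)
            # (a) direct Q-learning update from real experience
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
            # (b) record the transition in the (deterministic) model
            model[(s, a)] = (r, s2)
            # (c) planning: replay random remembered transitions
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, 0)], Q[(ps2, 1)]) - Q[(ps, pa)])
            s = s2
    return Q

Q = dyna_q()
```

The planning loop is what distinguishes Dyna-Q from plain Q-learning: each real step is amplified by many cheap simulated updates, so the value of the goal propagates back along the chain much faster. Dyna-Q+ adds an exploration bonus for long-untried state-actions on top of this scheme.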
Since it was introduced last year, ‘Universal Sentence Encoder (USE) for English’ has become one of the most downloaded pre-trained text modules in Tensorflow Hub, providing versatile sentence embedding models that convert sentences into vector representations. These vectors capture rich semantic information that can be used to train classifiers for a broad range of downstream tasks. For example, a strong sentiment classifier can be trained from as few as one hundred labeled examples, and still be used to measure semantic similarity and for meaning-based clustering. Today, we are pleased to announce the release of three new USE multilingual modules with additional features and potential applications. The first two modules provide multilingual models for retrieving semantically similar text, one optimized for retrieval performance and the other for speed and reduced memory usage. The third model is specialized for question-answer retrieval in sixteen languages (USE-QA), and represents an entirely new application of USE. All three multilingual modules are trained using a multi-task dual-encoder framework, similar to the original USE model for English, while using techniques we developed for improving the dual-encoder with an additive margin softmax approach. They are designed not only to maintain good transfer learning performance, but to perform well on semantic retrieval tasks.
Whenever you are handling data, you will always face features: the variables we take into account to describe our data. For example, if you are collecting data about houses in Milan, typical features might be position, dimension, floor and so on. However, it often happens that your data come with many features, sometimes hundreds of them… but do you need all of them? Keeping in mind the law of parsimony, we'd rather handle a dataset with few features: it will be far easier and faster to train. On the other hand, we do not want to lose important information while getting rid of some features.
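One minimal way to act on the law of parsimony, sketched here as a hypothetical illustration rather than the article's method, is to drop features that barely vary and therefore carry almost no information:

```python
# Variance-threshold feature reduction: a feature that is (nearly) constant
# across all rows cannot help a model discriminate between them.
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def drop_low_variance(rows, names, threshold=1e-8):
    """Keep only the columns whose variance exceeds the threshold."""
    cols = list(zip(*rows))
    keep = [i for i, col in enumerate(cols) if variance(col) > threshold]
    kept_names = [names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_names, kept_rows

# 'floor' is constant across these toy Milan listings, so it is removed.
names = ['position', 'dimension', 'floor']
rows = [[1.0, 80.0, 2.0], [2.0, 120.0, 2.0], [3.0, 65.0, 2.0]]
kept, reduced = drop_low_variance(rows, names)
```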
Recently I got an opportunity to work on Survival Analysis. Like any other project, I got excited and started to explore more about Survival Analysis. As per the wiki, ‘Survival analysis is a branch of statistics for analyzing the expected duration of time until one or more events happen, such as death in biological organisms and failure in mechanical systems.’ In short, it is a time-to-event analysis that focuses on the time at which the event of interest occurs. The event can be death, sensor failure, or occurrence of a disease, etc. Survival analysis is a popular field with a wide range of use cases in medicine, epidemiology, engineering, etc. Do you wonder how such a significant field is being transformed by Deep Learning? If yes, then you have come to the right place.
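Before any deep learning enters the picture, the classical workhorse of time-to-event analysis is the Kaplan-Meier estimator. A rough pure-Python sketch (real analyses would use a dedicated package; the data here are invented):

```python
# Kaplan-Meier: at each observed event time t, multiply the running survival
# probability by (n_at_risk - deaths_at_t) / n_at_risk. Censored subjects
# leave the risk set without triggering an event.
def kaplan_meier(observations):
    """observations: [(duration, event_observed)]; returns [(time, S(t))]."""
    data = sorted(observations)
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for d, e in data if d == t and e)
        censored = sum(1 for d, e in data if d == t and not e)
        if deaths:
            surv *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, surv))
        n_at_risk -= deaths + censored
        while i < len(data) and data[i][0] == t:
            i += 1
    return curve

# Four subjects: events at t=2, 3, 5; one censored at t=4.
curve = kaplan_meier([(2, True), (3, True), (4, False), (5, True)])
```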
In this guide I have tried to cover the different types and features of distances that can be used in K-Means Clustering.
Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. The performance of similarity measures is mostly addressed in two or three-dimensional spaces, beyond which, to the best of our knowledge, there is no empirical study that has revealed the behavior of similarity measures when dealing with high-dimensional datasets. To fill this gap, a technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility purposes, fifteen publicly available datasets were used for this study, and consequently, future distance measures can be evaluated and compared with the results of the measures discussed in this work. These datasets were classified as low and high-dimensional categories to study the performance of each measure against each category. This research should help the research community to identify suitable distance measures for datasets and also to facilitate a comparison and evaluation of the newly proposed similarity or distance measures with traditional ones.
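For concreteness, here are three of the distance/similarity measures such benchmarks typically compare, sketched in plain Python (the exact set of measures used in the study is not reproduced here):

```python
import math

# Three common point-to-point measures for distance-based clustering.
# Which one behaves best depends heavily on dimensionality and scaling.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity; insensitive to vector magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

a, b = (0.0, 3.0), (4.0, 0.0)  # orthogonal toy vectors
```

On these two vectors the three measures disagree sharply (5.0, 7.0, and 1.0 respectively), which is exactly why benchmarking them against real datasets matters.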
A demonstration of how MLflow can improve your ML modelling experience.
Simulating data is an invaluable tool. I use simulations to conduct power analyses, probe how robust methods are to violating assumptions, and examine how different methods handle different types of data. If I’m learning something new or writing a model from scratch, I’ll simulate data so that I know the correct answer – and make sure my model gives me that answer. But simulations can be complicated. Many other programming languages require for loops to do a process multiple times; nesting many conditional statements and other for loops within for loops can quickly be difficult to read and debug. In this post, I’ll show how I do modular simulations by writing R functions and using the apply family of R functions to repeat processes. I use examples from Paul Nahin’s book, Digital Dice: Computational Solutions to Practical Probability Problems, and I show how his MATLAB code differs from what is possible in R. My background is in the social sciences; I learned statistics as a tool to answer questions about psychology and behavior. Despite being a quantitative social scientist professionally now, I was not on the advanced math track in high school, and I never took a proper calculus class. I don’t know the theoretical math or how to derive things, but I am good at R programming and can simulate instead! All of these problems have derivations and theoretically-correct answers, but Nahin writes the book to show how simulation studies can achieve the same answer.
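The modular pattern described here, writing one function per trial and then mapping it over many repetitions, is language-agnostic. A toy sketch in Python for illustration (the post itself uses R functions and the apply family; this dice example is mine, not one of Nahin's problems):

```python
import random

# One function = one trial; the simulation driver just maps it n times.
# This keeps the trial logic testable and the repetition logic trivial.
def one_trial(rng):
    """Roll two fair six-sided dice and return their sum."""
    return rng.randint(1, 6) + rng.randint(1, 6)

def simulate(n_trials, seed=42):
    rng = random.Random(seed)  # seeded for reproducibility
    results = [one_trial(rng) for _ in range(n_trials)]
    return sum(results) / n_trials

estimate = simulate(100_000)  # should land close to the exact answer, 7
```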
In this tutorial we looked at how to solve an optimal assignment problem where each task or item had to be matched with a person to maximize the total liking value. We demonstrated that the brute-force approach of trying all combinations is intractable for large group sizes, but that the Hungarian algorithm can solve the problem in a fraction of a second. I hope this helped, and hopefully you will be inclined to try this out at your White Elephant this winter. I can see it also being useful for assigning tasks to household members or among the staff of your business. Happy holidays!
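To see why brute force fails at scale, here is a hypothetical sketch of the all-combinations approach in Python: it enumerates every permutation, so the work grows factorially with group size (10 people already means 3,628,800 orderings). In practice a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` replaces this loop:

```python
import itertools

# Brute-force assignment: try every permutation of items over people and
# keep the one maximizing total liking. O(n!) -- fine for 3 people,
# hopeless for 20.
def best_assignment_brute_force(liking):
    """liking[p][i] = how much person p likes item i.
    Returns (best_total, assignment) where assignment[p] is p's item."""
    n = len(liking)
    best_total, best_perm = float('-inf'), None
    for perm in itertools.permutations(range(n)):
        total = sum(liking[p][perm[p]] for p in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm

# A toy 3-person, 3-gift liking matrix (values invented).
liking = [
    [9, 2, 1],
    [1, 8, 3],
    [2, 3, 7],
]
total, assignment = best_assignment_brute_force(liking)
```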
In 1966, Artificial Intelligence pioneer Marvin Minsky, co-founder of MIT's Artificial Intelligence Laboratory, told one of his graduate students to interface a camera to a computer and make it describe what it sees. In the 50 years that followed, computers learned to count and classify, but still weren't able to truly see until now. Today, as of 2019, the field of computer vision is rapidly flourishing, holding vast potential to alleviate everything from healthcare disparities to mobility limitations on a global scale.

### What's new on arXiv

The connection of edges in a graph generates a structure that is independent of a coordinate system. This visual metaphor allows creating a more flexible representation of data than a two-dimensional scatterplot. In this work, we present STAD (Spanning Trees as Approximation of Data), a dimensionality reduction method to approximate the high-dimensional structure into a graph with or without formulating prior hypotheses. STAD generates an abstract representation of high-dimensional data by giving each data point a location in a graph which preserves the distances in the original high-dimensional space. The STAD graph is built upon the Minimum Spanning Tree (MST) to which new edges are added until the correlation between the distances from the graph and the original dataset is maximized. Additionally, STAD supports the inclusion of additional functions to focus the exploration and allow the analysis of data from new perspectives, emphasizing traits in data which otherwise would remain hidden. We demonstrate the effectiveness of our method by applying it to two real-world datasets: traffic density in Barcelona and temporal measurements of air quality in Castile and León in Spain.
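The MST foundation that STAD builds on can be computed in a few lines with Prim's algorithm. This pure-Python sketch over a dense distance matrix is illustrative, not the authors' implementation:

```python
# Prim's algorithm: grow the tree from node 0, always adding the cheapest
# edge that connects a tree node to a non-tree node. O(n^2) for a dense
# distance matrix, which is the natural input for STAD-style methods.
def prim_mst(dist):
    """dist: symmetric n x n distance matrix; returns list of (i, j) edges."""
    n = len(dist)
    in_tree = [False] * n
    in_tree[0] = True
    edges = []
    for _ in range(n - 1):
        best = None
        for i in range(n):
            if not in_tree[i]:
                continue
            for j in range(n):
                if in_tree[j]:
                    continue
                if best is None or dist[i][j] < best[0]:
                    best = (dist[i][j], i, j)
        _, i, j = best
        in_tree[j] = True
        edges.append((i, j))
    return edges

# Four points on a line: the MST is just the chain of nearest neighbors.
dist = [
    [0, 1, 4, 6],
    [1, 0, 2, 5],
    [4, 2, 0, 3],
    [6, 5, 3, 0],
]
edges = prim_mst(dist)
```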
One of the oldest problems in the data stream model is to approximate the $p$-th moment $\|\mathcal{X}\|_p^p = \sum_{i=1}^n |\mathcal{X}_i|^p$ of an underlying vector $\mathcal{X} \in \mathbb{R}^n$, which is presented as a sequence of poly$(n)$ updates to its coordinates. Of particular interest is when $p \in (0,2]$. Although a tight space bound of $\Theta(\epsilon^{-2} \log n)$ bits is known for this problem when both positive and negative updates are allowed, surprisingly there is still a gap in the space complexity when all updates are positive. Specifically, the upper bound is $O(\epsilon^{-2} \log n)$ bits, while the lower bound is only $\Omega(\epsilon^{-2} + \log n)$ bits. Recently, an upper bound of $\tilde{O}(\epsilon^{-2} + \log n)$ bits was obtained assuming that the updates arrive in a random order. We show that for $p \in (0, 1]$, the random order assumption is not needed. Namely, we give an upper bound for worst-case streams of $\tilde{O}(\epsilon^{-2} + \log n)$ bits for estimating $\|\mathcal{X}\|_p^p$. Our techniques also give new upper bounds for estimating the empirical entropy in a stream. On the other hand, we show that for $p \in (1,2]$, in the natural coordinator and blackboard communication topologies, there is an $\tilde{O}(\epsilon^{-2})$ bit max-communication upper bound based on a randomized rounding scheme. Our protocols also give rise to protocols for heavy hitters and approximate matrix product. We generalize our results to arbitrary communication topologies $G$, obtaining an $\tilde{O}(\epsilon^{-2} \log d)$ max-communication upper bound, where $d$ is the diameter of $G$. Interestingly, our upper bound rules out natural communication complexity-based approaches for proving an $\Omega(\epsilon^{-2} \log n)$ bit lower bound for $p \in (1,2]$ for streaming algorithms. In particular, any such lower bound must come from a topology with large diameter.
We propose symbolic learning as an extension to standard inductive learning models such as neural nets, as a means to solve few-shot learning problems. We devise a class of visual discrimination puzzles that calls for recognizing objects and object relationships, as well as learning higher-level concepts, from very few images. We propose a two-phase learning framework that combines models learned from large data sets using neural nets with symbolic first-order logic formulas learned from a few-shot learning instance. We develop first-order logic synthesis techniques for discriminating images by using symbolic search and logic constraint solvers. By augmenting neural nets with them, we develop and evaluate a tool that can solve few-shot visual discrimination puzzles with interpretable concepts.
While deep neural networks (NNs) do not provide the confidence of their predictions, a Bayesian neural network (BNN) can estimate the uncertainty of its predictions. However, BNNs have not been widely used in practice due to the computational cost of inference. This prohibitive computational cost is a hindrance, especially when processing stream data with low latency. To address this problem, we propose a novel model which approximates BNNs for data streams. Instead of generating a separate prediction for each data sample independently, this model estimates the increments of the prediction for a new data sample from the previous predictions. The computational cost of this model is almost the same as that of non-Bayesian NNs. Experiments with semantic segmentation on real-world data show that this model performs significantly faster than BNNs, estimating uncertainty comparable to the results of BNNs.
Machine learning components commonly appear in larger decision-making pipelines; however, the model training process typically focuses only on a loss that measures accuracy between predicted values and ground truth values. Decision-focused learning explicitly integrates the downstream decision problem when training the predictive model, in order to optimize the quality of decisions induced by the predictions. It has been successfully applied to several limited combinatorial problem classes, such as those that can be expressed as linear programs (LP), and submodular optimization. However, these previous applications have uniformly focused on problems from specific classes with simple constraints. Here, we enable decision-focused learning for the broad class of problems that can be encoded as a Mixed Integer Linear Program (MIP), hence supporting arbitrary linear constraints over discrete and continuous variables. We show how to differentiate through a MIP by employing a cutting planes solution approach, which is an exact algorithm that iteratively adds constraints to a continuous relaxation of the problem until an integral solution is found. We evaluate our new end-to-end approach on several real world domains and show that it outperforms the standard two phase approaches that treat prediction and prescription separately, as well as a baseline approach of simply applying decision-focused learning to the LP relaxation of the MIP.
The performance of deep neural networks, such as Deep Belief Networks formed by Restricted Boltzmann Machines (RBMs), strongly depends on their training, which is the process of adjusting their parameters. This process can be posed as an optimization problem over n dimensions. However, typical networks contain tens of thousands of parameters, making this a High-Dimensional Problem (HDP). Although different optimization methods have been employed for this goal, the use of most of the Evolutionary Algorithms (EAs) becomes prohibitive due to their inability to deal with HDPs. For instance, the Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) which is regarded as one of the most effective EAs, exhibits the enormous disadvantage of requiring $O(n^2)$ memory and operations, making it unpractical for problems with more than a few hundred variables. In this paper, we introduce a novel EA that requires $O(n)$ operations and memory, but delivers competitive solutions for the training stage of RBMs with over one million variables, when compared against CMA-ES and the Contrastive Divergence algorithm, which is the standard method for training RBMs.
The ever-evolving informatics technology has gradually bound humans and computers together in a compact way. Understanding user behavior becomes a key enabler in many fields, such as sedentary-related healthcare, human-computer interaction (HCI) and affective computing. Traditional sensor-based and vision-based user behavior analysis approaches are obtrusive in general, hindering their usage in the real world. Therefore, in this article, we first introduce the WiFi signal as a new source, instead of sensors and vision, for unobtrusive user behavior analysis. Then we design BeSense, a contactless behavior analysis system leveraging signal processing and computational intelligence over WiFi channel state information (CSI). We prototype BeSense on commodity low-cost WiFi devices and evaluate its performance in real-world environments. Experimental results have verified its effectiveness in recognizing user behaviors.
In today's world of enormous amounts of data, it is very important to extract useful knowledge from it. This can be accomplished by feature subset selection. Feature subset selection is a method of selecting a minimum number of features with the help of which our machine can learn and predict which class a particular data point belongs to. We introduce a new adaptive algorithm called the Feature Selection Penguin Search optimization algorithm, a metaheuristic approach. It is adapted from the natural hunting strategy of penguins, in which a group of penguins take jumps at random depths, then come back and share the status of food availability with the other penguins; in this way, the global optimum solution is found. In order to explore the feature subset candidates, the bio-inspired Penguin Search optimization algorithm generates a trial feature subset during the process and estimates its fitness value by using three different classifiers for each case: Random Forest, Nearest Neighbour and Support Vector Machines. We plan to implement our proposed Feature Selection Penguin Search optimization algorithm on some well-known benchmark datasets collected from the UCI repository, and to evaluate and compare its classification accuracy with some state-of-the-art algorithms.
We study the problem of distributed Kalman filtering for sensor networks in the presence of model uncertainty. More precisely, we assume that the actual state-space model belongs to a ball, in the Kullback-Leibler topology, about the nominal state-space model and whose radius reflects the mismatch modeling budget allowed for each time step. We propose a distributed Kalman filter with diffusion step which is robust with respect to the aforementioned model uncertainty. Moreover, we derive the corresponding least favorable performance. Finally, we check the effectiveness of the proposed algorithm in the presence of uncertainty through a numerical example.
Graph neural networks (GNNs) have emerged recently as a powerful architecture for learning node and graph representations. Standard GNNs have the same expressive power as the Weisfeiler-Leman test of graph isomorphism in terms of distinguishing non-isomorphic graphs. However, it was recently shown that this test cannot identify fundamental graph properties such as connectivity and triangle freeness. We show that GNNs also suffer from the same limitation. To address this limitation, we propose a more expressive architecture, k-hop GNNs, which updates a node’s representation by aggregating information not only from its direct neighbors, but from its k-hop neighborhood. We show that the proposed architecture can identify fundamental graph properties. We evaluate the proposed architecture on standard node classification and graph classification datasets. Our experimental evaluation confirms our theoretical findings since the proposed model achieves performance better or comparable to standard GNNs and to state-of-the-art algorithms.
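The k-hop neighborhood the proposed architecture aggregates over can be enumerated with a simple breadth-first frontier expansion. A small illustrative sketch (not the authors' code):

```python
# Collect all nodes within k edges of a given node by expanding a BFS
# frontier k times over an adjacency matrix.
def k_hop_neighbors(adj, node, k):
    """Return the set of nodes reachable from `node` in at most k edges."""
    frontier = {node}
    reached = {node}
    for _ in range(k):
        frontier = {j for i in frontier
                    for j, e in enumerate(adj[i]) if e} - reached
        reached |= frontier
    return reached - {node}

# Path graph 0-1-2-3: a standard GNN layer for node 0 sees only its direct
# neighbor (node 1), while 2-hop aggregation also sees node 2.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
one_hop = k_hop_neighbors(adj, 0, 1)
two_hop = k_hop_neighbors(adj, 0, 2)
```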
Capsule networks have gained a lot of popularity in a short time due to their unique approach to modeling equivariant class-specific properties as capsules from images. However, the dynamic routing algorithm comes with a steep computational complexity. In the proposed approach, we aim to create scalable versions of capsule networks that are much faster and provide better accuracy in problems with a higher number of classes. By using dynamic routing to extract intermediate features instead of generating output class-specific capsules, a large increase in computational speed has been observed. Moreover, by extracting equivariant feature capsules instead of class-specific capsules, the generalization capability of the network has also increased, resulting in a boost in accuracy.
Compressing giant neural networks has gained much attention for their extensive applications on edge devices such as cellphones. During the compressing process, one of the most important procedures is to retrain the pre-trained models using the original training dataset. However, due to the consideration of security, privacy or commercial profits, in practice, only a fraction of sample training data are made available, which makes the retraining infeasible. To solve this issue, this paper proposes to resort to unlabeled data in hand that can be cheaper to acquire. Specifically, we exploit the unlabeled data to mimic the classification characteristics of giant networks, so that the original capacity can be preserved nicely. Nevertheless, there exists a dataset bias between the labeled and unlabeled data, disturbing the mimicking to some extent. We thus fix this bias by an adversarial loss to make an alignment on the distributions of their low-level feature representations. We further provide theoretical discussions about how the unlabeled data help compressed networks to generalize better. Experimental results demonstrate that the unlabeled data can significantly improve the performance of the compressed networks.
Knowledge graph embedding models often suffer from a limitation of remembering existing triples to predict new triples. To overcome this issue, we introduce a novel embedding model, named R-MeN, that explores a relational memory network to model relationship triples. In R-MeN, we simply represent each triple as a sequence of 3 input vectors which recurrently interact with a relational memory. This memory network is constructed to incorporate new information using a self-attention mechanism over the memory and input vectors to return a corresponding output vector for every timestep. Consequently, we obtain 3 output vectors which are then multiplied element-wisely into a single one; and finally, we feed this vector to a linear neural layer to produce a scalar score for the triple. Experimental results show that our proposed R-MeN obtains state-of-the-art results on two well-known benchmark datasets WN11 and FB13 for triple classification task.
The machine learning community has been overwhelmed by a plethora of deep learning based approaches. Many challenging computer vision tasks, such as detection, localization, recognition and segmentation of objects in unconstrained environments, are being efficiently addressed by various types of deep neural networks, like convolutional neural networks, recurrent networks, adversarial networks, autoencoders and so on. While there have been plenty of analytical studies regarding the object detection or recognition domain, many new deep learning techniques have surfaced with respect to image segmentation. This paper approaches these various deep learning techniques of image segmentation from an analytical perspective. The main goal of this work is to provide an intuitive understanding of the major techniques that have made significant contributions to the image segmentation domain. Starting from some of the traditional image segmentation approaches, the paper progresses by describing the effect deep learning had on the image segmentation domain. Thereafter, most of the major segmentation algorithms are logically categorized, with paragraphs dedicated to their unique contributions. With an ample amount of intuitive explanations, the reader is expected to gain an improved ability to visualize the internal dynamics of these processes.
In this paper, we study a continuous-time discounted jump Markov decision process with both controlled actions and observations. The observation is only available for a discrete set of time instances. At each time of observation, one has to select an optimal timing for the next observation and a control trajectory for the time interval between two observation points. We provide a theoretical framework that the decision maker can utilize to find the optimal observation epochs and the optimal actions jointly. Two cases are investigated. One is gated queueing systems in which we explicitly characterize the optimal action and the optimal observation where the optimal observation is shown to be independent of the state. Another is the inventory control problem with Poisson arrival process in which we obtain numerically the optimal action and observation. The results show that it is optimal to observe more frequently at a region of states where the optimal action adapts constantly.
How to properly model graphs is a long-standing and important problem in the NLP area, where several popular types of graphs are knowledge graphs, semantic graphs and dependency graphs. Compared with other data structures, such as sequences and trees, graphs are generally more powerful in representing complex correlations among entities. For example, a knowledge graph stores real-world entities (such as ‘Barack_Obama’ and ‘U.S.’) and their relations (such as ‘live_in’ and ‘lead_by’). Properly encoding a knowledge graph is beneficial to user applications, such as question answering and knowledge discovery. Modeling graphs is also very challenging, probably because graphs usually contain massive and cyclic relations. Recent years have witnessed the success of deep learning, especially RNN-based models, on many NLP problems. Besides, RNNs and their variations have been extensively studied on several graph problems and have shown preliminary successes. Despite the successes that have been achieved, RNN-based models suffer from several major drawbacks on graphs. First, they can only consume sequential data, so linearization is required to serialize input graphs, resulting in the loss of important structural information. Second, the serialization results are usually very long, so it takes a long time for RNNs to encode them. In this thesis, we propose a novel graph neural network, named graph recurrent network (GRN). We study our GRN model on 4 very different tasks, such as machine reading comprehension, relation extraction and machine translation. Some take undirected graphs without edge labels, while the others have directed ones with edge labels. To account for these important differences, we gradually enhance our GRN model, by further considering edge labels and adding an RNN decoder. Carefully designed experiments show the effectiveness of GRN on all these tasks.
Approximate Nearest Neighbor Search (ANNS) in high dimensional space is essential in database and information retrieval. Recently, there has been a surge of interest in exploring efficient graph-based indices for the ANNS problem. Among them, the NSG has resurrected the theory of Monotonic Search Networks (MSNET) and achieved the state-of-the-art performance. However, the performance of the NSG deviates from a potentially optimal position due to the high sparsity of the graph. Specifically, though the average degree of the graph is small, their search algorithm travels a longer way to reach the query. Integrating both factors, the total search complexity (i.e., the number of distance calculations) is not minimized as they had hoped. In addition, NSG suffers from a high indexing time complexity, which limits the efficiency and the scalability of their method. In this paper, we aim to further mine the potential of the MSNETs. Inspired by the message transfer mechanism of the communication satellite system, we find a new family of MSNETs, namely the Satellite System Graphs (SSG). In particular, while inheriting the superior ANNS properties from the MSNET, we try to ensure that the angles between the edges are no smaller than a given value. Consequently, each node in the graph builds effective connections to its neighborhood omnidirectionally, which ensures an efficient search-routing on the graph like the message transfer among the satellites. We also propose an approximation of the SSG, Navigating SSG, to increase the efficiency of indexing. Both theoretical and extensive experimental analyses are provided to demonstrate the strengths of the proposed approach over the existing state-of-the-art algorithms. Our code has been released on GitHub.

### Generating a Gallery of Visualizations for a Static Website (using R)

(This article was first published on r on Tony ElHabr, and kindly contributed to R-bloggers)

While I was browsing the website of fellow R blogger Ryo Nakagawara1, I
was intrigued by his “Visualizations” page.
The concept of creating an online “portfolio” is not novel 2, but
I hadn’t thought to make one as a compilation of my own work (from blog posts)…
until now.

The code that follows shows how I generated the
body of my visualization portfolio page.
The task is achieved in a couple of steps.

1. Identify the file path of each blog post in my local blog folder

2. For each post, extract the date and title of the blog post from the front matter, as well as the name and links to the image files.

3. Combine the extracted information into a character vector that can be copy-pasted to a gallery page.

I should state a couple of caveats/notes for anyone looking to emulate my approach.

• I take advantage of the fact that I use the same prefix for (almost) all
visualizations that I generate with R—viz_3, as well as the same file
format—.png.

• At the time of writing, my website—a {blogdown}-based
website—uses Hugo’s page bundles
content organization system,
as well as the popular Academic theme for Hugo.
Thus, there’s no guarantee that the following code will work for you “as is”. 4
(If it doesn’t work as is, I think modifying the code should be fairly straightforward.)

• I create headers and links from the titles of the blog posts
(via something like sprintf('## %s, [%s](%s)', ...)), and I order everything
according to descending date and ascending line number within the post.
This may not be what you would like for your gallery format.

## The Code

library(tidyverse)

paths_post_raw <-
  fs::dir_ls(
    'content/post/',
    regexp = 'index[.]md$',
    recurse = TRUE
  ) %>%
  # Ignore the "_index.md" at the base of the content/post directory.
  # Would need to also ignore draft posts if there are draft posts.
  str_subset('_index', negate = TRUE)

paths_post_raw[1:10]

##  [1] "content/post/analysis-texas-high-school-academics-1-intro/index.md"
##  [2] "content/post/analysis-texas-high-school-academics-2-competitions/index.md"
##  [3] "content/post/analysis-texas-high-school-academics-3-individuals/index.md"
##  [4] "content/post/analysis-texas-high-school-academics-4-schools/index.md"
##  [5] "content/post/analysis-texas-high-school-academics-5-miscellaneous/index.md"
##  [6] "content/post/cheat-sheet-rmarkdown/index.md"
##  [7] "content/post/data-science-podcasts/index.md"
##  [8] "content/post/dry-principle-make-a-package/index.md"
##  [9] "content/post/gallery-visualizations/old/index.md"
## [10] "content/post/interval-data-nycflights13/index.md"

# Define some important regular expressions (or "regex"es).
# These regexes are probably applicable to most Hugo/blogdown setups.
rgx_replace <- '(content\\/post\\/)(.*)(\\/)(.*)([.]png$)'
rgx_title <- '^title[:]\\s+'
rgx_date <- 'date[:]\\s+'

# This regex is particular to the way that I name and save my ggplots.
rgx_viz <- '(^[!][\\[][\\]].*)(viz.*png)(.*$)'

# Define a helper function for a common idiom that we will implement for
# extracting the lines of markdown that we want---those containing the title,
# date, and visualization---and trimming them down to just the text that we
# want (i.e. removing "title:" and "date:" preceding the title and date in the
# YAML/TOML header, and removing the "![]" preceding an image).
str_pluck <- function(x, pattern, replacement = '') {
  x %>%
    str_subset(pattern) %>%
    str_replace_all(pattern = pattern, replacement = replacement) %>%
    str_trim()
}
str_pluck_title <- purrr::partial(str_pluck, pattern = rgx_title)
str_pluck_date <- purrr::partial(str_pluck, pattern = rgx_date)
str_pluck_viz <- purrr::partial(str_pluck, pattern = rgx_viz, replacement = '\\2')

# Extract the title, date, and visualizations from each post.
# Note that there should be only one title and date per post, but there are
# likely more than one visualization per post.
paths_post <-
  paths_post_raw %>%
  as.character() %>%
  tibble(path_post = .) %>%
  mutate(
    lines = path_post %>% purrr::map(read_lines)
  ) %>%
  mutate_at(vars(path_post), ~str_remove_all(., 'content|\\/index[.]md')) %>%
  mutate_at(
    vars(lines),
    list(
      title = ~purrr::map_chr(., str_pluck_title),
      date = ~purrr::map_chr(., str_pluck_date) %>% lubridate::ymd(),
      viz = ~purrr::map(., str_pluck_viz)
    )
  ) %>%
  # viz is a list item (because there may be more than one per post), so we
  # need to unnest() it to return a "tidy" data frame.
  unnest(viz) %>%
  select(date, viz, title, path_post)

paths_post

## # A tibble: 71 x 4
##    date       viz             title                   path_post
##
##  1 2018-05-20 viz_map_bycomp~ An Analysis of Texas H~ /post/analysis-texas~
##  2 2018-05-20 viz_map_bycomp~ An Analysis of Texas H~ /post/analysis-texas~
##  3 2018-05-20 viz_n_bycomplv~ An Analysis of Texas H~ /post/analysis-texas~
##  4 2018-05-20 viz_n_bycomp-1~ An Analysis of Texas H~ /post/analysis-texas~
##  5 2018-05-20 viz_n_bycompco~ An Analysis of Texas H~ /post/analysis-texas~
##  6 2018-05-20 viz_n_bycompco~ An Analysis of Texas H~ /post/analysis-texas~
##  7 2018-05-20 viz_persons_st~ An Analysis of Texas H~ /post/analysis-texas~
##  8 2018-05-20 viz_persons_st~ An Analysis of Texas H~ /post/analysis-texas~
##  9 2018-05-20 viz_persons_st~ An Analysis of Texas H~ /post/analysis-texas~
## 10 2018-05-20 viz_persons_st~ An Analysis of Texas H~ /post/analysis-texas~
## # ... with 61 more rows

# Create the markdown lines for images (visualizations) for our gallery
# markdown output.
paths_post_md <-
  paths_post %>%
  mutate(
    label_md = sprintf('![%s](%s/%s)', viz, path_post, viz)
  ) %>%
  select(title, date, path_post, label_md)

paths_post_md

## # A tibble: 71 x 4
##    title               date       path_post           label_md
##
##  1 An Analysis of Tex~ 2018-05-20 /post/analysis-tex~ ![viz_map_bycomplvl_~
##  2 An Analysis of Tex~ 2018-05-20 /post/analysis-tex~ ![viz_map_bycomplvl_~
##  3 An Analysis of Tex~ 2018-05-20 /post/analysis-tex~ ![viz_n_bycomplvl-1.~
##  4 An Analysis of Tex~ 2018-05-20 /post/analysis-tex~ ![viz_n_bycomp-1.png~
##  5 An Analysis of Tex~ 2018-05-20 /post/analysis-tex~ ![viz_n_bycompcomplv~
##  6 An Analysis of Tex~ 2018-05-20 /post/analysis-tex~ ![viz_n_bycompcomplv~
##  7 An Analysis of Tex~ 2018-05-20 /post/analysis-tex~ ![viz_persons_stats_~
##  8 An Analysis of Tex~ 2018-05-20 /post/analysis-tex~ ![viz_persons_stats_~
##  9 An Analysis of Tex~ 2018-05-20 /post/analysis-tex~ ![viz_persons_stats_~
## 10 An Analysis of Tex~ 2018-05-20 /post/analysis-tex~ ![viz_persons_stats_~
## # ... with 61 more rows

# Create the "main" data frame with the titles and dates in columns alongside
# the image column. In "tidy data" terminology, images are the "observations"
# (and title and date are separate variables).
content_gallery_raw <-
  paths_post_md %>%
  group_by(title, date, path_post) %>%
  # Add a "placeholder" line for the title of the post.
  do(add_row(., .before = 0)) %>%
  ungroup() %>%
  # In the first case, create the H2 markdown heading line for the name of the post.
  # In the second case, use the image markdown line created above.
  mutate_at(
    vars(label_md),
    ~case_when(
      is.na(.) ~ sprintf('## %s, [%s](%s)', dplyr::lead(date), dplyr::lead(title), dplyr::lead(path_post)),
      TRUE ~ .
    )
  ) %>%
  # Impute the title and date values to go with the image values.
  fill(title, .direction = 'up') %>%
  fill(date, .direction = 'up') %>%
  arrange(date) %>%
  # Number the posts in order of descending date.
  mutate(idx_intragrp = dense_rank(sprintf('%s, %s', date, title))) %>%
  group_by(title, date) %>%
  # Number the images within each post.
  # (This isn't completely necessary. It's only used for sorting.)
  mutate(idx_intergrp = row_number()) %>%
  ungroup() %>%
  select(idx_intragrp, idx_intergrp, date, label_md) %>%
  arrange(desc(idx_intragrp), idx_intergrp)

content_gallery_raw

## # A tibble: 90 x 4
##    idx_intragrp idx_intergrp date       label_md
##
##  1           19            1 2019-06-29 ## 2019-06-29, [Text Parsing and T~
##  2           19            2 2019-06-29 ![viz_toc_n_1yr_tree.png](/post/te~
##  3           19            3 2019-06-29 ![viz_content_section_n.png](/post~
##  4           19            4 2019-06-29 ![viz_toc_content_n1.png](/post/te~
##  5           19            5 2019-06-29 ![viz_sents_section_n.png](/post/t~
##  6           19            6 2019-06-29 ![viz_sents_section_n_yr.png](/pos~
##  7           19            7 2019-06-29 ![viz_sents_section_sim.png](/post~
##  8           19            8 2019-06-29 ![viz_words_section_tfidf.png](/po~
##  9           19            9 2019-06-29 ![viz_words_tfidf.png](/post/text-~
## 10           18            1 2019-01-27 ## 2019-01-27, [Summarizing rstudi~
## # ... with 80 more rows

# Create the final markdown output.
content_gallery <-
  content_gallery_raw %>%
  select(label_md) %>%
  mutate(idx = row_number()) %>%
  # Add a blank line between the end of one section's last image
  # and the next section's H2 header.
  group_by(idx) %>%
  do(add_row(., label_md = '', .before = 0)) %>%
  ungroup()

content_gallery

## # A tibble: 180 x 2
##    label_md                                                             idx
##
##  1 ""                                                                    NA
##  2 ## 2019-06-29, [Text Parsing and Text Analysis of a Periodic Repo~     1
##  3 ""                                                                    NA
##  4 ![viz_toc_n_1yr_tree.png](/post/text-parsing-analysis-periodic-re~     2
##  5 ""                                                                    NA
##  6 ![viz_content_section_n.png](/post/text-parsing-analysis-periodic~     3
##  7 ""                                                                    NA
##  8 ![viz_toc_content_n1.png](/post/text-parsing-analysis-periodic-re~     4
##  9 ""                                                                    NA
## 10 ![viz_sents_section_n.png](/post/text-parsing-analysis-periodic-r~     5
## # ... with 170 more rows

# Copy-paste this to the markdown file for the gallery page.
content_copypaste <- content_gallery %>% pull(label_md)
```r
# It's probably possible to do this a bit more programmatically (i.e. without
# "manually" copying into a markdown file), but oh well.
clipr::write_clip(content_copypaste)
```

1. one of my favorite R bloggers, by the way
2. in fact, many people dedicate websites exclusively to showing off work that they’ve done.
3. I apologize if I’ve offended English speakers/readers who use/prefer “s” to “z” (for “viz”). I’m American, and nearly all Americans use “z”!
4. Because Hugo websites follow a standard structure, I don’t think the choice of theme should prevent this from working for someone else’s website, but I figured I would mention the theme.

To leave a comment for the author, please follow the link and comment on their blog: r on Tony ElHabr. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... Continue Reading…

### Adding Syntax Highlight

(This article was first published on R on notast, and kindly contributed to R-bloggers)

# Syntax highlighting

Previously, I posted entries without any syntax highlighting, as I was satisfied using basic blogdown and Hugo functions, until a Disqus member commented on the previous post suggesting syntax highlighting. Thus, I tasked myself with learning more about syntax highlighting and implementing it in future posts. Now I’d like to share what I’ve learned.

# Themes with embedded syntax highlight feature

Some Hugo themes have built-in syntax highlighting functions. For instance, blogdown’s default theme, hugo-lithium, has a Highlight.js option readily available. For the hugo-lithium theme, you can access the details of Highlight.js in the config.toml file.
```toml
highlightjsVersion = "9.12.0"
highlightjsCDN = "//cdnjs.cloudflare.com/ajax/libs"
highlightjsLang = ["r", "yaml"]
highlightjsTheme = "github"
```

- highlightjsVersion refers to the version of Highlight.js.
- highlightjsCDN refers to the CDN provider of Highlight.js. For each CDN provider and version number, you can check whether the coding language you intend to highlight and the style of the highlight are available.
- highlightjsLang refers to the coding languages to be highlighted. By default, r and yaml are highlighted. You can add other languages as long as the CDN provider and version number support highlighting them.
- highlightjsTheme refers to the colour theme for the highlighted syntax. You can preview the various themes for different languages on Highlight.js’s demo page (https://highlightjs.org/static/demo/) before deciding which theme to adopt for your own site.

# Themes without embedded syntax highlight feature

There are many themes which do not have built-in highlighting functions, including the Mainroad theme which I’m using. There are two approaches to add syntax highlighting for these Hugo themes.

## blogdown textbook approach

1. In the book, Xie, Thomas and Hill recommend adding a snippet to the head_custom.html file. For the Mainroad theme, I added the script to the bottom of the head.html file. You can change the colour theme by replacing the github theme with your desired theme.
2. Next, they have you add a snippet to foot_custom.html. For the Mainroad theme, I added the script to the bottom of the footer.html file.

## Amber Thomas’s approach

1. Download Highlight.js for R.
2. From the file you downloaded, copy the highlight.pack.js file and paste it into the js folder for your Hugo theme. For the Mainroad theme, I accessed it via themes -> Mainroad -> static -> js.
3. From the file you downloaded, go to the style subfolder, copy the css file of your desired syntax colour theme, and paste it into the css folder for your Hugo theme. For the Mainroad theme, I found it via themes -> Mainroad -> static -> css.
4. Add a snippet to the header.html file. For the Mainroad theme, I found the file via themes -> Mainroad -> layouts -> partials. You can change the github-gist theme to your selected theme.

As the Hugo Mainroad theme displays code chunks in a faint shade of grey, I chose the routeros highlighter, as it has a similar light grey as its background, which complements the Mainroad theme.

# Verify syntax highlighting

Apart from visually inspecting the changes on your site, you can go geek mode and verify that the highlighted syntax is based on your selected Highlight.js theme. If you are using Microsoft Edge, select a paragraph of plain text, right-click, and choose “Inspect element”. You will see two boxes. Examine the box which has Style as one of its sub-tabs; from that sub-tab, note the css style applied to the plain text. Now do the same for a chunk of R script: you will see that the css style is the name of the Highlight.js theme. For this blog, it is routeros.css.

To leave a comment for the author, please follow the link and comment on their blog: R on notast. Continue Reading…

### Germination data and time-to-event methods: comparing germination curves

(This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers)

Very often, seed scientists need to compare the germination behaviour of different seed populations, e.g., different plant species, or one single plant species submitted to different temperatures, light conditions, priming treatments and so on. How should such a comparison be performed?
Let’s take a practical approach and start from an appropriate example: a few years ago, some colleagues studied the germination behaviour of seeds of a plant species (Verbascum arcturus, BTW…) in different conditions. In detail, they considered the factorial combination of two storage periods (LONG and SHORT storage) and two temperature regimes (FIX: constant daily temperature of 20°C; ALT: alternating daily temperature regime, with 25°C during daytime and 15°C during night time, with a 12:12h photoperiod). If you are a seed scientist and are interested in this experiment, you’ll find the details in Catara et al. (2016). If you are not a seed scientist, you may wonder why my colleagues made such an assay; well, there is evidence that, for some plant species, the germination ability improves over time after seed maturation. Therefore, if we take seeds and store them for different periods of time, there might be an effect on their germination traits. Likewise, there is also evidence that some seeds may not germinate unless they are submitted to daily temperature fluctuations. These mechanisms are very interesting, as they permit the seed to recognise that the environmental conditions are favourable for seedling survival. My colleagues wanted to discover whether this was the case for Verbascum. Let’s go back to our assay: the experimental design consisted of four combinations (LONG-FIX, LONG-ALT, SHORT-FIX and SHORT-ALT) and four replicates for each combination. One replicate consisted of a Petri dish, that is, a small plastic box containing humid blotting paper, with 25 seeds of Verbascum. In all, there were 16 Petri dishes, put in climatic chambers with the appropriate conditions. During the assay, my colleagues made daily inspections: germinated seeds were counted and removed from the dishes. Inspections were made for 15 days, until no more germinations could be observed. The dataset is available from a GitHub repository: let’s load it and have a look.
```r
dataset <- read.csv("https://raw.githubusercontent.com/OnofriAndreaPG/agroBioData/master/TempStorage.csv",
                    header = T, check.names = F)
head(dataset)
##   Dish Storage Temp 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1    1     Low  Fix 0 0 0 0 0 0 0 0 3  4  6  0  1  0  3
## 2    2     Low  Fix 0 0 0 0 1 0 0 0 2  7  2  3  0  5  1
## 3    3     Low  Fix 0 0 0 0 1 0 0 1 3  5  2  4  0  1  3
## 4    4     Low  Fix 0 0 0 0 1 0 3 0 0  3  1  1  0  4  4
## 5    5    High  Fix 0 0 0 0 0 0 0 0 1  2  5  4  2  3  0
## 6    6    High  Fix 0 0 0 0 0 0 0 0 2  2  7  8  1  2  1
```

We have one row per Petri dish; the first three columns show, respectively, the dish number, storage and temperature conditions. The next 15 columns represent the inspection times (from 1 to 15) and contain the counts of germinated seeds. The research question is: is germination behaviour affected by storage and temperature conditions?

# Response feature analyses

One possible line of attack is to take a summary measure for each dish, e.g. the total number of germinated seeds. Taking a single value for each dish brings us back to more common methods of data analysis: for example, we can fit some sort of GLM to test the significance of effects (storage, temperature and their interaction) within a fully factorial design. Although the above method is not wrong, it may well be sub-optimal. Indeed, dishes may contain the same total number of germinated seeds but nonetheless differ in other germination traits, such as velocity or uniformity. We do not want to express a judgment about one specific characteristic of the seed lot; we would like to express a judgment about the whole seed lot. In other words, we are not specifically asking, “do the seed lots differ in their germination capability?”. We are, more generally, asking, “are the seed lots different?”. In order to get a general assessment, a different method of analysis should be sought, one which considers the entire time series (from 1 to 15 days) and not only one single summary measure.
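Before moving on, here is a minimal sketch of the response-feature GLM just described. The design mirrors the experiment (two storage levels, two temperature regimes, four dishes each, 25 seeds per dish), but the germination totals below are made-up numbers for illustration, not the real data:

```r
# Response-feature sketch: one summary value (total germinated out of 25) per
# dish, modelled with a binomial GLM in a fully factorial design.
# The 'germinated' counts are invented for illustration.
dishes <- data.frame(
  Storage = rep(c("Low", "High"), each = 8),
  Temp    = rep(rep(c("Fix", "Alt"), each = 4), times = 2),
  germinated = c(17, 21, 20, 18, 15, 19, 16, 17,
                 23, 22, 24, 21, 20, 22, 19, 21)
)
fit <- glm(cbind(germinated, 25 - germinated) ~ Storage * Temp,
           family = binomial, data = dishes)
# Analysis of deviance for the main effects and their interaction.
anova(fit, test = "Chisq")
```

The limitation the post goes on to address is visible here: the model only sees one number per dish, so two dishes with the same total but very different germination speeds are indistinguishable.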
This method exists and is available within the time-to-event platform, which has proven very useful and appropriate for seed germination studies (Onofri et al., 2011; Ritz et al., 2013; Onofri et al., 2019).

# The germination time-course

It is necessary to re-organise the dataset in a more useful way. A good format can be obtained by using the ‘makeDrm()’ function in the ‘drcSeedGerm’ package, which can be installed from GitHub (see the code at: this link). The function needs to receive a dataframe storing the counts (dataset[,4:18]), a dataframe storing the factor variables (dataset[,2:3]), a vector with the number of seeds in each Petri dish (rep(25, 16)) and a vector of monitoring times (1:15).

```r
library(drcSeedGerm)
datasetR <- makeDrm(dataset[,4:18], dataset[,2:3], rep(25, 16), 1:15)
head(datasetR, 16)
##      Storage Temp Dish timeBef timeAf count nCum propCum
## 1        Low  Fix    1       0      1     0    0    0.00
## 1.1      Low  Fix    1       1      2     0    0    0.00
## 1.2      Low  Fix    1       2      3     0    0    0.00
## 1.3      Low  Fix    1       3      4     0    0    0.00
## 1.4      Low  Fix    1       4      5     0    0    0.00
## 1.5      Low  Fix    1       5      6     0    0    0.00
## 1.6      Low  Fix    1       6      7     0    0    0.00
## 1.7      Low  Fix    1       7      8     0    0    0.00
## 1.8      Low  Fix    1       8      9     3    3    0.12
## 1.9      Low  Fix    1       9     10     4    7    0.28
## 1.10     Low  Fix    1      10     11     6   13    0.52
## 1.11     Low  Fix    1      11     12     0   13    0.52
## 1.12     Low  Fix    1      12     13     1   14    0.56
## 1.13     Low  Fix    1      13     14     0   14    0.56
## 1.14     Low  Fix    1      14     15     3   17    0.68
## 1.15     Low  Fix    1      15    Inf     8   NA      NA
```

The snippet above shows the first dish. Roughly speaking, we have gone from a WIDE format to a LONG format. The column ‘timeAf’ contains the time when an inspection was made and the column ‘count’ contains the number of germinated seeds (e.g. 3 seeds were counted at day 9). These seeds did not germinate exactly at day 9; they germinated within the interval between two inspections, that is, between day 8 and day 9. The beginning of the interval is given in the variable ‘timeBef’. Apart from these columns, we have additional columns, which we are not going to use for our analyses.
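The bookkeeping behind the ‘nCum’ and ‘propCum’ columns is just a running sum of the interval counts, divided by the number of seeds per dish. A quick base-R check, with the dish-1 counts transcribed from the output above:

```r
# Interval counts for dish 1 (inspection days 1..15), from the output above.
counts <- c(0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 6, 0, 1, 0, 3)
n_seeds <- 25
nCum <- cumsum(counts)      # cumulative germinated counts: ends at 17
propCum <- nCum / n_seeds   # cumulative proportions: ends at 0.68
n_seeds - tail(nCum, 1)     # 8 seeds remain ungerminated (right-censored)
```

These reproduce the table exactly, including the 8 right-censored seeds assigned to the final (15, Inf) interval.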
The cumulative counts of germinated seeds are in the column ‘nCum’; these cumulative counts have been converted into cumulative proportions by dividing by 25 (i.e., the total number of seeds in a dish; see the column ‘propCum’). We can use a time-to-event model to parameterise the germination time-course for this dish. This is easily done by using the ‘drm()’ function in the ‘drc’ package (Ritz et al., 2013):

```r
modPre <- drm(count ~ timeBef + timeAf, fct = LL.3(), data = datasetR,
              type = "event", subset = c(Dish == 1))
plot(modPre, log = "", xlab = "Time", ylab = "Proportion of germinated seeds",
     xlim = c(0, 15))
```

Please note the following:

1. we are using the counts (‘count’) as the dependent variable;
2. as the independent variable, we are using the extremes of the inspection interval within which germinations were observed (count ~ timeBef + timeAf);
3. we have assumed a log-logistic distribution of germination times (fct = LL.3()). A three-parameter model is necessary, because there is a final fraction of ungerminated seeds (truncated distribution);
4. we have set the argument ‘type = "event"’. Indeed, we are fitting a time-to-event model, not a nonlinear regression model, which would be incorrect in this setting (see this link here).

As we have determined the germination time-course for dish 1, we can do the same for all dishes. However, we have to instruct ‘drm()’ to define a different curve for each combination of storage and temperature, by making appropriate use of the ‘curveid’ argument. Please see below.

```r
mod1 <- drm(count ~ timeBef + timeAf, fct = LL.3(), data = datasetR,
            type = "event", curveid = Temp:Storage)
plot(mod1, log = "", legendPos = c(2, 1))
```

It appears that there are visible differences between the curves (the legend lists the curves in alphabetical order, i.e. 1: Fix-Low, 2: Fix-High, 3: Alt-Low and 4: Alt-High).
We can test whether the curves are similar by coding a reduced model, where we have only one pooled curve for all treatment levels. It is enough to remove the ‘curveid’ argument.

```r
modNull <- drm(count ~ timeBef + timeAf, fct = LL.3(), data = datasetR,
               type = "event")
anova(mod1, modNull, test = "Chisq")
##
## 1st model
##  fct:     LL.3()
##  pmodels: Temp:Storage (for all parameters)
## 2nd model
##  fct:     LL.3()
##  pmodels: 1 (for all parameters)
##
## ANOVA-like table
##
##           ModelDf  Loglik Df LR value p value
## 1st model     244 -753.54
## 2nd model     253 -854.93  9   202.77       0
```

Now, we can compare the full model (four curves) with the reduced model (one common curve) by using a Likelihood Ratio Test, whose statistic is approximately distributed as a Chi-square. The test is highly significant. Of course, we can also test some other hypotheses. For example, we can code a model with different curves for storage times, assuming that the effect of temperature is irrelevant:

```r
mod2 <- drm(count ~ timeBef + timeAf, fct = LL.3(), data = datasetR,
            type = "event", curveid = Storage)
anova(mod1, mod2, test = "Chisq")
##
## 1st model
##  fct:     LL.3()
##  pmodels: Temp:Storage (for all parameters)
## 2nd model
##  fct:     LL.3()
##  pmodels: Storage (for all parameters)
##
## ANOVA-like table
##
##           ModelDf  Loglik Df LR value p value
## 1st model     244 -753.54
## 2nd model     250 -797.26  6   87.436       0
```

We see that such an assumption (temperature effect is irrelevant) is not supported by the data: the temperature effect cannot be removed without causing a significant decrease in the likelihood of the model.
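These printed numbers can be cross-checked by hand: the LR statistic is twice the difference in log-likelihoods, referred to a chi-square with df equal to the difference in model df, and the same log-likelihoods also reproduce the AIC values reported near the end of the post (AIC = 2k − 2·logLik):

```r
# Cross-check the LR test of mod1 vs modNull from the printed log-likelihoods.
ll_full    <- -753.54   # mod1: four curves, ModelDf = 244
ll_reduced <- -854.93   # modNull: one pooled curve, ModelDf = 253
lr <- 2 * (ll_full - ll_reduced)                      # 202.78, matching the table's 202.77
p  <- pchisq(lr, df = 253 - 244, lower.tail = FALSE)  # df = 9; p is essentially 0
# The same quantities reproduce the AIC values reported later in the post:
2 * 244 - 2 * ll_full     # 1995.08 (AIC of mod1:    1995.088)
2 * 253 - 2 * ll_reduced  # 2215.86 (AIC of modNull: 2215.862)
```

The tiny discrepancies are just rounding in the printed log-likelihoods.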
Similarly, we can test the effect of storage:

```r
mod3 <- drm(count ~ timeBef + timeAf, fct = LL.3(), data = datasetR,
            type = "event", curveid = Temp)
anova(mod1, mod3, test = "Chisq")
##
## 1st model
##  fct:     LL.3()
##  pmodels: Temp:Storage (for all parameters)
## 2nd model
##  fct:     LL.3()
##  pmodels: Temp (for all parameters)
##
## ANOVA-like table
##
##           ModelDf  Loglik Df LR value p value
## 1st model     244 -753.54
## 2nd model     250 -849.48  6   191.87       0
```

Again, we get significant results. So, we conclude that temperature and storage time had a significant influence on the germination behaviour of the species under study. Before concluding, it is necessary to mention that, in general, the above LR tests should be taken with care: the results are only approximate, and the observed data are not totally independent, as multiple observations are taken in each experimental unit (Petri dish). In order to restore independence, we would need to add a random effect for the Petri dish to the model, which is not an easy task in a time-to-event framework (Onofri et al., 2019). However, we got very low p-values, which leave us rather confident about the significance of the effects. It may be a good suggestion, in general, to avoid formal hypothesis testing and to compare the models by using the Akaike Information Criterion (AIC: the lowest is the best), which confirms that the complete model with four curves is, indeed, the best one.

```r
AIC(mod1, mod2, mod3, modNull)
##          df      AIC
## mod1    244 1995.088
## mod2    250 2094.524
## mod3    250 2198.961
## modNull 253 2215.862
```

For those who are familiar with linear model parameterisation, it is possible to reach an even higher degree of flexibility by using the ‘pmodels’ argument within the ‘drm()’ function. However, that will require another post. Thanks for reading!

# References

1. Catara, S., Cristaudo, A., Gualtieri, A., Galesi, R., Impelluso, C., Onofri, A., 2016.
Threshold temperatures for seed germination in nine species of Verbascum (Scrophulariaceae). Seed Science Research 26, 30–46.
2. Onofri, A., Mesgaran, M.B., Tei, F., Cousens, R.D., 2011. The cure model: an improved way to describe seed germination? Weed Research 51, 516–524.
3. Onofri, A., Piepho, H.-P., Kozak, M., 2019. Analysing censored data in agricultural research: A review with examples and software tips. Annals of Applied Biology 174, 3–13. https://doi.org/10.1111/aab.12477
4. Ritz, C., Pipper, C.B., Streibig, J.C., 2013. Analysis of germination data from agricultural experiments. European Journal of Agronomy 45, 1–6.

To leave a comment for the author, please follow the link and comment on their blog: R on The broken bridge between biologists and statisticians. Continue Reading…

### Youngsters are avoiding Facebook—but not the firm’s other platforms

Facebook owes its resilience to savvy acquisitions and tolerant regulators Continue Reading…

## July 19, 2019

### A rise in premature publications among politically engaged researchers may be linked to Trump’s election, study says

A couple of people pointed me to this news story, “A rise in premature births among Latina women may be linked to Trump’s election, study says,” and the associated JAMA article, which begins:

Question: Did preterm births increase among Latina women who were pregnant during the 2016 US presidential election?

Findings: This population-based study used an interrupted time series design to assess 32.9 million live births and found that the number of preterm births among Latina women increased above expected levels after the election.
Meaning: The 2016 presidential election may have been associated with adverse health outcomes of Latina women and their newborns.

Hmmm, the research article says “may have been associated” but then ups that to “appears to have been associated.” On one hand, I find it admirable that JAMA will publish a paper with such an uncertain conclusion. On the other hand, the conclusions got stronger once they made their way into news reports. In the above-linked article, “may have been associated” becomes “an association was found” and then “We think there are very few alternative explanations for these results.”

There’s also a selection issue. It’s fine to report maybes, but then why this particular maybe? There are lots and lots of associations that may be happening, right?

Let’s look at the data

In any case, they did an interrupted time series analysis, so let’s see the time series: I don’t think the paper’s claim, “In the 9-month period beginning with November 2016, an additional 1342 male (95% CI, 795-1889) and 995 female (95% CI, 554-1436) preterm births to Latina women were found above the expected number of preterm births had the election not occurred,” is at all well supported by these data. But you can make your own judgement here. Also, I’m surprised they are analyzing raw numbers of pre-term births rather than rates.

In general

Medical journals sometimes seem to show poor judgment when it comes to stories that agree with their political prejudices. See for example here and here. Look. Don’t get me wrong. This topic is important. We’d like to minimize preterm births, and graphs such as shown above (ideally using rates, not counts, I think) should be a key part of a monitoring system that will allow us to notice problems. It should be possible to look at such time series without pulling out one factor and wrapping this sort of story around it. I think this is a problem with scientific publication: journals and the news media want to publish big splashy claims.
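To illustrate the counts-versus-rates point with invented numbers (not data from the JAMA study): the same underlying preterm rate produces different monthly counts whenever the total number of births fluctuates, so a rise in raw counts need not reflect any change in risk.

```r
# Made-up illustration: a constant preterm *rate* still produces varying
# monthly *counts* when the total number of births varies.
births  <- c(80000, 90000, 85000)  # hypothetical monthly totals of Latina births
preterm <- round(births * 0.095)   # constant 9.5% preterm rate
preterm                            # counts vary: 7600 8550 8075
preterm / births                   # rate is flat: 0.095 0.095 0.095
```

This is why a monitoring system built on rates is more informative than one built on counts.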
Continue Reading…

### Analysis of a Flash Flood

(This article was first published on RClimate, and kindly contributed to R-bloggers)

Flash floods seem to be increasing in many areas. This post will show how to download local USGS flow and precipitation data and generate a 3-panel chart of flow, gage height and precipitation. There was a tragic flash flood in Berks County, Pennsylvania on July 11-12, 2019 that caused the deaths of 3 people, including a pregnant woman and her young son. This newspaper article (link) provides some details on this tragic event. The timing of this flash flooding was interesting to me because the woman and her son were trapped in their car by a small tributary to the Schuylkill River at 4:30 PM on Thursday, July 11, and flooding in Philadelphia started about 1:30 AM on Friday, July 12. The tragic drowning occurred on the Manatawny Creek, about 35 miles northwest of Philadelphia.

I plan to write a series of posts to document the rainfall patterns and the timing of the rainfall and subsequent flooding. This initial post will show the flow hydrograph, cumulative precipitation and gage height for the USGS’s Schuylkill River gaging station #01474500 at Fairmount Dam. Follow-up posts will review the upstream USGS data as well as National Weather Service rainfall data. Here is the 3-panel chart showing the hydrograph, precipitation data and gage height data for the period July 9-16, 2019.

Here’s the R script for analyzing and plotting the csv file downloaded from the USGS site.

#################################################################################################################
## Load libraries and get source data file
library(lubridate)  # needed below for mdy_hm(), month(), day(), year()
link <- "C:\\Users\\Kelly O'Day\\Desktop\\Schuylkill_flood\\CSV_files\\Schuylkil_Fairmont_Dam_7_days_july_2019.csv"
#(link <- choose.files())
df <- read.csv(link, as.is=T)
df$dt <- mdy_hm(df$datetime)
dt_min <- df$dt[1]
dt_max <- df$dt[nrow(df)] peak_flow <- max(df$cfs, na.rm=T)
peak_flow_comma <- format(peak_flow, big.mark=',')
peak_dt <- df$dt[which(df$cfs == peak_flow)]
from_to <- paste(month(dt_min),"/",day(dt_min), " to ", month(dt_max),"/",day(dt_max),"/",year(dt_max),sep="")

# Create 3 panel charts for flow, precipitation and gage height
par(mfrow = c(3,1),ps=10, pty="m",xaxs="r",yaxs="r")
par(las=1,oma=c(2.5,2,3,1), mar=c(2,5,3,0))
par(mgp=c(4,1,0)) # Number of margin lines for title, axis labels and axis line

## Flow chart
plot_title <- paste("Flow - cfs")
plot(df$dt, df$cfs, type="l", xlab="", ylab = "Flow -cfs",xlim=c(dt_min,dt_max), yaxt="n",
main =plot_title)
axis(side=2, at=axTicks(2),
labels=formatC(axTicks(2), format="d", big.mark=','))

points(peak_dt, peak_flow, pch=16, col="red")
spacer <- 4*60*60
note <- paste("Peak @ ", peak_flow_comma, " cfs \n", peak_dt)
abline(h = seq(10000,50000,10000), col = "grey")
abline(v= seq(dt_min,dt_max,by="day"), col="grey")
text(peak_dt+ spacer,peak_flow * 0.975, note, adj=0, cex=1)
################################################
# Sheet title annotation
mtext("Flood Conditions @ Fairmount Dam (July 9-16, 2019)", side = 3, line = 3, adj = 0.5, cex = 1.2)

##### Precipitation data analysis & chart
df$cum_precip <- cumsum(df$inches)
tot_precip <- df$cum_precip[nrow(df)]
precip_st_row <- length(subset(df$cum_precip, df$cum_precip == 0))
precip_st_dt <- df$dt[precip_st_row]

precip_note <- paste0("Total Precip - ",tot_precip, " inches")
precip_subset <- subset(df, df$cum_precip == tot_precip)
precip_end_dt <- mdy_hm(precip_subset[1,1])
#precip_end_dt_t <- mdy_hm(precip_end_dt)
plot(df$dt, df$cum_precip, type="l", xlab="", ylab = "Precipitation -inches",
     xlim=c(dt_min,dt_max), main = "Cumulative Precipitation - Inches")
points(precip_st_dt, 0, pch=12, col = "blue")
points(precip_end_dt, tot_precip, pch=12, col="blue")
abline(v = seq(dt_min, dt_max, by="day"), col="grey")
text(dt_min, tot_precip, precip_note, adj=0)
precip_st_note <- paste0(" Precipitation Starts @ ", precip_st_dt)
dur <- precip_end_dt - precip_st_dt
precip_end_note <- paste0(" Precipitation ends @ ", precip_end_dt, "\n ", dur, " hours")
text(precip_end_dt, tot_precip * 0.9, precip_end_note, adj=0, cex = 1)
text(precip_st_dt, 0, precip_st_note, adj=0, cex = 1)

#### Gage height chart
gage_act_df <- subset(df, df$gage >= 10)
gage_act_dt <- gage_act_df$dt[1]
gage_act_note <- paste0("Gage Action Level @ \n", gage_act_dt)
plot(df$dt, df$gage, type="l", xlab="", ylab = "Gage Height - Ft",
     xlim=c(dt_min,dt_max), main ="Gage height - ft")
abline(h = 10, col="brown")
abline(h = 11, col = "red", lwd = 1)
abline(v = seq(dt_min, dt_max, by="day"), col="grey")
points(gage_act_dt, gage_act_df[1,3], pch=11, col = "black")
text(gage_act_dt - 1*4*3600, 10, gage_act_note, adj=1)
min_gage <- min(df$gage, na.rm=T)
max_gage <- max(df$gage, na.rm=T)
delta_gage <- max_gage - min_gage
gage_note <- paste("Max Gage @ ", max_gage, " ft\nGage Increase ", delta_gage, " ft")

flood_act_note <- "Flood Action stage @ 10-ft"
flood_stage_note <-"Flood stage @ 11-ft"
text(dt_max-60*60,10.25, flood_act_note, adj = 1, cex=0.95)
text(dt_max - 1*1*60*60,11.25, flood_stage_note, adj = 1, cex=0.95)

##########################################
# Sheet annotation - Footer Notes
mtext("K O'Day - 7/19/19", side = 1, line = 3, adj = 0, cex = 0.9)
mtext("Data Source: USGS: https://waterdata.usgs.gov/nwis/inventory/?site_no=01474500", side=1, line = 2, adj=0, cex = 0.9)

# png(file="C:\\Users\\Kelly O'Day\\Desktop\\Schuylkill_flood\\myplot.png", bg="white")

dev.off()




(This article was first published on Revolutions, and kindly contributed to R-bloggers)

### Four short links: 19 July 2019

Journal Mining, API Use, Better Conversation, and Apollo 11 Source

1. 73 Million Journal Articles for Text Mining (BoingBoing) -- The JNU Data Depot is a joint project between rogue archivist Carl Malamud, bioinformatician Andrew Lynn, and a research team from New Delhi's Jawaharlal Nehru University: together, they have assembled 73 million journal articles from 1847 to the present day and put them into an air-gapped repository that they're offering to noncommercial third parties who want to perform textual analysis on them to "pull out insights without actually reading the text."
2. How Developers Use API Documentation: An Observation Study (ACM) -- participants mapped cleanly onto two groups: opportunistic developers (risk-taking, paste-then-adapt, change-without-checking) and systematic developers (start with clean code, read the docs, learn before coding).
3. Talk -- An open source commenting platform focused on better conversation.
4. Apollo 11 -- Original Apollo 11 Guidance Computer (AGC) source code for the command and lunar modules.

### Monash University: Lecturers / Senior Lecturers – Digital Health, Image Analytics [Melbourne, Australia]

Monash is seeking two Lecturers / Senior Lecturers in the space of Digital Health - Image Analytics. Digital Health is a fascinating cross-university, cross-faculty, multidisciplinary space with enormous practical real-world and societal benefits that draws from and contributes to a multiplicity of research areas.

### Apple: Data Science Engineer [Austin, TX]

Seeking a customer-focused, passionate and driven Data Science Engineer with experience in building analytic tools and solutions.

### From Data Pre-processing to Optimizing a Regression Model Performance

All you need to know about data pre-processing, and how to build and optimize a regression model using the Backward Elimination method in Python.
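The linked post does not include its code here. As a rough sketch of the idea, the loop below implements backward elimination in plain NumPy: fit the full model, drop the predictor with the weakest evidence, and refit until every survivor is significant. Note the assumptions: the post presumably uses exact p-values (e.g. via statsmodels); this sketch uses a |t|-statistic cutoff of about 2 as a stand-in, and the function name, threshold, and toy data are illustrative, not taken from the post.

```python
import numpy as np

def backward_elimination(X, y, t_thresh=2.0):
    """Drop the predictor with the smallest |t|-statistic until all
    remaining predictors have |t| >= t_thresh (roughly p < 0.05).

    X: (n, p) design matrix without an intercept (one is added internally).
    Returns the indices of the columns that survive.
    """
    n = X.shape[0]
    keep = list(range(X.shape[1]))
    while keep:
        Xc = np.column_stack([np.ones(n), X[:, keep]])   # add intercept
        beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)    # OLS fit
        resid = y - Xc @ beta
        sigma2 = resid @ resid / (n - Xc.shape[1])       # residual variance
        cov = sigma2 * np.linalg.inv(Xc.T @ Xc)          # coefficient covariance
        t = beta / np.sqrt(np.diag(cov))                 # t-statistics
        t_pred = np.abs(t[1:])                           # skip the intercept
        weakest = int(np.argmin(t_pred))
        if t_pred[weakest] >= t_thresh:
            break                                        # all survivors significant
        keep.pop(weakest)                                # eliminate weakest predictor
    return keep

# Toy example: y depends on columns 0 and 2; column 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
print(backward_elimination(X, y))  # the informative columns 0 and 2 survive
```

In practice the same loop is usually driven by p-values from a full regression summary, which also handle small-sample degrees of freedom correctly; the fixed |t| cutoff here just keeps the sketch dependency-free.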