# My Data Science Blogs

## August 17, 2019

### If you did not already know

SUbgraph Robust REpresentAtion Learning (SURREAL)
The success of graph embeddings or node representation learning in a variety of downstream tasks, such as node classification, link prediction, and recommendation systems, has led to their popularity in recent years. Representation learning algorithms aim to preserve local and global network structure by identifying node neighborhood notions. However, many existing algorithms generate embeddings that fail to properly preserve the network structure, or lead to unstable representations due to random processes (e.g., random walks to generate context) and, thus, cannot generalize to multi-graph problems. In this paper, we propose a robust graph embedding algorithm based on connection subgraphs, entitled SURREAL, a novel, stable graph embedding algorithmic framework. SURREAL learns graph representations using connection subgraphs by employing the analogy of graphs with electrical circuits. It preserves both local and global connectivity patterns, and addresses the issue of high-degree nodes. Further, it exploits the strength of weak ties and meta-data that have been neglected by baselines. The experiments show that SURREAL outperforms state-of-the-art algorithms by up to 36.85% on the multi-label classification problem. Further, in contrast to baselines, SURREAL, being deterministic, is completely stable. …

KAMILA Clustering (KAMILA)
KAMILA clustering is a novel method for clustering mixed-type data in the spirit of k-means clustering. It does not require dummy coding of variables and is efficient enough to scale to rather large data sets. …
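
KAMILA itself is distributed as an R package; purely to illustrate what "clustering mixed-type data without dummy coding" can look like, here is a small k-prototypes-style sketch in Python. It is an illustration of the general idea only, not the KAMILA algorithm, and all names in it are made up for the example: numeric columns use squared Euclidean distance, categorical columns use simple matching, and the two are combined with a weight gamma.

```python
import numpy as np

def mixed_kmeans(X_num, X_cat, k, gamma=1.0, n_iter=25, seed=0):
    """Toy clustering of mixed-type data, k-prototypes style (NOT KAMILA):
    squared Euclidean distance on numeric columns plus gamma times the
    number of mismatching categorical columns."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_num), size=k, replace=False)
    centers_num = X_num[idx].astype(float)   # numeric centroids
    centers_cat = X_cat[idx].copy()          # categorical "modes"
    for _ in range(n_iter):
        d_num = ((X_num[:, None, :] - centers_num[None, :, :]) ** 2).sum(axis=-1)
        d_cat = (X_cat[:, None, :] != centers_cat[None, :, :]).sum(axis=-1)
        labels = np.argmin(d_num + gamma * d_cat, axis=1)
        for j in range(k):
            members = labels == j
            if members.any():
                centers_num[j] = X_num[members].mean(axis=0)
                for c in range(X_cat.shape[1]):   # per-column mode
                    vals, counts = np.unique(X_cat[members, c], return_counts=True)
                    centers_cat[j, c] = vals[np.argmax(counts)]
    return labels

# Tiny synthetic example: two numeric features and one categorical feature.
rng = np.random.default_rng(1)
X_num = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
X_cat = np.array([["a"]] * 50 + [["b"]] * 50)
print(mixed_kmeans(X_num, X_cat, k=2))
```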

Shake-Shake Regularization
The method introduced in this paper aims at helping deep learning practitioners faced with an overfit problem. The idea is to replace, in a multi-branch network, the standard summation of parallel branches with a stochastic affine combination. Applied to 3-branch residual networks, shake-shake regularization improves on the best single shot published results on CIFAR-10 and CIFAR-100 by reaching test errors of 2.86% and 15.85%. Experiments on architectures without skip connections or Batch Normalization show encouraging results and open the door to a large set of applications. Code is available at https://…/shake-shake.
Review: Shake-Shake Regularization (Image Classification)
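
As a rough illustration of the idea, here is a plain NumPy sketch (not the authors' code) of the forward pass of a shake-shake-style residual block: the usual sum of two parallel branches is replaced by a random convex combination during training and an even 0.5/0.5 average at test time. The paper additionally draws an independent random coefficient for the backward pass, which a full autograd implementation would handle; the branches here are stand-in linear maps used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stand-in parallel branches of a residual block (convolutional stacks in the paper).
W1 = rng.normal(size=(16, 16)) * 0.1
W2 = rng.normal(size=(16, 16)) * 0.1

def branch1(x):
    return np.maximum(W1 @ x, 0.0)

def branch2(x):
    return np.maximum(W2 @ x, 0.0)

def shake_shake_block(x, training):
    """Mix the two branches with a random convex combination at training time
    and a deterministic 0.5/0.5 average at test time."""
    alpha = rng.uniform() if training else 0.5
    return x + alpha * branch1(x) + (1.0 - alpha) * branch2(x)

x = rng.normal(size=16)
print(shake_shake_block(x, training=True)[:4])   # stochastic combination
print(shake_shake_block(x, training=False)[:4])  # deterministic average
```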

Multi-Layer Fast ISTA (ML-FISTA)
Parsimonious representations in data modeling are ubiquitous and central for processing information. Motivated by the recent Multi-Layer Convolutional Sparse Coding (ML-CSC) model, we herein generalize the traditional Basis Pursuit regression problem to a multi-layer setting, introducing similar sparse enforcing penalties at different representation layers in a symbiotic relation between synthesis and analysis sparse priors. We propose and analyze different iterative algorithms to solve this new problem in practice. We prove that the presented multi-layer Iterative Soft Thresholding (ML-ISTA) and multi-layer Fast ISTA (ML-FISTA) converge to the global optimum of our multi-layer formulation at a rate of $\mathcal{O}(1/k)$ and $\mathcal{O}(1/k^2)$, respectively. We further show how these algorithms effectively implement particular recurrent neural networks that generalize feed-forward architectures without any increase in the number of parameters. We demonstrate the different architectures resulting from unfolding the iterations of the proposed multi-layer pursuit algorithms, providing a principled way to construct deep recurrent CNNs from feed-forward ones. We demonstrate the emerging constructions by training them in an end-to-end manner, consistently improving the performance of classical networks without introducing extra filters or parameters. …
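
For readers who have not met the single-layer algorithms this builds on, below is a small self-contained NumPy sketch of classical ISTA and FISTA for the ordinary Basis Pursuit / lasso problem $\min_x \tfrac{1}{2}\|Ax-b\|_2^2 + \lambda\|x\|_1$ (the single-layer special case, not the multi-layer ML-ISTA/ML-FISTA of the paper). The step size is $1/L$ with $L$ the Lipschitz constant of the gradient, and FISTA adds the usual momentum step that improves the rate from $\mathcal{O}(1/k)$ to $\mathcal{O}(1/k^2)$.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, n_iter=500):
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, lam / L)
    return x

def fista(A, b, lam, n_iter=500):
    L = np.linalg.norm(A, 2) ** 2
    x = z = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ z - b)
        x_new = soft_threshold(z - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum / extrapolation step
        x, t = x_new, t_new
    return x

# Tiny synthetic example: recover a sparse vector from noisy measurements.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 200))
x_true = np.zeros(200)
x_true[:5] = rng.normal(size=5)
b = A @ x_true + 0.01 * rng.normal(size=50)
print(np.linalg.norm(fista(A, b, lam=0.1) - x_true))
```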

### Fresh from the Python Package Index

exceldriver
Tool for automating excel actions on Windows

experimentlogger-jathr
Simple logger for Machine Learning experiments

hip-data-tools
Common utility functions for data engineering usecases

hmmkay
Discrete Hidden Markov Models with Numba

liepa-tts
Python bindings for Lithuanian language synthesizer from LIEPA project

matplot

matrix-operations
various matrix operations

A fully formatted dictionary for everything you’ll ever need.

mothnet
Neural network modeled after the olfactory system of the hawkmoth

PVPolyfit
A high-resolution multiple linear regression algorithm used to analyze PV output with a few inputs

pyatlasclient
Apache Atlas Python Client

rdilmarkdown
Markdown in Python.

semvecpy
Semantic Vectors work in Python

skymind-pipelines
pipelines: Deploy your machine learning experiments with Skymind Pipelines

TensorFI-BinaryFI
A Binary fault injection tool for TensorFlow-based program

text-normalization
A text normalization package

torch-kerosene
Pytorch Framework For Medical Image Analysis

veho
A helper for working with 1d & 2d array.

### Magister Dixit

“… Note that playing back data from Hadoop into SAP Sybase ESP can occur much faster than in real time, …” SAP ( 2013 )

### Document worth reading: “Why Machines Cannot Learn Mathematics, Yet”

Nowadays, Machine Learning (ML) is seen as the universal solution to improve the effectiveness of information retrieval (IR) methods. However, while mathematics is a precise and accurate science, it is usually expressed by less accurate and imprecise descriptions, contributing to the relative dearth of machine learning applications for IR in this domain. Generally, mathematical documents communicate their knowledge with an ambiguous, context-dependent, and non-formal language. Given recent advances in ML, it seems canonical to apply ML techniques to represent and retrieve mathematics semantically. In this work, we apply popular text embedding techniques to the arXiv collection of STEM documents and explore how these are unable to properly understand mathematics from that corpus. In addition, we also investigate the missing aspects that would allow mathematics to be learned by computers. Why Machines Cannot Learn Mathematics, Yet

### Modern R with the tidyverse is available on Leanpub


Yesterday I released an ebook on Leanpub,
called Modern R with the tidyverse, which you can also read for free online.

In this blog post, I want to give some context.

Modern R with the tidyverse is the second ebook I have released on Leanpub. I released the first one,
called Functional programming and unit testing for data munging with R, a while back (see here).
At the time, I had just moved back to my home country of Luxembourg and started a new job as a
research assistant at the national statistical institute.
Since then, a lot has happened: I changed jobs and joined PwC Luxembourg as a data scientist,
was promoted to manager, finished my PhD and, most important of all, became a father.

Through all this, I continued blogging and working on a new ebook, called Modern R with the tidyverse.
At first, this was supposed to be a separate book from the first one, but as I continued writing,
I realized that updating and finishing the first one would take a lot of effort, and also that
it wouldn't make much sense to keep the two separate. So I decided to merge the content from the
first ebook with the second and update everything in one go.

My very first notes were around 50 pages, if memory serves, and I used them to teach R at the
University of Strasbourg while I was employed there as a research and teaching assistant working
on my PhD. These notes were the basis of Functional programming and unit testing for data munging with R
and now of Modern R. Chapter 2 of Modern R is almost a simple copy and paste from these notes
(with more sections added). These notes were first written around 2012-2013.

Modern R is the kind of text I would have liked to have when I first started playing around with R,
sometime around 2009-2010. It starts from the beginning, but also goes into quite a lot of detail in the
later chapters. For instance, the section on
modeling with functional programming
is quite advanced, but I believe that readers who have read through the whole book and reached that part
will be armed with all the knowledge needed to follow it. At least, this is my hope.

Now, the book is still not finished. Two chapters are missing, but it should not take me long to
finish them, as I already have drafts lying around. However, some exercises might still be in the
wrong places, and more are needed. The book also needs more polishing in general.

As written in the first paragraph of this section, the book is available on
Leanpub. Unlike my previous ebook, this one costs money: a minimum price of 4.99$ and a
recommended price of 14.99$, but as mentioned you can read it for free online. I've hesitated
to give it a minimum price of …
• International candidates will receive a tax-free stipend of $30,000 (AUD) per annum for up to 3 years to support living costs. Those with a strong track record will be eligible for a tuition fee waiver.
• Support for conference attendance, fieldwork and additional costs as approved by the School.

International candidates are required to hold an Overseas Student Health Care (OSHC) insurance policy for the duration of their study in Australia. This cost is not covered by the scholarship.

### How (Not) To Scale Deep Learning in 6 Easy Steps

## Introduction: The Problem

Deep learning sometimes seems like sorcery. Its state-of-the-art applications are at times delightful and at times disturbing. The tools that achieve these results are, amazingly, mostly open source, and can work their magic on powerful hardware available to rent by the hour in the cloud. It's no wonder that companies are eager to apply deep learning to more prosaic business problems like better churn prediction, image curation, chatbots, time series analysis and more. Just because the tools are readily available doesn't mean they're easy to use well. Even choosing the right architecture, layers and activations is more art than science.

This blog won't examine how to tune a deep learning architecture for accuracy. That process does, however, require training lots of models in a process of trial and error. This leads to a more immediate issue: scaling up the performance of deep learning training. Tuning deep learning training doesn't work like tuning an ETL job. It requires a large amount of compute from specialized hardware, and everyone eventually finds deep learning training 'too slow'.

Too often, when trying to scale up, users reach for solutions that may be overkill, expensive and no faster, while overlooking some basic errors that hurt performance. This blog will instead walk through basic steps to avoid common performance pitfalls in training, and then the right steps, in order, to scale up by applying more complex tooling and more hardware. Hopefully, you will find your modeling job can move along much faster without reaching immediately for a cluster of extra GPUs.

## A Simple Classification Task

Because the focus here is not on the learning problem per se, the following examples use a simple data set and problem to solve: classifying each of the roughly 30,000 images in the Caltech 256 dataset into one of 257 (yes, 257) categories. The data consists of JPEG files. These need to be resized to common dimensions, 299×299, to match the pre-trained base layer described below. The images are then written to Parquet files with labels to facilitate larger-scale training, described later. This can be accomplished with the 'binary' files data source in Apache Spark.
See the accompanying notebook for full source code, but these are the highlights: img_size = 299 def scale_image(image_bytes): image = Image.open(io.BytesIO(image_bytes)).convert('RGB') image.thumbnail((img_size, img_size), Image.ANTIALIAS) x, y = image.size with_bg = Image.new('RGB', (img_size, img_size), (255, 255, 255)) with_bg.paste(image, box=((img_size - x) // 2, (img_size - y) // 2)) return with_bg.tobytes() ... raw_image_df = spark.read.format("binaryFile").\ option("pathGlobFilter", "*.jpg").option("recursiveFileLookup", "true").\ load(caltech_256_path).repartition(64) image_df = raw_image_df.select( file_to_label_udf("path").alias("label"), scale_image_udf("content").alias("image")).cache() (train_image_df, test_image_df) = image_df.randomSplit([0.9, 0.1], seed=42) ... train_image_df.write.option("parquet.block.size", 1024 * 1024).\ parquet(table_path_base + "train") test_image_df.write.option("parquet.block.size", 1024 * 1024).\ parquet(table_path_base + "test")  It’s also possible to use Spark’s built-in ‘image’ data source type to read these as well. Keras, the popular high-level front end for Tensorflow, can describe a straightforward deep learning model to classify the images. There’s no need to build an image classifier from scratch. Instead, this example reuses the pretrained Xception model built into Keras and adds a dense layer on top to classify. (Note that this example uses Keras as included with Tensorflow 1.13.1, in tensorflow.keras, rather than standalone Keras 2.2.4). The pretrained layers themselves will not be trained further. Take that as step #0: use transfer learning and pretrained models when working with images! ### Step #1: Use a GPU Almost the only situation where it makes sense to train a deep learning model on a CPU is when there are no GPUs available. When working in the cloud, on a platform like Databricks, it’s trivial to provision a machine with a GPU with all the drivers and libraries ready. This example will jump straight into training this model on a single K80 GPU. This first pass will just load a 10% sample of the data from Parquet as a pandas DataFrame, reshape the image data, and train in memory on 90% of that sample. Here, training just runs for 60 epochs on a small batch size. Small side tip: when using a pretrained network, it’s essential to normalize the image values to the range the network expects. Here, that’s [-1,1], and Keras provides a preprocess_input function to do this. (Note: to run this example on Databricks, select the 5.5 ML Runtime or later with GPU support, and choose a driver instance type with a single GPU. Because the example also uses Spark, you will have to also provision 1 worker.) df_pd = spark.read.parquet("...").sample(0.1, seed=42).toPandas() X_raw = df_pd["image"].values X = np.array( [preprocess_input( np.frombuffer(X_raw[i], dtype=np.uint8).reshape((img_size,img_size,3))) for i in range(len(X_raw))]) y = df_pd["label"].values - 1 # -1 because labels are 1-based X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42) ... 
def build_model(dropout=None): model = Sequential() xception = Xception(include_top=False, input_shape=(img_size,img_size,3), pooling='avg') for layer in xception.layers: layer.trainable = False model.add(xception) if dropout: model.add(Dropout(dropout)) model.add(Dense(257, activation='softmax')) return model model = build_model() model.compile(optimizer=Nadam(lr=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy']) model.fit(X_train, y_train, batch_size=2, epochs=60, verbose=2) model.evaluate(X_test, y_test) ... Epoch 58/60 - 65s - loss: 0.2787 - acc: 0.9280 Epoch 59/60 - 65s - loss: 0.3425 - acc: 0.9106 Epoch 60/60 - 65s - loss: 0.3525 - acc: 0.9173 ... [1.913768016828665, 0.7597173]  The results look good — 91.7% accuracy! However, there’s an important flaw. The final evaluation on the held-out 10% validation data shows that true accuracy is more like 76%. Actually, the model has overfitted. That’s not good, but worse, it means that most of the time it spent training was spent making it a little worse. It should have ended when accuracy on the validation data stopped decreasing. Not only would that have left a better model, it would have completed faster. ### Step #2: Use Early Stopping Keras (and other frameworks) have built-in support for stopping when further training appears to be making the model worse. In Keras, it’s the EarlyStopping callback. Using it means passing the validation data to the training process for evaluation on every epoch. Training will stop after several epochs have passed with no improvement. restore_best_weights=True ensures that the final model’s weights are from its best epoch, not just the last one. This should be your default. ... early_stopping = EarlyStopping(patience=3, monitor='val_acc', min_delta=0.001, restore_best_weights=True, verbose=1) model.fit(X_train, y_train, batch_size=2, epochs=60, verbose=2, validation_data=(X_test, y_test), callbacks=[early_stopping]) model.evaluate(X_test, y_test) ... Epoch 12/60 - 74s - loss: 0.9468 - acc: 0.7689 - val_loss: 1.2728 - val_acc: 0.7597 Epoch 13/60 - 75s - loss: 0.8886 - acc: 0.7795 - val_loss: 1.4035 - val_acc: 0.7456 Epoch 14/60 Restoring model weights from the end of the best epoch. - 80s - loss: 0.8391 - acc: 0.7870 - val_loss: 1.4467 - val_acc: 0.7420 Epoch 00014: early stopping ... [1.3035458562230895, 0.7597173]  Now, training stops in 14 epochs, not 60, and 18 minutes. Each epoch took a little longer (75s vs 65s) because of the evaluation of the validation data. Accuracy is better too, at 76.7%. With early stopping, note that the number of epochs passed to fit() only matters as a limit on the maximum number of epochs that will run. It can be set to a large value. This is the first a couple observations here that suggest the same thing: epochs don’t really matter as a unit of training. They’re just a number of batches of data that constitute the whole input to training. But training means passing over the data in batches repeatedly until the model is trained enough. How many epochs that represents isn’t directly important. An epoch is still useful as a point of comparison for time taken to train per amount of data though. ### Step #3: Max Out GPU with Larger Batch Sizes In Databricks, cluster metrics are exposed through a Ganglia-based UI. This shows GPU utilization during training. Monitoring utilization is important to tuning as it can suggest bottlenecks. Here, the GPU is pretty well used at about 90%: 100% is cooler than 90%. 
The batch size of 2 is small, and isn’t keeping the GPU busy enough during processing. Increasing the batch size would increase that utilization. The goal isn’t only to make the GPU busier, but to benefit from the extra work. Bigger batches improve how well each batch updates the model (up to a point) with more accurate gradients. That in turn can allow training to use a higher learning rate, and more quickly reach the point where the model stops improving. Or, with extra capacity, it’s possible to add complexity to the network architecture itself to take advantage of that. This example doesn’t intend to explore tuning the architecture, but will try adding some dropout to decrease this network’s tendency to overfit. model = build_model(dropout=0.5) model.compile(optimizer=Nadam(lr=0.004), loss='sparse_categorical_crossentropy', metrics=['accuracy']) model.fit(X_train, y_train, batch_size=16, epochs=30, verbose=2, validation_data=(X_test, y_test), callbacks=[early_stopping]) … Epoch 6/30 - 56s - loss: 0.1487 - acc: 0.9583 - val_loss: 1.1105 - val_acc: 0.7633 Epoch 7/30 - 56s - loss: 0.1022 - acc: 0.9717 - val_loss: 1.2128 - val_acc: 0.7456 Epoch 8/30 - 56s - loss: 0.0853 - acc: 0.9744 - val_loss: 1.2004 - val_acc: 0.7597 Epoch 9/30 Restoring model weights from the end of the best epoch. - 62s - loss: 0.0811 - acc: 0.9815 - val_loss: 1.2424 - val_acc: 0.7350 Epoch 00009: early stopping  With a larger batch size of 16 instead of 2, and learning rate of 0.004 instead of 0.001, the GPU crunches through epochs in under 60s instead of 75s. The model reaches about the same accuracy (76.3%) in only 9 epochs. Total train time was just 9 minutes, much better than 65. It’s all too easy to increase the learning rate too far, in which case training accuracy will be poor and stay poor. When increasing the batch size by 8x, it’s typically advisable to increase learning rate by at most 8x. Some research suggests that when the batch size increases by N, the learning rate can scale by about sqrt(N). Note that there is some randomness inherent in the training process, as inputs are shuffled by Keras. Accuracy fluctuates mostly up but sometimes down over time, and coupled with early stopping, training might terminate earlier or later depending on the order the data is encountered. To even this out, the ‘patience’ of EarlyStopping can be increased at the cost of extra training at the end. ### Step #4: Use Petastorm and /dbfs/ml to Access Large Data Training above used just a 10% sample of the data, and the tips above helped bring training time down by adopting a few best practices. The next step, of course, is to train on all of the data. This should help achieve higher accuracy, but means more data will have to be processed too. The full data set is many gigabytes, which could still fit in memory, but for purposes here, let’s pretend it wouldn’t. Data needs to be loaded efficiently in chunks into memory during training with a different approach. Fortunately, the Petastorm library from Uber is designed to feed Parquet-based data into Tensorflow (or Keras) training in this way. It can be applied by adapting the preprocessing and training code to create Tensorflow Datasets, rather than pandas DataFrames, for training. Datasets here act like infinite iterators over the data, which means steps_per_epoch is now defined to specify how many batches make an epoch. This underscores how an ‘epoch’ is somewhat arbitrary. 
It’s also common to checkpoint model training progress in long-running training jobs, to recover from failures during training. This is also added as a callback. (Note: To run this example, attach the petastorm library to your cluster.) path_base = "/dbfs/.../" checkpoint_path = path_base + "checkpoint" table_path_base = path_base + "caltech_256_image/" table_path_base_file = "file:" + table_path_base train_size = spark.read.parquet(table_path_base_file + "train").count() test_size = spark.read.parquet(table_path_base_file + "test").count() # Workaround for Arrow issue: underscore_files = [f for f in (os.listdir(table_path_base + "train") + os.listdir(table_path_base + "test")) if f.startswith("_")] pq.EXCLUDED_PARQUET_PATHS.update(underscore_files) img_size = 299 def transform_reader(reader, batch_size): def transform_input(x): img_bytes = tf.reshape(decode_raw(x.image, tf.uint8), (-1,img_size,img_size,3)) inputs = preprocess_input(tf.cast(img_bytes, tf.float32)) outputs = x.label - 1 return (inputs, outputs) return make_petastorm_dataset(reader).map(transform_input).\ apply(unbatch()).shuffle(400, seed=42).\ batch(batch_size, drop_remainder=True)  The method above reimplements some of the preprocessing from earlier code in terms of Tensorflow’s transformation APIs. Note that Petastorm produces Datasets that deliver data in batches that depends entirely on the Parquet files’ row group size. To control the batch size for training, it’s necessary to use Tensorflow’s unbatch() and batch() operations to re-batch the data into the right size. Also, note the small workaround that’s currently necessary to avoid a problem in reading Parquet files via Arrow in Petastorm. batch_size = 16 with make_batch_reader(table_path_base_file + "train", num_epochs=None) as train_reader: with make_batch_reader(table_path_base_file + "test", num_epochs=None) as test_reader: train_dataset = transform_reader(train_reader, batch_size) test_dataset = transform_reader(test_reader, batch_size) model = build_model(dropout=0.5) model.compile(optimizer=Nadam(lr=0.004), loss='sparse_categorical_crossentropy', metrics=['acc']) early_stopping = EarlyStopping(patience=3, monitor='val_acc', min_delta=0.001, restore_best_weights=True, verbose=1) # Note: you must set save_weights_only=True to avoid problems with hdf5 files and /dbfs/ml checkpoint = ModelCheckpoint(checkpoint_path + "/checkpoint-{epoch}.ckpt", save_weights_only=True, verbose=1) model.fit(train_dataset, epochs=30, steps_per_epoch=(train_size // batch_size), validation_data=test_dataset, validation_steps=(test_size // batch_size), verbose=2, callbacks=[early_stopping, checkpoint])  More asides: for technical reasons, currently ModelCheckpoint must set save_weights_only=True when using /dbfs. It also appears necessary to use different checkpoint paths per epoch; use a path pattern that includes {epoch}. Now run: Epoch 8/30 Epoch 00008: saving model to /dbfs/tmp/sean.owen/binary/checkpoint/checkpoint-8.ckpt - 682s - loss: 1.0154 - acc: 0.8336 - val_loss: 1.2391 - val_acc: 0.8301 Epoch 9/30 Epoch 00009: saving model to /dbfs/tmp/sean.owen/binary/checkpoint/checkpoint-9.ckpt. - 684s - loss: 1.0048 - acc: 0.8397 - val_loss: 1.2900 - val_acc: 0.8275 Epoch 10/30 Epoch 00010: saving model to /dbfs/tmp/sean.owen/binary/checkpoint/checkpoint-10.ckpt - 689s - loss: 1.0033 - acc: 0.8422 - val_loss: 1.3706 - val_acc: 0.8225 Epoch 11/30 Restoring model weights from the end of the best epoch. 
Epoch 00011: saving model to /dbfs/tmp/sean.owen/binary/checkpoint/checkpoint-11.ckpt - 687s - loss: 0.9800 - acc: 0.8503 - val_loss: 1.3837 - val_acc: 0.8225 Epoch 00011: early stopping  Epoch times are almost 11x longer, but recall that an epoch here is now a full pass over the training data, not a 10% sample. The extra overhead comes from the I/O in reading data from Parquet in cloud storage, and writing checkpoint files. The GPU utilization graph manifests this in “spiky” utilization of the GPU: The upside? Accuracy is significantly better at 83%. The cost was much longer training time: 126 minutes instead of 9. For many applications, this could be well worth it. Databricks provides an optimized implementation of the file system mount that makes the Parquet files appear as local files to training. Accessing them via /dbfs/ml/… instead of /dbfs/… can improve I/O performance. Also, Petastorm itself can cache data on local disks to avoid re-reading data from cloud storage. path_base = "/dbfs/ml/..." checkpoint_path = path_base + "checkpoint" table_path_base = path_base + "caltech_256_image/" table_path_base_file = "file:" + table_path_base def make_caching_reader(suffix, cur_shard=None, shard_count=None): return make_batch_reader(table_path_base_file + suffix, num_epochs=None, cur_shard=cur_shard, shard_count=shard_count, cache_type='local-disk', cache_location="/tmp/" + suffix, cache_size_limit=20000000000, cache_row_size_estimate=img_size * img_size * 3)  The rest of the code is as above, just using make_caching_reader in place of make_reader. Epoch 6/30 Epoch 00006: saving model to /dbfs/ml/tmp/sean.owen/binary/checkpoint/checkpoint-6.ckpt - 638s - loss: 1.0221 - acc: 0.8252 - val_loss: 1.1612 - val_acc: 0.8285 ... Epoch 00009: early stopping  The training time decreased from about 126 minutes to 96 minutes for roughly the same result. That’s still more than 10x the runtime for 10x the data, but not bad for a 7% increase in accuracy. ### Step #5: Use Multiple GPUs Still want to go faster, and have some budget? It’s easy to try a bigger GPU like a V100 and retune appropriately. However, at some point, scaling up means multiple GPUs. Instances with, for example, eight K80 GPUs are readily available in the cloud. Keras provides a simple utility function called multi_gpu_model that can parallelize training across multiple GPUs. It’s just a one-line code change: num_gpus = 8 ... model = multi_gpu_model(model, gpus=num_gpus)  (Note: to run this example, choose a driver instance type with 8 GPUs.) The modification was easy, but, to cut to the chase without repeating the training output: per-epoch time becomes 270s instead of 630s. That’s not 8x faster, not even 3x faster. Each of the 8 GPUs is only processing 1/8th of each batch of 16 inputs, so each is again effectively processing just 2 per batch. As above, it’s possible to increase the batch size by 8x to compensate, to 256, and further increase the learning rate to 0.016. (See the accompanying notebook for full code listings.) It reveals that training is faster, at 135s per epoch. The speedup is better, but still not 8x. Accuracy is steady at around 83%, so this still progresses towards faster training. The Keras implementation is simple, but not optimal. GPU utilization remains spiky because the GPUs idle while Keras combines partial gradients in a straightforward but slow way. 
Horovod is another project from Uber that helps scale deep learning training across not just multiple GPUs on one machine, but GPUs across many machines, and with great efficiency. While it’s often associated with training across multiple machines, that’s not actually the next step in scaling up. It can help this current multi-GPU setup. All else equal, it’ll be more efficient to utilize 8 GPUs connected to the same VM than spread across the network. It requires a different modification to the code, which uses the HorovodRunner utility from Databricks to integrate Horovod with Spark: batch_size = 32 num_gpus = 8 def train_hvd(): hvd.init() config = tf.ConfigProto() config.gpu_options.allow_growth = True config.gpu_options.visible_device_list = str(hvd.local_rank()) K.set_session(tf.Session(config=config)) pq.EXCLUDED_PARQUET_PATHS.update(underscore_files) with make_caching_reader("train", cur_shard=hvd.rank(), shard_count=hvd.size()) as train_reader: with make_caching_reader("test", cur_shard=hvd.rank(), shard_count=hvd.size()) as test_reader: train_dataset = transform_reader(train_reader, batch_size) test_dataset = transform_reader(test_reader, batch_size) model = build_model(dropout=0.5) optimizer = Nadam(lr=0.016) optimizer = hvd.DistributedOptimizer(optimizer) model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['acc']) callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0), hvd.callbacks.MetricAverageCallback(), EarlyStopping(patience=3, monitor='val_acc', min_delta=0.001, restore_best_weights=True, verbose=(1 if hvd.rank() == 0 else 0))] if hvd.rank() == 0: callbacks.append(ModelCheckpoint( checkpoint_path + "/checkpoint-{epoch}.ckpt", save_weights_only=True, verbose=1)) model.fit(train_dataset, epochs=30, steps_per_epoch=(train_size // (batch_size * num_gpus)), validation_data=test_dataset, validation_steps=(test_size // (batch_size * num_gpus)), verbose=(2 if hvd.rank() == 0 else 0), callbacks=callbacks) hr = HorovodRunner(np=-num_gpus) hr.run(train_hvd)  Again a few notes: • The Arrow workaround must be repeated in the Horovod training function • Use hvd.callbacks.MetricAverageCallback to correctly average validation metrics • Make sure to only run checkpoint callbacks on one worker (rank 0) • Set HorovodRunner’s np= argument to minus the number of GPUs to use, when local • Batch size here is now per GPU, not overall. Note the different computation in steps_per_epoch The output from the training is, well, noisy and so won’t be copied here in full. Total training time has come down to about 12.6 minutes, from 96, or almost 7.6x, which is satisfyingly close to the maximum possible 8x speedup! Accuracy is up to 83.5%. Compare to 9 minutes and 76% accuracy on one GPU. ### Step #6: Use Horovod Across Multiple Machines Sometimes, 8 or even 16 GPUs just isn’t enough, and that’s the most you can get on one machine today. Or, sometimes it can be cheaper to provision GPUs across many smaller machines to take advantage of varying prices per machine type in the cloud. The same Horovod example above can run on a cluster of 8 1-GPU machines instead of 1 8-GPU machine with just a single line of change. HorovodRunner manages the distributed work of Horovod on the Spark cluster by using Spark 2.4’s barrier mode support. num_gpus = 8 ... hr = HorovodRunner(np=num_gpus)  (Note: to run this example, provision a cluster with 8 workers, each with 1 GPU.) The only change is to specify 8, rather than -8, to select 8 GPUs on the cluster rather than on the driver. 
GPU utilization is pleasingly full across 8 machines’ GPUs (the idle one is the driver, which does not participate in the training): Accuracy is again about the same as expected, at 83.6%. Total run time is almost 17 minutes rather than 12.6, which reflects the overhead of coordinating GPUs across machines. This overhead could be worthwhile in some cases for cost purposes, and is simply a necessary evil if a training job has to scale past 16 GPUs. Where possible, allocating all the GPUs on one machine is faster though. For a problem of this moderate size, it probably won’t be possible to usefully exploit more GPU resources. Keeping them busy would mean larger learning rates and the learning rate is already about as high as it can go. For this network, a few K80 GPUs may be the right maximum amount of resource to deploy. Of course, there are much larger networks and datasets out there! ## Conclusion Deep learning is powerful magic, but we always want it to go faster. It scales in different ways though. There are new best practices and pitfalls to know when setting out to train a model. A few of these helped the small image classification problem here improve accuracy slightly while reducing runtime 7x. The first steps in scaling aren’t more resources, but looking for easy optimizations. Scaling to train on an entire large data set in the cloud requires some new tools, but not necessarily more GPUs at first. With careful use of Petastorm and /dbfs/ml, 10x the data helped achieve 82.7% accuracy is not much more than 10x the time on the same hardware. The next step of scaling up means utilizing multiple GPUs with tools like Horovod, but doesn’t mean a cluster of machines necessarily, unlike in ETL jobs where a cluster of machines is the norm. A single 8 GPU instance allowed training to finish almost 8x faster and achieve over 83% accuracy. Only for the largest problems are multiple GPU instances necessary, but Horovod can help scale even there without much overhead. -- Try Databricks for free. Get started today. The post How (Not) To Scale Deep Learning in 6 Easy Steps appeared first on Databricks. Continue Reading… ### Magister Dixit “The hope is that if we can start building the right models to find the right patterns using the right data, then maybe we can start making progress on some of these complicated systems.” Eric Jonas Continue Reading… ### Book Memo: “Centrality and Diversity in Search”  Roles in A.I., Machine Learning, Social Networks, and Pattern Recognition The concepts of centrality and diversity are highly important in search algorithms, and play central roles in applications of artificial intelligence (AI), machine learning (ML), social networks, and pattern recognition. This work examines the significance of centrality and diversity in representation, regression, ranking, clustering, optimization, and classification. The text is designed to be accessible to a broad readership. Requiring only a basic background in undergraduate-level mathematics, the work is suitable for senior undergraduate and graduate students, as well as researchers working in machine learning, data mining, social networks, and pattern recognition. Continue Reading… ### Really large numbers in R [This article was first published on R – Open Source Automation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. 
This post will discuss ways of handling huge numbers in R using the gmp package. ## The gmp package The gmp package provides us a way of dealing with really large numbers in R. For example, let’s suppose we want to multiple 10250 by itself. Mathematically we know the result should be 10500. But if we try this calculation in base R we get Inf for infinity. num = 10^250 num^2 # Inf  However, we can get around this using the gmp package. Here, we can convert the integer 10 to an object of the bigz class. This is an implementation that allows us to handle very large numbers. Once we convert an integer to a bigz object, we can use it to perform calculations with regular numbers in R (there’s a small caveat coming). library(gmp) num = as.bigz(10) (num^250) * (num^250) # or directly 10^500 num^500  One note that we need to be careful about is what numbers we use to convert to bigz objects. In the example above, we convert the integer 10 to bigz. This works fine for our calculations because 10 is not a very large number in itself. However, let’s suppose we had converted 10250 to a bigz object instead. If we do this, the number 10250 becomes a double data type, which causes a loss in precision for such a number. Thus the result we see below isn’t really 10250: num = 10^250 as.bigz(num) num  A way around this is to input the number we want as a character into as.bigz. For example, we know that 10250 is the number 1 followed by 250 zeros. We can create a character that represents this number like below: num = paste0("1", paste(rep("0", 250), collapse = ""))  Thus, we can use this idea to create bigz objects: as.bigz(num)  In case you run into issues with the above line returning an NA value, you might want to try turning scientific notation off. You can do that using the base options command. options(scipen = 999)  If scientific notation is not turned off, you may have cases where the character version of the number looks like below, which results in an NA being returned by as.bigz. “1e250” In general, numbers can be input to gmp functions as characters to avoid this or other precision issues. ## Finding the next prime The gmp package can find the first prime larger than an input number using the nextprime function. num = "100000000000000000000000000000000000000000000000000" nextprime(num)  ## Find the GCD of two huge numbers We can find the GCD of two large numbers using the gcd function: num = "2452345345234123123178" num2 = "23459023850983290589042" gcd(num, num2) # returns 2  ## Factoring numbers into primes gmp also provides a way to factor numbers into primes. We can do this using the factorize function. num = "2452345345234123123178" factorize(num)  ## Matrices of large numbers gmp also supports creating matrices with bigz objects. num1 <- "1000000000000000000000000000" num2 <- "10000000000000000000000000000000" num3 <- "100000000000000000000000000000000000000" num4 <- "100000000000000000000000000000000000000000000000" nums <- c(as.bigz(num1), as.bigz(num2), as.bigz(num3), as.bigz(num4)) matrix(nums, nrow = 2)  We can also perform typical operations with our matrix, like find its inverse, using base R functions: solve(m)  ## Sampling random (large) numbers uniformly We can sample large numbers from a discrete uniform distribution using the urand.bigz function. urand.bigz(nb = 100, size = 5000, seed = 0)  The nb parameter represents how many integers we want to sample. Thus, in this example, we’ll get 100 integers returned. 
size = 5000 tells the function to sample the integers from the inclusive range of 0 to 2^5000 - 1. In general you can sample from the range 0 to 2^size - 1.

To learn more about gmp, click here for its vignette. If you enjoyed this post, click here to follow my blog on Twitter.

The post Really large numbers in R appeared first on Open Source Automation.

### Converting lines in an svg image to csv

[This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers].

During a search for data on programming language usage I discovered Stack Overflow Trends, showing an interesting plot of language tags appearing on Stack Overflow questions (see below). Where was the csv file for these numbers? Somebody had asked this question last year, but there were no answers.

The graphic is in svg format; has anybody written an svg to csv conversion tool? I could only find conversion tools for specialist uses, e.g., geographical data processing. The svg file format is all xml, and using a text editor I could see the numbers I was after. How hard could it be (it had to be easier than a png heatmap)?

Extracting the x/y coordinates of the line segments for each language turned out to be straightforward (after some trial and error). The svg generation process made matching language to line trivial; the language name was included as an xml attribute. Programmatically extracting the x/y axis information exhausted my patience, and I hard-coded the numbers (code+data). The process involves walking an xml structure and R's list processing, two pet hates of mine (the data is for a book that uses R, so I try to do everything data related in R).

I used R's xml2 package to read the svg files. Perhaps if my mind had a better fit to xml and R lists, I would have been able to do everything using just the functions in this package. My aim was always to get far enough down to convert the subtree to a data frame.

Extracting data from graphs represented in svg files is so easy (says he). Where is the wonderful conversion tool that my search failed to locate? Pointers welcome.

### R Packages worth a look

Extension to 'spatstat' for Large Datasets on a Linear Network (spatstat.Knet)
Extension to the 'spatstat' package, for analysing large datasets of spatial points on a network. Provides a memory-efficient algorithm for computing the geometrically-corrected K function, described in S. Rakshit, A. Baddeley and G.
Nair (2019) <doi:10.18637/jss.v090.i01>.

Create Trusted Timestamps of Datasets and Files (trustedtimestamping)
Trusted Timestamps (tts) are created by incorporating a hash of a file or dataset into a transaction on the decentralized blockchain (Stellar network). The package makes use of a free service provided by <https://stellarapi.io>.

Manage the Life Cycle of your Package Functions (lifecycle)
Manage the life cycle of your exported functions with shared conventions, documentation badges, and non-invasive deprecation warnings. The 'lifecycle' package defines four development stages (experimental, maturing, stable, and questioning) and three deprecation stages (soft-deprecated, deprecated, and defunct). It makes it easy to insert badges corresponding to these stages in your documentation. Usage of deprecated functions is signalled with increasing levels of non-invasive verbosity.

Markov Random Field Structure Estimator (mrfse)
A Markov random field structure estimator that uses a penalized maximum conditional likelihood method similar to the Bayesian Information Criterion (Frondana, 2016) <doi:10.11606/T.45.2018.tde-02022018-151123>.

Model Butcher (butcher)
Provides a set of five S3 generics to axe components of fitted model objects and help reduce the size of model objects saved to disk.

Parsing Semi-Structured Log Files into Tabular Format (tabulog)
Convert semi-structured log files (such as 'Apache' access.log files) into a tabular format (data.frame) using a standard template system.

Robust Quality Control Chart (rQCC)
Constructs robust quality control charts based on the median and Hodges-Lehmann estimators (location) and the median absolute deviation (MAD) and Shamos estimators (scale), which are unbiased with a sample of finite size. For more details, see Park, Kim and Wang (2019) <arXiv:1908.00462>. This work was partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (No. NRF-2017R1A2B4004169).

### Legal weed is linked to higher junk-food sales

Research suggests marijuana really does give you the munchies

## August 15, 2019

### Distilled News

Semantic segmentation. My absolute favorite task. (More than NLP you ask? Yes.) I would make a deep learning model, have it all nice and trained… but wait. How do I know my model is performing well? In other words, what are the most common metrics for semantic segmentation? Here's a clear-cut guide to the essential metrics that you need to know. I have also included Keras implementations below.

In this post, we look at various tips that can be useful when automating R application testing and continuous integration, with regards to orchestrating parallelization, combining sources from multiple git repositories and ensuring proper access rights for the Jenkins agent.

The other day, I went on Reddit to ask if I should use Python for ETL-related transformations, and the overwhelming response was yes. However, while my fellow Redditors enthusiastically supported using Python, they advised looking into libraries outside of Pandas – citing concerns about Pandas performance with large datasets. After doing some research, I found a ton of Python libraries built for data transformation: some improve Pandas performance, while others offer their own solutions. I couldn't find a comprehensive list of these tools, so I thought I'd compile one using the research I did – if I missed something or got something wrong, please let me know!
I got this wonderful opportunity to work on the project ‘Anomaly detection in Martian Surface’ though Omdena community. The objective of this project is to detect the Anomalies on the martian (MARS) surface caused by non-terrestrial artifacts like derbies of MARS lander missions, rovers, etc. Recently the search for so-called ‘Techno-Signatures’ – measurable properties or effect that provide scientific evidence of past or present extraterrestrial technology, has gained new interests. NASA hosted a ‘Techno-Signature’ Workshop at the Lunar and Planetary Institute in Houston, Texas, on September 2018 to learn more about the current field and state of the art of searches for ‘Techno-Signatures’, and what role NASA might play in these searches in the future. One area in this field of research is the search for non-terrestrial artifacts in the Solar System. This AI challenge is aimed at developing ‘AI Toolbox’ for the Planetary Scientists to help in identifying non-terrestrial artifacts. Having a large dataset is crucial for the performance of the deep learning model. However, we can improve the performance of the model by augmenting the data we already have. Deep learning frameworks usually have built-in data augmentation utilities, but those can be inefficient or lacking some required functionality. In this article, I would like to make an overview of most popular image augmentation packages, designed specifically for machine learning, and demonstrate how to use these packages with PyTorch framework. When it comes to data analytics there are my reasons to move from your local computer to the cloud. Most prominently, you can run an indefinite number of machines without needing to own or maintain them. Furthermore, you can scale up and down as you wish in a matter of minutes. And if you choose to run t2.micro servers you can run for 750 hours a month for free within the first 12 months! After that it’s a couple of bucks per month and server. Alright, let’s get to it then! Understandably you won’t have time to read a ten minute article about RStudio Sever and Amazon Web Services after clicking a title that promised you a solution in 3 minutes. So I skip the formal introduction and cut to the chase. Uber released a new version of Ludwig with new features as well as some improvements to old once. If you don’t already know Uber’s Ludwig is a machine learning toolbox aimed at opening the world of machine learning to none-coders by providing a simple interface to create deep neural networks for lots of different applications. I already covered the basics of Uber’s Ludwig in two other articles. I released the first one right after the release of Uber’s Ludwig in February 2019. It covers the core principles and basics of Uber’s Ludwig. In the second article, I covered how to use Uber’s Ludwig for tabular, image and text data. Machine Learning operations (let’s call it mlOps under the current buzzword pattern xxOps) are quite different from traditional software development operations (devOps). One of the reasons is that ML experiments demand large dataset and model artifact besides code (small plain file). This post presents a solution to version control machine learning models with git and dvc (Data Version Control). Entry level articles on word vectors often contain examples of calculated analogies, such as king-man+woman=queen. Striking examples like this clearly have their place. They guide our interest towards similarity as one of the hidden treasures to delve for. 
In real data, however, analogies are often not so clear and easy to use. In the article, I have briefly presented Market Profile. I have covered why I think Market Profile is still relevant today and some reasoning on why I think in that way. I have also enumerated the three main classic books which cover the theory and a small excerpt of code on how to plot market profiles. A routine to get market profile is not presented because it is highly specific on how do you store your data, but in this example, a prototype in Python was build in just 50 lines. That is just one page of code. I love dancing! There, I said it. Even though I may not want to dance all the time, I do find myself often scrolling through my playlists in search of my most danceable songs. And here’s the thing, it has nothing to do with genres – at least not for me. But it has everything to do with the music. Most people reading this article have seen demonstrations of each probability distribution in Bayes Rule. Most people reading this have been formally introduced to the terms ‘posterior’, ‘prior’, and ‘likelihood’. If not, even better! I think that viewing Bayes Rule as an incremental learning rule would be a novel perspective for many. Further, I believe this perspective would give much better intuition for why we use the terms ‘posterior’ and ‘prior’. Finally, I think this perspective helps explain why I don’t think Bayesian statistics lead to any more inductive bias than Frequentist statistics. Let’s explore the complexity and vulnerability of IT infrastructure and how to build a modern IT infrastructure monitoring solution, using a combination of time series databases with machine learning. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique used to represent high-dimensional dataset in a low-dimensional space of two or three dimensions so that we can visualize it. In contrast to other dimensionality reduction algorithms like PCA which simply maximizes the variance, t-SNE creates a reduced feature space where similar samples are modeled by nearby points and dissimilar samples are modeled by distant points with high probability. At a high level, t-SNE constructs a probability distribution for the high-dimensional samples in such a way that similar samples have a high likelihood of being picked while dissimilar points have an extremely small likelihood of being picked. Then, t-SNE defines a similar distribution for the points in the low-dimensional embedding. Finally, t-SNE minimizes the Kullback-Leibler divergence between the two distributions with respect to the locations of the points in the embedding. Recently, pre-trained models have achieved state-of-the-art results in various language understanding tasks, which indicates that pre-training on large-scale corpora may play a crucial role in natural language processing. Current pre-training procedures usually focus on training the model with several simple tasks to grasp the co-occurrence of words or sentences. However, besides co-occurring, there exists other valuable lexical, syntactic and semantic information in training corpora, such as named entity, semantic closeness and discourse relations. In order to extract to the fullest extent, the lexical, syntactic and semantic information from training corpora, we propose a continual pre-training framework named ERNIE 2.0 which builds and learns incrementally pre-training tasks through constant multi-task learning. 
Experimental results demonstrate that ERNIE 2.0 outperforms BERT and XLNet on 16 tasks, including the English tasks of the GLUE benchmark and several common tasks in Chinese.

We propose a novel method for explaining the predictions of any classifier. In our approach, local explanations are expected to explain both the outcome of a prediction and how that prediction would change if 'things had been different'. Furthermore, we argue that satisfactory explanations cannot be dissociated from a notion and measure of fidelity, as advocated in the early days of neural networks' knowledge extraction. We introduce a definition of fidelity to the underlying classifier for local explanation models which is based on distances to a target decision boundary. A system called CLEAR (Counterfactual Local Explanations via Regression) is introduced and evaluated. CLEAR generates w-counterfactual explanations that state minimum changes necessary to flip a prediction's classification. CLEAR then builds local regression models, using the w-counterfactuals to measure and improve the fidelity of its regressions. By contrast, the popular LIME method [15], which also uses regression to generate local explanations, neither measures its own fidelity nor generates counterfactuals. CLEAR's regressions are found to have significantly higher fidelity than LIME's, averaging over 45% higher in this paper's four case studies.

### Insurance data science : Networks

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers].

At the Summer School of the Swiss Association of Actuaries, in Lausanne, I will start talking about networks and insurance this Friday. Slides are available online.

### Facebook awards $100,000 to 2019 Internet Defense Prize winners

The Internet Defense Prize is a partnership between USENIX and Facebook that aims to reward security research that meaningfully makes the internet more secure.

### Fun with progress bars: Fish, daggers and the Star Wars trench run


If you’re like me, when running a process through a loop you’ll add in counters and progress indicators. That way you’ll know if it will take 5 minutes or much longer. It’s also good for debugging to know when the code wigged-out.

This is typically what's done. You take a time stamp at the start – start <- Sys.time() – print out some indicators at each iteration – cat("iteration", k, "// reading file", file, "\n") – and print out how long it took at the end – print(Sys.time()-start). The problem is that this prints a new line each time it is called, which is fine but ugly. You can reduce the number of lines printed by only printing every 10th or 100th iteration, e.g. if(k %% 10 == 0) ….

A simple way to make this better is, instead of using "\n" for a new line, to use "\r" for a carriage return. This will overwrite the same line, which is much neater. It’s much more satisfying watching a number go up, or down, whichever way is the good direction. Try it out…

y <- matrix(0, nrow = 31, ncol = 5)
for(sim in 1:5){
  y[1, sim] <- rnorm(1, 0, 8)
  for(j in 1:30){
    y[j+1, sim] <- y[j, sim] + rnorm(1) # random walk
    cat("simulation", sim, "// time step", sprintf("%2.0f", j), "// random walk", sprintf(y[j+1, sim], fmt='% 6.2f'), "\r")
    Sys.sleep(0.1)
  }
}

## simulation 5 // time step 30 // random walk   8.97

The best way is to use the {progress} package. This package allows you to simply add running time, eta, progress bars, percentage complete as well as custom counters to your code. First decide on what counters you want and the format of the string. The function identifies counters by using a colon at the beginning of the label. Check the doco for built-in tokens.

To add your own token, add the label to the format string and pass the token to tick(). To make it pretty I recommend formatting digits with sprintf(). Here’s an example.

library(progress)

pb <- progress_bar$new(
  format = ":elapsedfull // eta :eta // simulation :sim // time step :ts // random walk :y [:bar]",
  total = 30*5, clear = FALSE
)

y <- matrix(0, nrow = 31, ncol = 5)
for(sim in 1:5){
  y[1, sim] <- rnorm(1, 0, 8)
  for(j in 1:30){
    y[j+1, sim] <- y[j, sim] + rnorm(1) # random walk
    pb$tick(tokens = list(sim = sim, ts = sprintf("%2.0f", j), y = sprintf(y[j+1, sim], fmt='% 6.2f')))
    Sys.sleep(0.1)
  }
}

00:00:17 // eta  0s // simulation 5 // time step 30 // random walk -12.91 [====================================================]

You can also jazz it up with a bit of colour with {crayon}. Be careful with this: it doesn’t handle varying string lengths very well and can start a new line, exploding your console.

library(crayon)
pb <- progress_bar$new(format = green$bold(":elapsedfull // eta :eta // simulation :sim // time step :ts // random walk :y [:bar]"), total = 30*5, clear = FALSE)
...

 00:00:17 // eta 0s // simulation 5 // time step 30 // random walk -12.91 [====================================================] 

That’s a much neater progress bar.

## But, I didn’t stop there…

Procrastination set in and creative tangents were followed. So, I made a progress bar into a big fish which eats smaller fish … and made it green.

n <- 300
bar_fmt <- green$bold(":elapsedfull | :icon |")
pb <- progress_bar$new(format = bar_fmt, total = n, clear = FALSE)
icon <- progress_bar_icon("fish", n, 75)
for(j in 1:n){
  pb$tick(tokens = list(icon = token(icon, j)))
  Sys.sleep(0.03)
}

Each fish represents 25% completion. Once they’re all gobbled up, the job is done. I also threw knives at boxes. Each box represents 20% completion.

n <- 300
bar_fmt <- green$bold(":elapsedfull | :icon |")
pb <- progress_bar$new(format = bar_fmt, total = n, clear = FALSE)
icon <- progress_bar_icon("dagger", n, 75)
for(j in 1:n){
  pb$tick(tokens = list(icon = token(icon, j)))
  Sys.sleep(0.03)
}

And my personal favourite, the Star Wars trench run.

n <- 500
bar_fmt <- green$bold(":elapsedfull | :icon |")
pb <- progress_bar$new(format = bar_fmt, total = n, clear = FALSE)
icon <- progress_bar_icon("tiefighter", n, 75)
for(j in 1:n){
  pb$tick(tokens = list(icon = token(icon, j)))
  Sys.sleep(0.03)
}

Ok… I have spent way too long on this! But at least it was fun. If you want to play around with it, feel free to download it from Git.

devtools::install_github("doehm/progressart")

The post Fun with progress bars: Fish, daggers and the Star Wars trench run appeared first on Daniel Oehm | Gradient Descending.

### Fresh from the Python Package Index

• Automagica – Robot for Automagica – Smart Robotic Process Automation
• experta – Expert Systems for Python
• jupyterlab-s3-browser – A Jupyter Notebook server extension which acts as a proxy for the S3 API
• labeledclusters – Python code for handling clusters of labeled, high-dimensional data
• pyneural – Library for brain modeling and machine learning in Python 3
• pypotree – Potree visualization for Jupyter notebooks
• pytorch-transformers-nightly-unofficial – Repository of pre-trained NLP Transformer models: BERT, GPT & GPT-2, Transformer-XL, XLNet and XLM
• sparqlprog – Execute logic program queries against a remote SPARQL endpoint
• tipboard2.0 – Tipboard, a flexible solution for creating your dashboards
• autoviml – Automatically Build Variant Interpretable ML models fast!
• causalml – Python Package for Uplift Modeling and Causal Inference with Machine Learning Algorithms
• cincoconfig – Universal configuration file parser
• completejourney-py – Data from the R package completejourney
• contessa – Data-quality framework
• EasyModels – Command Line User Interface for finding pre-trained AI models

### U. of Miami: Faculty Positions, with expertise in AI/Data Science/ML or related areas [Miami, FL]

The positions require research and teaching expertise in AI/Data Science, or related areas including Data Extraction, Data Visualization, Machine Learning, and Intelligent Actuators.

### Document worth reading: “Network reconstruction with local partial correlation: comparative evaluation”

Over the past decade, various methods have been proposed for the reconstruction of networks modeled as Gaussian Graphical Models. In this work we analyzed three different approaches: the Graphical Lasso (GLasso), Graphical Ridge (GGMridge) and Local Partial Correlation (LPC). For the evaluation of the methods, we used high dimensional data generated from simulated random graphs (Erdős–Rényi, Barabási–Albert, Watts–Strogatz). The performance was assessed through the Receiver Operating Characteristic (ROC) curve. In addition, the methods were used for reconstruction of a co-expression network for differentially expressed genes in human cervical cancer data. The LPC method outperformed GLasso in most of the simulation cases, even though GGMridge produced better ROC curves than both other methods. LPC obtained similar outcomes to GGMridge in the real data studies.
Network reconstruction with local partial correlation: comparative evaluation

### Data Driven Government – Speakers Highlights

The lineup of experienced, thought-leading speakers at Data Driven Government, Sep 25 in Washington, DC, will explain how to use data and analytics to more effectively accomplish your mission, increase efficiency, and improve evidence-based policymaking.

### ✚ Annotate Charts to Help Your Data Speak, Because the Data Has No Idea What It Is Doing (The Process #52)

This week, we talk annotation and how it can make your charts more readable and easier to understand.

### How Dataquest Helped Mohammad Become a Machine Learning Engineer

Learn how Mohammad went from zero background in data science to becoming a machine learning engineer with the help of Dataquest’s data science courses. The post How Dataquest Helped Mohammad Become a Machine Learning Engineer appeared first on Dataquest.

### Why does my academic lab keep growing?

Andrew, Breck, and I are struggling with funding for the Stan group at Columbia, just like most small groups in academia. The short story is that to apply for enough grants to give us a decent chance of making payroll in the following year, we have to apply for so many that our expected amount of funding goes up. So our group keeps growing, putting even more pressure on us in the future to write more grants to make payroll. It’s a better kind of problem to have than firing people, but the snowball effect means a lot of work beyond what we’d like to be doing.

Why does my academic lab keep growing? Here’s a simple analysis. For the sake of argument, let’s say your lab has a $1.5M annual budget. And to keep things simple, let’s suppose all grants are $0.5M. So you need three per year to keep the lab afloat. Let’s say you have a well-oiled grant machine with a 40% success rate on applications.

Now what happens if you apply for 8 grants? There’s roughly a 30% chance you get fewer than the 3 grants you need, a 30% chance you get exactly the 3 grants you need, and a 40% chance you get more grants than you need.

If you’re like us, a 30% chance of not making payroll is more than you’d like, so let’s say you apply for 10 grants. Now there’s only a 20% chance you won’t make payroll (still not great odds!), a 20% chance you get exactly 3 grants, and a whopping 60% chance you wind up with 4 or more grants.
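These percentages come from a simple binomial model (assuming independent applications, each with a 40% success rate). A quick R check of both scenarios, under those assumptions:

p <- 0.4                               # per-application success rate
needed <- 3                            # grants needed to make payroll
for (n in c(8, 10)) {
  short  <- pbinom(needed - 1, n, p)   # fewer grants than needed
  exact  <- dbinom(needed, n, p)       # exactly the grants needed
  excess <- 1 - short - exact          # more grants than needed
  cat(n, "applications:", round(short, 2), round(exact, 2), round(excess, 2), "\n")
}

This prints roughly 0.32 / 0.28 / 0.41 for 8 applications and 0.17 / 0.21 / 0.62 for 10, matching the rounded figures above.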

The more conservative you are about making payroll, the bigger this problem is.

Wait and See?

It’s not quite as bad as that analysis leads one to believe, because once a lab’s rolling, it’s usually working in two-year chunks, not one-year chunks. But it takes a while to build up that critical mass.

It would be great if you could apply and wait and see before applying again, but it’s not so easy. Most government grants have fixed deadlines, typically once or at most twice per year. The ones like NIH that have two submission periods per year have a tendency not to fund first applications. So if you don’t apply in a cycle, it’s usually at least another year before you can apply again. Sometimes special one-time-only opportunities with partners or funding agencies come up. We also run into problems like government shutdowns – I still have two NSF grants under review that have been backed up forever (we’ve submitted and heard back on other grants from NSF in the meantime).

The situation with Stan at Columbia

We’ve received enough grants to keep us going. But we have a bunch more in process, some of which we’re cautiously optimistic about. And we’ve already received about half a grant more than we anticipated, so we’re going to have to hire even if we don’t get the ones in process.

So if you know any postdocs or others who might want to work on the Stan language in OCaml and C++, let me know (carp@alias-i.com). A more formal job ad will be out soon.

### How Concerned Should You be About Predictor Collinearity? It Depends…

Predictor collinearity (also known as multicollinearity) can be problematic for your regression models. Check out these rules of thumb about when, and when not, to be concerned.

### Jobs: 2 PhD and RA positions at University of Luxembourg

** Nuit Blanche is now on Twitter: @NuitBlog **

Kumar also sent me the following announcements for different positions:

Dear Igor,
I was wondering if you could post on Nuit-Blanche the announcement of the following Ph.D./R.A. positions at SnT, University of Luxembourg on signal processing for next-generation radar systems.
Thanks!
--
Regards,

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email.

### Violence in Afghanistan last year was worse than in Syria

As NATO draws down forces, the Taliban have reclaimed much of the country

### Command Line Basics Every Data Scientist Should Know

Check out this introductory guide to completing simple tasks with the command line.

### Replication police methodological terrorism stasi nudge shoot the messenger wtf

(The link comes from Stuart Richie.) Sunstein later clarified:

I’ll take Sunstein’s word that he no longer thinks it’s funny to attack people who work for open science and say that they’re just like people who spread disinformation. I have no idea what Sunstein thinks the “grain of truth” is, but I guess that’s his problem.

Last word on this particular analogy comes from Nick Brown:

The bigger question

The bigger question is: What the hell is going on here? I assume that Sunstein doesn’t think that “good people doing good and important work” would be Stasi in another life. Also, I don’t know who “the replication police” are. After all, it’s Cass Sunstein and Brian Wansink, not Nick Brown, Anna Dreber, Uri Simonsohn, etc., who’ve been appointed to policymaking positions within the U.S. government.

What this looks like to me is a sort of alliance of celebrities. The so-called “replication police” aren’t police at all—unlike the Stasi, they have no legal authority or military power. Perhaps even more relevant, the replication movement is all about openness, whereas the defenders of shaky science are often shifty about their data, their analyses, and their review processes. If you want a better political analogy, how about this:

The open-science movement is like the free press. It’s not perfect, but when it works it can be one of the few checks against powerful people and institutions.

I couldn’t fit in Stasi or terrorists here, but that’s part of the point: Brown, Dreber, Simonsohn, etc., are not violent terrorists, and they’re not spreading disinformation. Rather, they’re telling, and disseminating, truths that are unpleasant to some well-connected people.

Following the above-linked thread led me to this excerpt that Darren Dahly noticed from Sunstein’s book Nudge:

Jeez. Citing Wansink . . . ok, sure, back in the day, nobody knew that those publications were so flawed. But to describe Wansink’s experiments as “masterpieces” . . . what’s with that? I guess I understand, kind of. It’s the fellowship of the celebrities. Academic bestselling authors gotta stick together, right?

Several problems with science reporting, all in one place

I’d like to focus on one particular passage from Sunstein’s reporting on Wansink:

Wansink asked the recipients of the big bucket whether they might have eaten more because of the size of their bucket. Most denied the possibility, saying, “Things like that don’t trick me.” But they were wrong.

This quote illustrates several problems with science reporting:

1. Personalization; scientist-as-hero. It’s all Wansink, Wansink, Wansink. As if he did the whole study himself. As we now know, Wansink was the publicity man, not the detail man. I don’t know if these studies had anyone attending to detail, at least when it came to data collection and analysis. But, again, the larger point is that the scientist-as-hero narrative has problems.

2. Neglect of variation. Even if the study were reported and analyzed correctly, it could still be that the subset of people who said they were not influenced by the size of the bucket were not influenced. You can’t know, based on the data collected in this between-person study. We’ve discussed this general point before: it’s a statistical error to assume that an average pattern applies to everyone, or even to most people.

3. The claim that people are easily fooled. Gerd Gigerenzer has written about this a lot: There’s a lot of work being done by psychologists, economists, etc., sending the message that people are stupid and easily led astray by irrelevant stimuli. The implication is that democratic theory is wrong, that votes are determined by shark attacks, college football games, and menstrual cycles, so maybe we, the voters, can’t be reasoned with directly, we just have to be . . . nudged.

It’s frustrating to me how a commentator such as Sunstein is so ready to believe that participants in that popcorn experiment were “wrong” and then at the same time so quick to attack advocates for open science. If the open science movement had been around fifteen years ago, maybe Sunstein and lots of others wouldn’t have been conned. Not being conned is a good thing, no?

P.S. I checked Sunstein’s twitter feed to see if there was more on this Stasi thing. I couldn’t find anything, but I did notice this link to a news article he wrote, evaluating the president’s performance based on the stock market (“In terms of the Dow, 2018 was also pretty awful, with a 5.6 percent decline — the worst since 2008.”) Is that for real??

P.P.S. Look. We all make mistakes. I’m sure Sunstein is well-intentioned, just as I’m sure that the people who call us “terrorists” etc. are well-intentioned, etc. It’s just . . . openness is a good thing! To look at people who work for openness and analogize them to spies whose entire existence is based on secrecy and lies . . . that’s really some screwed-up thinking. When you’re turned around that far, it’s time to reassess, not just issue semi-apologies indicating that you think there’s a “grain of truth” to your attack. We’re all on the same side here, right?

P.P.P.S. Let me further clarify.

Bringing up Sunstein’s 2008 endorsement of Wansink is not a “gotcha.”

Back then, I probably believed all those sorts of claims too. As I’ve written in great detail, the past decade has seen a general rise in sophistication regarding published social science research, and there’s lots of stuff I believed back then, that I wouldn’t trust anymore. Sunstein fell for the hot hand fallacy fallacy too, but then again so did I!

Here’s the point. From one standpoint, Brian Wansink and Cass Sunstein are similar: They’re both well-funded, NPR-beloved Ivy League professors who’ve written best-selling books. They go on TV. They influence government policy. They’re public intellectuals!

But from another perspective, Wansink and Sunstein are completely different. Sunstein cares about evidence, Wansink shows no evidence of caring about evidence. When Sunstein learns he made a mistake, he corrects it. When Wansink learns he made a mistake, he muddies the waters.

I think the differences between Sunstein and Wansink are more important than the similarities. I wish Sunstein would see this too. I wish he’d see that the scientists and journalists who want to open things up, to share data, to reveal their own mistakes as well as those of others, are on his side. And the sloppy researchers, those who resist open data, open methods, and open discussion, are not.

To put it another way: I’m disturbed that an influential figure such as Sunstein thinks that the junk science produced by Brian Wansink and other purveyors of unreplicable research amounts to “masterpieces,” while he thinks it’s “funny” with “a grain of truth” to label careful, thoughtful analysts such as Brown, Dreber, and Simonsohn as “Stasi.” Dude’s picking the wrong side on this one.

### EARL London – agenda highlights


There are so many wonderful EARL talks happening this year – it’s hard to highlight them all! But we thought we’d share some that the Mango team are really looking forward to:

### Ana Henriques, PartnerRe

Using R in Production at PartnerRe

Ana Henriques is the Analytics Tool Lead in PartnerRe’s Life & Health Department. Ana is now focused on business-side delivery of platforms and tools to support data science and related functions. Her talk will focus on the open source infrastructure supporting this process: version control, continuous integration, containerisation and container deployment and orchestration.

### Kevin Kuo, RStudio

Towards open collaboration in insurance analytics

Kevin is a software engineer at RStudio and is the founder of Kasa AI, a community organization for open research in insurance analytics. Kevin will be introducing Kasa AI, a not-for-profit community initiative for open research and software development for insurance analytics. Inspired by rOpenSci and Bioconductor, his team hopes to bring together the insurance community to solve the most impactful problems.

### Charlotte Wise, Essence

Beyond the average: a Bayesian approach for setting media targets

Charlotte manages a small team of analysts at Essence, a global media agency and part of GroupM, WPP. Her talk will cover how the team at Essence overcame the issue of reporting ROI on marketing campaigns by using a hierarchical Bayesian model.

### Kasia Kulma, Mango Solutions

Integrating empathy in the Data Science process

Kasia Kulma is a Data Scientist at Mango Solutions and holds a PhD in evolutionary biology from Uppsala University. Kasia’s talk will demonstrate how empathy has a clearly defined role at every step of the Data Science process: from pitching project ideas and gathering requirements, to implementing solutions, informing and influencing stakeholders, and gauging the impact of the product.

### Mitchell Stirling, Heathrow Airport

Understanding Airport Baggage Demand through R modelling

Mitchell is a Senior Analyst at Heathrow Airport with seven years’ experience working in Operations, Commercial and Strategic positions. Heathrow Airport is entering a new phase of growth and the team there wanted to look at potential scenarios for occupancy and use of infrastructure to maximise existing assets and reduce the need for expensive capital works early in the programme. To explore how these scenarios would impact the demand on baggage systems, Heathrow has worked with Mango to convert a legacy Perl script into an R package and make a number of improvements that cut down manual intervention, flag errors earlier, stabilise the process and allow for greater variation in key inputs.

There are plenty more speakers on the agenda for you to take a look at so why not join us in September for 3 days of R, learning, inspiration and fun!

Tickets available now.


### Introducing the Plato Research Dialogue System: Building Conversational Applications at Uber’s Scale

While the process of building simple, domain-specific chatbots has gotten way easier, building large scale, multi-agent conversational applications remains a massive challenge. Recently, the Uber engineering team open sourced the Plato Research Dialogue System, which is the framework powering conversational agents across Uber’s different applications.

### Labeling, transforming, and structuring training data sets for machine learning

The O’Reilly Data Show Podcast: Alex Ratner on how to build and manage training data with Snorkel.

In this episode of the Data Show, I speak with Alex Ratner, project lead for Stanford’s Snorkel open source project; Ratner also recently garnered a faculty position at the University of Washington and is currently working on a company supporting and extending the Snorkel project. Snorkel is a framework for building and managing training data. Based on our survey from earlier this year, labeled data remains a key bottleneck for organizations building machine learning applications and services.

Ratner was a guest on the podcast a little over two years ago when Snorkel was a relatively new project. Since then, Snorkel has added more features, expanded into computer vision use cases, and now boasts many users, including Google, Intel, IBM, and other organizations. Along with his thesis advisor professor Chris Ré of Stanford, Ratner and his collaborators have long championed the importance of building tools aimed squarely at helping teams build and manage training data. With today’s release of Snorkel version 0.9, we are a step closer to having a framework that enables the programmatic creation of training data sets.

### Four short links: 15 August 2019

Data Businesses, Data Science Class, Tiny Mouse, and Training Bias

1. Making Uncommon Knowledge Common -- The Rich Barton playbook is building data content loops to disintermediate incumbents and dominate search, and then using this traction to own demand in their industries.
2. Data: Past, Present, and Future -- Data and data-empowered algorithms now shape our professional, personal, and political realities. This course introduces students both to critical thinking and practice in understanding how we got here, and the future we now are building together as scholars, scientists, and citizens. The way "Intro to Data Science" classes ought to be.
3. Clever Travel Mouse -- very small presenter tool, mouse and pointer.
4. Training Bias in "Hate Speech Detector" Means Black Speech is More Likely to be Censored (BoingBoing) -- The authors do a pretty good job of pinpointing the cause: the people who hand-labeled the training data for the algorithm were themselves biased, and incorrectly, systematically misidentified AAE writing as offensive. And since machine learning models are no better than their training data (though they are often worse!), the bias in the data propagated through the model.

### Predicting whether you are Democrat or Republican

The New York Times is in a quizzy mood lately. Must be all the hot weather. Sahil Chinoy shows how certain demographics tend towards Democrat or Republican, with a hook that lets you put in your own information. A decision tree updates as you go.

Reminds me of the Amanda Cox decision tree classic from 2008.

### The Layman’s Guide to Banking as a Service

Banking as a Service (BaaS) is the democratisation of financial capabilities that have fiercely been protected, isolated and hidden in silos for hundreds of years by banks. The fact that BaaS opens up banks’ capabilities and essentially empowers anyone to be able to create their own financial products, goes against

The post The Layman’s Guide to Banking as a Service appeared first on Dataconomy.

### If you did not already know

Limited Gradient Descent
Label noise may handicap the generalization of classifiers, and it is an important issue how to effectively learn the main pattern from samples with noisy labels. Recent studies have witnessed that deep neural networks tend to prioritize learning simple patterns and then memorize noise patterns. This suggests a method to search for the best generalization, which learns the main pattern until the noise begins to be memorized. A natural idea is to use a supervised approach to find the stop timing of learning, for example by resorting to a clean verification set. In practice, however, a clean verification set is sometimes not easy to obtain. To solve this problem, we propose an unsupervised method called limited gradient descent to estimate the best stop timing. We modify the labels of a few samples in the noisy dataset to be almost-false labels as a reverse pattern. By monitoring the learning progress of the noisy samples and the reverse samples, we can determine the stop timing of learning. In this paper, we also provide some sufficient conditions on learning with noisy labels. Experimental results on CIFAR-10 demonstrate that our approach has similar generalization performance to those supervised methods. For uncomplicated datasets, such as MNIST, we add a relabeling strategy to further improve generalization and achieve state-of-the-art performance. …

Focused Attention Network
Attention networks show promise for both vision and language tasks, by emphasizing relationships between constituent elements through appropriate weighting functions. Such elements could be regions in an image output by a region proposal network, or words in a sentence, represented by word embedding. Thus far, however, the learning of attention weights has been driven solely by the minimization of task specific loss functions. We here introduce a method of learning attention weights to better emphasize informative pair-wise relations between entities. The key idea is to use a novel center-mass cross entropy loss, which can be applied in conjunction with the task specific ones. We then introduce a focused attention backbone to learn these attention weights for general tasks. We demonstrate that the focused attention module leads to a new state-of-the-art for the recovery of relations in a relationship proposal task. Our experiments show that it also boosts performance for diverse vision and language tasks, including object detection, scene categorization and document classification. …

Dual User and Product Memory Network (DUPMN)
In sentiment analysis (SA) of product reviews, both user and product information are proven to be useful. Current tasks handle user profile and product information in a unified model which may not be able to learn salient features of users and products effectively. In this work, we propose a dual user and product memory network (DUPMN) model to learn user profiles and product reviews using separate memory networks. Then, the two representations are used jointly for sentiment prediction. The use of separate models aims to capture user profiles and product information more effectively. Compared to state-of-the-art unified prediction models, the evaluations on three benchmark datasets, IMDB, Yelp13, and Yelp14, show that our dual learning model gives performance gain of 0.6%, 1.2%, and 0.9%, respectively. The improvements are also deemed very significant measured by p-values. …

BM-GAN
Machine learning (ML) has progressed rapidly during the past decade and the major factor that drives such development is the unprecedented large-scale data. As data generation is a continuous process, this leads to ML service providers updating their models frequently with newly-collected data in an online learning scenario. In consequence, if an ML model is queried with the same set of data samples at two different points in time, it will provide different results. In this paper, we investigate whether the change in the output of a black-box ML model before and after being updated can leak information of the dataset used to perform the update. This constitutes a new attack surface against black-box ML models and such information leakage severely damages the intellectual property and data privacy of the ML model owner/provider. In contrast to membership inference attacks, we use an encoder-decoder formulation that allows inferring diverse information ranging from detailed characteristics to full reconstruction of the dataset. Our new attacks are facilitated by state-of-the-art deep learning techniques. In particular, we propose a hybrid generative model (BM-GAN) that is based on generative adversarial networks (GANs) but includes a reconstructive loss that allows generating accurate samples. Our experiments show effective prediction of dataset characteristics and even full reconstruction in challenging conditions. …

### Hardware realization of a CS-based MIMO radar

** Nuit Blanche is now on Twitter: @NuitBlog **

Kumar just sent me the following the other day:

Hi Igor,
We recently published our work on the hardware realization of a CS-based MIMO radar in IEEE Transactions on Aerospace and Electronic Systems. Your readers might be interested in this.
https://ieeexplore.ieee.org/abstract/document/8743424
--
Regards,
Kumar Vijay Mishra
Thanks, Kumar!

Here is the abstract:

We present a cognitive prototype that demonstrates a colocated, frequency-division-multiplexed, multiple-input multiple-output (MIMO) radar which implements both temporal and spatial sub-Nyquist sampling. The signal is sampled and recovered via the Xampling framework. Cognition is due to the fact that the transmitter adapts its signal spectrum by emitting only those subbands that the receiver samples and processes. Real-time experiments demonstrate sub-Nyquist MIMO recovery of target scenes with 87.5% spatio-temporal bandwidth reduction and signal-to-noise ratio of -10 dB.

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email.

### Big Data: Wrangling 4.6M Rows with dtplyr (the NEW data.table backend for dplyr)


Wrangling Big Data is one of the best features of the R programming language, which boasts a Big Data Ecosystem that contains fast in-memory tools (e.g. data.table) and distributed computational tools (sparklyr). With the NEW dtplyr package, data scientists with dplyr experience gain the benefits of the data.table backend. We saw a 3X speed boost for dplyr!

We’ll go over the pros and cons and what you need to know to get up and running, using a real-world example of Fannie Mae loan performance data that, when combined, is 4.6M rows by 55 columns – not super huge, but enough to show off the new and improved dtplyr interface to the data.table package. We’ll end with a Time Study showing a 3X Speed Boost and Learning Recommendations to get you to expertise fast.

If you like this article, we have more just like it in our Machine Learning Section of the Business Science Learning Hub.

## 1.0 The 30-Second Summary

We reviewed the latest advance in big data – The NEW dtplyr package, which is an interface to the high performance data.table library.

### Pros

• A 3X speed boost on the data joining and wrangling operations on a 4.6M row data set. The data wrangling operations were performed in 6 seconds with dtplyr vs 18 seconds with dplyr.

• Performs inplace operations (:=), which vastly accelerates big data computations (see grouped time series lead() operation in Section 3.7 tutorial)

• Shows the data.table translation (this is really cool!)

### Cons

• For pure speed, you will need to learn all of data.table’s features including managing keys for fast lookups.

• In most cases, data.table will be faster than dtplyr because of overhead in the dtplyr translation process. However, we saw the difference to be very minimal.

• dtplyr is currently experimental – testers wanted! File issues and requests here.

### What Should You Learn?

Just starting out? Our recommendation is to learn dplyr first, then learn data.table, using dtplyr to bridge the gap

• Begin with dplyr, which has easy-to-learn syntax and works well for datasets of 1M Rows+.

• Learn data.table as you become comfortable in R. data.table is great for pure speed on data sets 50M Rows+. It has a different “bracketed” syntax that is streamlined but more complex for beginners. However, it has features like fast keyed subsetting and optimization for rolling joins that are out of the scope of this article.

• Use dtplyr as a translation tool to help bridge the gap between dplyr and data.table.

At a bare minimum – Learning dplyr is essential. Learn more about a system for learning dplyr in the Conclusions and Recommendations.

## 2.0 Big Data Ecosystem

R has an amazing ecosystem of tools designed for wrangling Big Data. The 3 most popular tools are dplyr, data.table, and sparklyr. We’ve trained hundreds of students on big data, and our students’ most common Big Data question is, “Which tool should I use, and when?”

Big Data: Data Wrangling Tools By Dataset Size

The “Big Data: Data Wrangling Tools by Dataset Size” graphic comes from Business Science’s Learning Lab 13: Wrangling 4.6M Rows (375 MB) of Financial Data with data.table where we taught students how to use data.table using Fannie Mae’s Financial Data Set. The graphic provides rough guidelines on when to use which tools by dataset row size.

1. dplyr (website) – Used for in-memory calculations. Syntax design and execution emphasizes readability over performance. Very good in most situations.

2. data.table (website) – Used for higher in-memory performance. Modifies data inplace for huge speed gains. Easily wrangles data in the range of 10M-50M+ rows.

3. sparklyr (website) – Distribute work across nodes (clusters) and performs work in parallel. Best used on big data (100M+ Rows).

## 3.0 Enter dtplyr: Boost dplyr with data.table backend

We now have a 4th tool that boosts dplyr using data.table as its backend. The good news is that if you are already familiar with dplyr, you don’t need to learn much to get the gains of data.table!

dtplyr: Bridging the Big Data Gap

The dtplyr package is a new front-end that wraps the high-performance data.table R package. I say new, but dtplyr has actually been around for over 2 years. However, the implementation recently underwent a complete overhaul, vastly improving the functionality. Let’s check out the goals of the package from the dtplyr website: https://dtplyr.tidyverse.org/.

dtplyr for Big Data

Here’s what you need to know:

• Goal: Increase speed of working with big data when using dplyr syntax

• Implementation: The dtplyr package enables the user to write dplyr code. Internally the package translates the code to data.table syntax. When run, the user gains the faster performance of data.table while being able to write the more readable dplyr code.

• Dev Status: The package is still experimental. This means that developers are still in the process of testing the package out, reporting bugs, and improving via feature requests.

## 4.0 Case Study – Wrangling 4.6M Rows (375MB) of Financial Data

Let’s try out the new and improved dtplyr + data.table combination on a large-ish data set.

### 4.1 Bad Loans Cost Millions (and Data Sets are MASSIVE)

Loan defaults cost organizations millions. Further, the datasets are massive. This is a task where data.table and dtplyr will be needed as part of the preprocessing steps prior to building a Machine Learning Model.

### 4.2 Fannie Mae Data Set

The data used in the tutorial can be downloaded from Fannie Mae’s website. We will just be using the 2018 Q1 Acquisition and Performance data set.

A few quick points:

• The 2018 Q1 Performance Data Set we will use is 4.6M rows, enough to bring Excel to a grinding halt, crashing your computer in the process.

• For dplyr, it’s actually do-able at 4.6M rows. However, if we were to do the full 25GB, we’d definitely want to use data.table to speed things up.

• We’ll do a series of common data manipulation operations including joins and grouped time series calculation to determine which loans become delinquent in the next 3 months.

### 4.3 Install and Load Libraries

In this tutorial, we’ll use the latest development version of dtplyr, installed using devtools. All other packages can be installed with install.packages().

Next, we’ll load the following libraries with library():

• data.table: High-performance data wrangling
• dtplyr: Interface between dplyr and data.table
• tidyverse: Loads dplyr and several other useful R packages
• vroom: Fast reading of delimited files (e.g. csv) with vroom()
• tictoc: Simple timing operations
• knitr: Use the kable() function for nice HTML tables
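Something along these lines works for the setup (a sketch; the GitHub repository name is assumed to be the tidyverse one):

devtools::install_github("tidyverse/dtplyr")   # development version (assumed repo)

library(data.table)   # high-performance data wrangling
library(dtplyr)       # interface between dplyr and data.table
library(tidyverse)    # loads dplyr and several other useful packages
library(vroom)        # fast reading of delimited files
library(tictoc)       # simple timing operations
library(knitr)        # kable() for nice HTML tables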

We’ll read the data. The column-types are going to be pre-specified to assist in the loading process. The vroom() function does the heavy lifting.

First, I’ll set up the paths to the two files I’ll be reading:

1. Acquisitions_2018Q1.txt – Meta-data about each loan
2. Performance_2018Q1.txt – Time series data set with loan performance characteristics over time

For me, the files are stored in a folder called 2019-08-15-dtplyr. Your paths may be different depending on where the files are stored.
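Roughly, the read step looks like this (a sketch: the Fannie Mae text files are pipe-delimited without a header row, and the full column-name and column-type specification from the original post is abbreviated here to a single default type):

acq_path  <- "2019-08-15-dtplyr/Acquisitions_2018Q1.txt"
perf_path <- "2019-08-15-dtplyr/Performance_2018Q1.txt"

acquisitions_tbl <- vroom(acq_path, delim = "|", col_names = FALSE,
                          col_types = cols(.default = col_character()))
performance_tbl  <- vroom(perf_path, delim = "|", col_names = FALSE,
                          col_types = cols(.default = col_character()))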

#### Read the Loan Acquisition Data

Note that we specify the column names and types to improve the speed of reading.

The loan acquisition data contains information about the owner of the loan.

loan_id original_channel seller_name original_interest_rate original_upb original_loan_term original_date first_pay_date original_ltv original_cltv number_of_borrowers original_dti original_borrower_credit_score first_time_home_buyer loan_purpose property_type number_of_units occupancy_status property_state zip primary_mortgage_insurance_percent product_type original_coborrower_credit_score mortgage_insurance_type relocation_mortgage_indicator
100001040173 R QUICKEN LOANS INC. 4.250 453000 360 2018-01-01 2018-03-01 65 65 1 28 791 N C PU 1 P OH 430 NA FRM NA NA N
100002370993 C WELLS FARGO BANK, N.A. 4.250 266000 360 2018-01-01 2018-03-01 80 80 2 41 736 N R PU 1 P IN 467 NA FRM 793 NA N
100005405807 R PMTT4 3.990 233000 360 2017-12-01 2018-01-01 79 79 2 48 696 N R SF 1 P CA 936 NA FRM 665 NA N
100008071646 R OTHER 4.250 184000 360 2018-01-01 2018-03-01 80 80 1 48 767 Y P PU 1 P FL 336 NA FRM NA NA N
100010739040 R OTHER 4.250 242000 360 2018-02-01 2018-04-01 49 49 1 22 727 N R SF 1 P CA 906 NA FRM NA NA N
100012691523 R OTHER 5.375 180000 360 2018-01-01 2018-03-01 80 80 1 14 690 N C PU 1 P OK 730 NA FRM NA NA N

Get the size of the acquisitions data set: 426K rows by 25 columns. Not that bad, but this is meta-data for the loan. The dataset we are worried about is the next one.

#### Read the Loan Performance Data

Let’s inspect the data. We can see that this is a time series where each “Loan ID” and “Monthly Reporting Period” go together.

loan_id monthly_reporting_period servicer_name current_interest_rate current_upb loan_age remaining_months_to_legal_maturity adj_remaining_months_to_maturity maturity_date msa current_loan_delinquency_status modification_flag zero_balance_code zero_balance_effective_date last_paid_installment_date foreclosed_after disposition_date foreclosure_costs prop_preservation_and_repair_costs asset_recovery_costs misc_holding_expenses holding_taxes net_sale_proceeds credit_enhancement_proceeds repurchase_make_whole_proceeds other_foreclosure_proceeds non_interest_bearing_upb principal_forgiveness_upb repurchase_make_whole_proceeds_flag foreclosure_principal_write_off_amount servicing_activity_indicator
100001040173 2018-02-01 QUICKEN LOANS INC. 4.25 NA 0 360 360 2048-02-01 18140 0 N   NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA   NA N
100001040173 2018-03-01   4.25 NA 1 359 359 2048-02-01 18140 0 N   NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA   NA N
100001040173 2018-04-01   4.25 NA 2 358 358 2048-02-01 18140 0 N   NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA   NA N
100001040173 2018-05-01   4.25 NA 3 357 357 2048-02-01 18140 0 N   NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA   NA N
100001040173 2018-06-01   4.25 NA 4 356 356 2048-02-01 18140 0 N   NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA   NA N
100001040173 2018-07-01   4.25 NA 5 355 355 2048-02-01 18140 0 N   NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA   NA N

Let’s check out the data size. We can see it’s 4.6M rows by 31 columns! Just a typical financial time series (seriously).

### 4.5 Convert the Tibbles to dtplyr Steps

Next, we’ll use the lazy_dt() function to convert the tibbles to dtplyr steps.

We can check the class() to see what we are working with.

The returned object is the first step in a dtplyr sequence.
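In code, the conversion is a one-liner per table (a sketch, reusing the hypothetical object names from the read step above):

acquisitions_dt <- lazy_dt(acquisitions_tbl)   # tibble -> dtplyr step
performance_dt  <- lazy_dt(performance_tbl)    # tibble -> dtplyr step
class(performance_dt)                          # roughly: "dtplyr_step_first" "dtplyr_step"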

Key Point:

• We are going to set up operations using a sequence of steps.
• The operations will not be fully evaluated until we convert to a data.table or tibble depending on our desired output.

### 4.6 Join the Data Sets

Our first data manipulation operation is a join. We are going to use the left_join() function from dplyr. Let’s see what happens.

The output of the joining operation is a new step sequence, this time a dtplyr_step_subset.

Next, let’s examine what happens when we print combined_dt to the console.
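A sketch of the join, keeping every performance record and attaching the loan meta-data by loan_id:

combined_dt <- performance_dt %>%
  left_join(acquisitions_dt, by = "loan_id")

combined_dt   # printing shows the data.table translation and a preview, without running the full join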

Key Points:

• The important piece is the data.table translation code, which we can see in the output: Call: _DT2[_DT1, on = .(loan_id)]

• Note that we haven’t executed the data manipulation operation. dtplyr smartly gives us a glimpse of what the operation will look like, though, which is really cool.

### 4.7 Wrangle the Data

We’ll do a sequence of data wrangling operations:

• Select specific columns we want to keep
• Arrange by loan_id and monthly_reporting_period. This is needed to keep groups together and in the right time-stamp order.
• Group by loan_id and mutate to calculate whether or not loans become delinquent in the next 3 months.
• Filter rows with NA values from the newly created column (these aren’t needed)
• Reorder the columns to put the new calculated column first.
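A sketch of that sequence in dplyr syntax (the selected columns are illustrative; the flag name matches the translation shown below):

final_output_dt <- combined_dt %>%
  select(loan_id, monthly_reporting_period,
         current_loan_delinquency_status, current_interest_rate, loan_age) %>%
  arrange(loan_id, monthly_reporting_period) %>%
  group_by(loan_id) %>%
  mutate(gt_1mo_behind_in_3mo = lead(current_loan_delinquency_status, n = 3) >= 1) %>%
  filter(!is.na(gt_1mo_behind_in_3mo)) %>%
  select(gt_1mo_behind_in_3mo, everything())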

The final output is a dtplyr_step_group, which is just a sequence of steps.

If we print the final_output_dt object, we can see the data.table translation is pretty intense.

Key Point:

• The most important piece is that dtplyr correctly converted the grouped mutation to an inplace calculation, which is data.table speak for a super-fast calculation that makes no copies of the data. Here’s inplace calculation code from the dtplyr translation: [, :=(gt_1mo_behind_in_3mo = lead(current_loan_delinquency_status, n = 3) >= 1), keyby = .(loan_id)]

### 4.8 Collecting The Data

Note that up until now, nothing has been done to process the data – we’ve just created a recipe for data wrangling. We still need to tell dtplyr to execute the data wrangling operations.

To implement all of the steps and convert the dtplyr sequence to a tibble, we just call as_tibble().

Key Point:

• Calling the as_tibble() function tells dtplyr to execute the data.table wrangling operations.

## 5.0 The 3X Speedup – Time Comparisons

Finally, let’s check the performance of dplyr vs dtplyr vs data.table. We can see a nice 3X speed boost!
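A rough sketch of how the timing comparison can be set up with {tictoc}; run_dplyr_pipeline() and run_dtplyr_pipeline() are hypothetical wrappers around the tibble-based and lazy_dt-based versions of the join-and-wrangle steps above:

tic("dplyr")
result_dplyr  <- run_dplyr_pipeline(performance_tbl, acquisitions_tbl)    # hypothetical wrapper
toc()

tic("dtplyr")
result_dtplyr <- run_dtplyr_pipeline(performance_dt, acquisitions_dt)     # hypothetical wrapper
toc()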

## 6.0 Conclusions and Learning Recommendations

For Big Data wrangling, the dtplyr package represents a huge opportunity for data scientists to leverage the speed of data.table with the readability of dplyr. We saw an impressive 3X Speedup going from dplyr to using dtplyr for wrangling a 4.6M row data set. This just scratches the surface of the potential, and I’m looking forward to seeing dtplyr mature, which will help bridge the gap between the two groups of data scientists using dplyr and data.table.

For new data scientists coming from other tools like Excel, my hope is that you see the awesome potential of learning R for data analysis and data science. The Big Data capabilities represent a massive opportunity for you to bring data science at scale to your organization.

### You just need to learn how to go from normal data to Big Data.

My recommendation is to start by learning dplyr – The popular data manipulation library that makes reading and writing R code very easy to understand.

Once you get to an intermediate level, learn data.table. This is where you gain the benefits of scaling data science to Big Data. The data.table package has a steeper learning curve, but learning it will help you leverage its full performance and scalability.

If you need to learn dplyr as fast as possible – I recommend beginning with our Data Science Foundations DS4B 101-R Course. The 101 Course is available as part of the 3-Course R-Track Bundle, a complete learning system designed to transform you from beginner to advanced in under 6-months. You will learn everything you need to become an expert data scientist.

## 7.0 Additional Big Data Guidelines

I find that students have an easier time picking a tool based on dataset row size (e.g. I have 10M rows, what should I use?). With that said, there are 2 factors that will influence which tools you need to use:

1. Are you performing Grouped and Iterative Operations? Performance even on normal data sets can become an issue if you have a lot of groups or if the calculation is iterative. A particular source of pain in the financial realm is rolling (window) calculations, which are both grouped and iterative within groups. In these situations, use high-performance C++ functions (e.g. rolling functions from the roll or RcppRoll packages – a small sketch follows this list).

2. Do you have sufficient RAM? Once you begin working with gigs of data, you start to run out of memory (RAM). In these situations, you will need to work in chunks and parallelize operations. You can do this with distributed sparklyr, which will perform some operations in parallel and distribute them across nodes.
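For the grouped rolling case, a small sketch with {RcppRoll} (column and object names follow the case study above; the 3-month window is arbitrary):

library(RcppRoll)

performance_tbl %>%
  group_by(loan_id) %>%
  arrange(monthly_reporting_period, .by_group = TRUE) %>%
  mutate(delinq_3mo_avg = roll_meanr(as.numeric(current_loan_delinquency_status),
                                     n = 3, fill = NA))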

## 8.0 Recognizing the Developers

I’d like to take a quick moment to thank the developers of data.table and dplyr. Without these two packages, Business Science probably would not exist. Thank you.

## 9.0 Coming Soon – Expert Shiny Apps Course!

I’m very excited to announce that Business Science has an Expert Shiny Course – Coming soon! Head over to Business Science University and create a free account. I will update you with the details shortly.


### Document worth reading: “A Survey on Compressive Sensing: Classical Results and Recent Advancements”

Recovering sparse signals from linear measurements has demonstrated outstanding utility in a vast variety of real-world applications. Compressive sensing is the topic that studies the associated raised questions for the possibility of a successful recovery. This topic is well-nourished and numerous results are available in the literature. However, their dispersity makes it challenging and time-consuming for new readers and practitioners to quickly grasp its main ideas and classical algorithms, and further touch upon the recent advancements in this surging field. Besides, the sparsity notion has already demonstrated its effectiveness in many contemporary fields. Thus, these results are useful and inspiring for further investigation of related questions in these emerging fields from new perspectives. In this survey, we gather and overview vital classical tools and algorithms in compressive sensing and describe significant recent advancements. We conclude this survey by a numerical comparison of the performance of described approaches on an interesting application.

A Survey on Compressive Sensing: Classical Results and Recent Advancements

### Distilled News

If you’re new to data science/machine learning, you probably wondered a lot about the nature and effect of the buzzword ‘feature normalization’. If you’ve read any Kaggle kernels, it is very likely that you found feature normalization in the data preprocessing section. So, what is data normalization and why the heck is it so valued by data practitioners?
In the last article, I explained the problems with including irrelevant or correlated features in model building. In this article, I’ll show you several neat implementations of selection algorithms that can be easily integrated into your project pipeline. Before diving into the detailed implementation, let’s go through the dataset I created. The dataset has 20 features, among which 5 contribute to the output and 2 are correlated.
1. Wrapper Feature Selection
2. Filtering Feature Selection
3. Embedded Feature Selection
I would like to start my first Machine Learning project. But I do not have tools. What should I do? What are the tools I could use? I will give you some hints and advice based on the toolbox I use. Of course there are more great tools, but you should pick the ones you like. You should also use the tools that make your work productive, which sometimes means you need to pay for them (though not always – I do use free tools as well). The first and most important thing is that there are lots of options! Just pick what works for you! I have divided this post into several parts covering the environments, the languages and the libraries.
Computer vision is an interdisciplinary field that has been gaining huge amounts of traction in recent years (since CNNs), and self-driving cars have taken centre stage. Another integral part of computer vision is object detection. Object detection aids in pose estimation, vehicle detection, surveillance, etc. The difference between object detection algorithms and classification algorithms is that in detection algorithms we try to draw a bounding box around the object of interest to locate it within the image. Also, you might not necessarily draw just one bounding box in an object detection case; there could be many bounding boxes representing different objects of interest within the image, and you would not know how many beforehand.
Recently I found myself working on a very large data set, one of those that you need to parallelize learning to make it feasible. I immediately thought of Uber’s Horovod. I had previously heard about it from a tech talk at Uber but had not really played around with it. I found it very interesting and a great framework, from the high-level simplifications to the algorithm that powers this framework. In this post, I’ll try to describe my understanding of the latter.
For all R zealots, we know that we can build any data product very efficiently using R. An automated trading system is not an exception. Whether you are doing high-frequency trading, day trading, swing trading, or even value investing, you can use R to build a trading robot that watches the market closely and trades the stocks or other financial instruments on your behalf. The benefits of a trading robot are obvious:
• A trading robot follows our pre-defined trading rules strictly. Unlike human beings, the robot has no emotions involved when it makes trading decisions.
• A trading robot does not need rest (yet). The trading robot can watch the market price movement at every second across multiple financial instruments and execute the order immediately when the timing is correct.
Recently, Google Cloud AI Platform Serving (CAIP) added a new feature which Machine Learning (ML) practitioners can now use to deploy models with customized pre-processing pipelines and prediction routines using their favorite frameworks, all under one serverless microservice. In this blog post, I explain in detail how we at Wootric make use of this feature and mention a few nitty-gritties to be careful about, as the product is still in beta.
Are you an anomaly detection professional, or planning to advance modeling in anomaly detection? Then you should not miss this wonderful Python Outlier Detection (PyOD) Toolkit. It is a comprehensive module that has been featured in academic research (see this summary) and on machine learning websites such as Towards Data Science, Analytics Vidhya, KDnuggets, etc.
… in which I discuss a workflow where you can start writing your contents on a jupyter notebook, create a reveal.js slide deck, and host it on github for presentations. This is for a very simple presentation that you can fully control yourself.
Recently I’ve been working with manufacturing customers (both OEM and CM) who want to jump on the bandwagon of machine learning. One common use case is to better detect products (or Device Under Test/DUT) that are defective in their production line. Using machine learning’s terminology, this falls under the problem of binary classification as a DUT can only pass or fail.
Artificial Intelligence. Well, it looks like this cutting-edge technology is now the most popular and at the same time the most decisive one for humanity. We are ceaselessly amazed at AI capabilities and the effective way they can be used in almost any industry. Robots now are just like the airplane 100 years ago. So what’s next? This question raises many emotions, starting from great interest, encouragement and the desire to be part of this process, and ending with fear, complete confusion and ignorance. But what’s stopping you from sitting in one of the front seats of AI development instead of being a passive observer? You may assume getting started as a developer in AI is a long and hard path. Well, yes, but it doesn’t mean you can’t handle it. Let me say one word for those who doubt. Even if you don’t have any prior experience in programming, math or engineering, you can learn AI from scratch sitting at home and start applying your knowledge in practice, creating simple machine learning solutions and making first steps towards your new profession.
Part I. First Off, Gain Basic Skills Required to Start Learning AI
Part II. Start Learning AI – the Most Important Part
An overview of the evolution of NLP models for writing text.
• Markov Chains and N-grams
• Word Embeddings and Neural Language Models
• Recurrent Neural Networks
• Transformers
While using the fast.ai library built on top of PyTorch, I realized that I have never had to interact with an optimizer so far. Since fast.ai already deals with it when calling the fit_one_cycle method, I don’t have to parametrize the optimizer, nor do I need to understand how it works. Adam is probably the most used optimizer in machine learning due to its simplicity and speed. It was developed in 2015 by Diederik Kingma and Jimmy Lei Ba and introduced in a paper called ‘Adam: A Method for Stochastic Optimization’. As always, this blog post is a cheat sheet that I write to check my understanding of a notion. If you find something unclear or incorrect, don’t hesitate to write it in the comment section.
Could you imagine a future where computers made economic decisions rather than governments and central bankers? With all of the economic mishaps we’ve been seeing over the past decade, one could say it isn’t a particularly bad idea! Natural language processing could allow us to make more sense of the economy than we do currently. As it stands, investors and policymakers use index benchmarks and quantitative measures such as GDP growth to gauge economic health. That said, one potential application of NLP is to analyse text data (such as through major economic policy documents), and then ‘learn’ from such texts in order to generate appropriate economic policies independently of human intervention. In this example, an LSTM model is trained using text from a sample ECB policy document, in order to generate ‘new’ text data, with a view to revealing insights from such text that could be used for policy purposes. Specifically, a temperature hyperparameter is configured to control the randomness of text predictions generated, with the relevant text vectorized into sequences of characters, and the single-layer LSTM model then used for next character sampling – with a text generation loop then used to generate a block of text for each temperature (the higher the temperature, the more randomness induced in each block of text).
Style transfer is an exciting sub-field of computer vision. It aims to transfer the style of one image onto another image, known as the content image. This technique allows us to synthesize new images combining the content and style of different images. Several developments have been made in this sub-field, but the most notable initial work (neural style transfer) was done by Gatys et al. in 2015. Some of the results I got by applying this technique can be seen below.
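As a small illustration of the core ingredient in the Gatys et al. formulation, here is a hedged PyTorch sketch of a Gram-matrix style loss; the feature maps would normally come from a pretrained CNN such as VGG, but random tensors stand in for them here:

```python
import torch
import torch.nn.functional as F

def gram_matrix(features):
    """Channel-by-channel correlations of a feature map, used to summarize style."""
    c, h, w = features.shape
    flat = features.view(c, h * w)
    return flat @ flat.t() / (c * h * w)

def style_loss(generated_features, style_features):
    """Mean squared difference between the Gram matrices of two feature maps."""
    return F.mse_loss(gram_matrix(generated_features), gram_matrix(style_features))

# Placeholder feature maps standing in for CNN activations of the two images
gen = torch.rand(64, 32, 32, requires_grad=True)
style = torch.rand(64, 32, 32)
loss = style_loss(gen, style)
loss.backward()  # gradients flow back toward the generated image
print(loss.item())
```

The full method combines this style term (summed over several layers) with a content loss on deeper-layer activations and optimizes the pixels of the generated image directly.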

### Finding out why

Python Library: causalml

Python Package for Uplift Modeling and Causal Inference with Machine Learning Algorithms
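A quick taste of the package, adapted from its README (synthetic data; the exact API may differ across causalml versions):

```python
import numpy as np
from causalml.inference.meta import LRSRegressor

# Synthetic data: X covariates, binary treatment indicator, outcome y
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 5))
treatment = rng.binomial(1, 0.5, size=n)
y = X[:, 0] + 0.5 * treatment + rng.normal(size=n)  # true effect is roughly 0.5

# S-learner with linear regression as the base model
learner = LRSRegressor()
ate, lb, ub = learner.estimate_ate(X, treatment, y)
print(ate, lb, ub)  # point estimate with confidence bounds
```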

Article: Correlation is not causation

Why the confusion of these concepts has profound implications, from healthcare to business management. In correlated data, a pair of variables are related: one is likely to change when the other does. That relationship might lead us to assume that a change in one thing causes the change in the other. The human brain simplifies incoming information so we can make sense of it, and it often does so by making assumptions based on slight relationships, or bias. But that thinking process isn’t foolproof: bias can make us conclude that one thing must cause another simply because both change in the same way at the same time. This article clears up the misconception that correlation equals causation by exploring both concepts and the brain’s tendency toward bias.
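A short simulation makes the point: two variables that share a hidden common cause will be strongly correlated even though neither causes the other. The variable names below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
summer_heat = rng.normal(size=10_000)                                  # hidden common cause
ice_cream_sales = summer_heat + rng.normal(scale=0.5, size=10_000)     # driven by the heat
sunburn_cases = summer_heat + rng.normal(scale=0.5, size=10_000)       # also driven by the heat

# Strong correlation, yet ice cream does not cause sunburn
print(np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1])  # roughly 0.8
```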
There is a growing literature on nonparametric estimation of the conditional average treatment effect given a specific value of the covariates. However, this estimate is often difficult to interpret when covariates are high dimensional, and in practice effect heterogeneity is discussed in terms of subgroups of individuals with similar attributes. We propose to study treatment heterogeneity under the groupwise framework. Our method is simple, based only on linear regression and sample splitting, and is semiparametrically efficient under assumptions. We also discuss ways to conduct multiple testing. We conclude by reanalyzing a get-out-the-vote experiment during the 2014 U.S. midterm elections.
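The paper’s exact estimator is not reproduced here, but the sample-splitting idea from the abstract can be sketched on synthetic data: one fold is used to learn a proxy for the individual effect and define subgroups, the other to estimate group-wise effects. Everything below (the data, the grouping rule, the difference-in-means estimate) is an illustrative assumption, not the authors’ procedure:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 3))
T = rng.binomial(1, 0.5, size=n)
# Heterogeneous effect: stronger for units with large X[:, 0]
y = X @ np.array([1.0, -0.5, 0.2]) + T * (0.5 + X[:, 0]) + rng.normal(size=n)

# Sample splitting: fold A defines the groups, fold B estimates their effects
idx = rng.permutation(n)
fold_a, fold_b = idx[: n // 2], idx[n // 2:]

# On fold A, learn a crude proxy for the individual effect (difference of regressions)
mu1 = LinearRegression().fit(X[fold_a][T[fold_a] == 1], y[fold_a][T[fold_a] == 1])
mu0 = LinearRegression().fit(X[fold_a][T[fold_a] == 0], y[fold_a][T[fold_a] == 0])

# On fold B, split units into "low" and "high" predicted-effect subgroups
tau_hat = mu1.predict(X[fold_b]) - mu0.predict(X[fold_b])
high = tau_hat > np.median(tau_hat)

# Estimate each subgroup's effect with a simple difference in means
for label, mask in [("low", ~high), ("high", high)]:
    yb, tb = y[fold_b][mask], T[fold_b][mask]
    print(label, round(yb[tb == 1].mean() - yb[tb == 0].mean(), 2))
```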
Counterfactual thinking is a psychological phenomenon in which people re-infer how events that have already happened might have turned out under different decisions. It helps people gain more experience from mistakes and thus perform better in similar future tasks. This paper investigates counterfactual thinking as a way for agents to find optimal decision-making strategies in multi-agent reinforcement learning environments. In particular, we propose a multi-agent deep reinforcement learning model whose structure mimics the human psychological process of counterfactual thinking in order to improve agents’ competitive abilities. To this end, our model generates several possible actions (intent actions) with a parallel policy structure and estimates the rewards and regrets for these intent actions based on its current understanding of the environment. Our model incorporates a scenario-based framework to link the estimated regrets with its inner policies. During the iterations, our model updates the parallel policies and the corresponding scenario-based regrets for agents simultaneously. To verify the effectiveness of our proposed model, we conduct extensive experiments on two different environments with real-world applications. Experimental results show that counterfactual thinking helps agents accumulate higher rewards than their opponents given the same information, while maintaining high efficiency.
Can an arbitrarily intelligent reinforcement learning agent be kept under control by a human user? Or do agents with sufficient intelligence inevitably find ways to shortcut their reward signal? This question impacts how far reinforcement learning can be scaled, and whether alternative paradigms must be developed in order to build safe artificial general intelligence. In this paper, we use an intuitive yet precise graphical model called causal influence diagrams to formalize reward tampering problems. We also describe a number of tweaks to the reinforcement learning objective that prevent incentives for reward tampering. We verify the solutions using recently developed graphical criteria for inferring agent incentives from causal influence diagrams.
In causal inference, a variety of causal effect estimands have been studied, including the sample, uncensored, target, conditional, optimal subpopulation, and optimal weighted average treatment effects. Ad-hoc methods have been developed for each estimand based on inverse probability weighting (IPW) and on outcome regression modeling, but these may be sensitive to model misspecification, practical violations of positivity, or both. The contribution of this paper is twofold. First, we formulate the generalized average treatment effect (GATE) to unify these causal estimands as well as their IPW estimates. Second, we develop a method based on Kernel Optimal Matching (KOM) to optimally estimate GATE and to find the GATE most easily estimable by KOM, which we term the Kernel Optimal Weighted Average Treatment Effect. KOM provides uniform control on the conditional mean squared error of a weighted estimator over a class of models while simultaneously controlling for precision. We study its theoretical properties and evaluate its comparative performance in a simulation study. We illustrate the use of KOM for GATE estimation in two case studies: comparing spine surgical interventions and studying the effect of peer support on people living with HIV.

### Course Announcement: Data Mining (36-462/662), Fall 2019

For the first time in ten years, I find myself teaching data mining in the fall. This means I need to figure out what data mining is in 2019. Naturally, my first stab at a syllabus is based on what I thought data mining was in 2009. Perhaps it's changed too little; nonetheless, I'm feeling OK with it at the moment*. I am sure the thoughtful and constructive suggestions of the Internet will only reinforce this satisfaction.

--- Seriously, suggestions are welcome, except for suggesting that I teach about neural networks, which I deliberately omitted ~~because I am an out-of-date stick-in-the-mud~~ for reasons**.

*: Though I am not done selecting readings from the textbook, the recommended books, and sundry articles --- those will however come before the respective classes. I have been teaching long enough to realize that most students, particularly in a class like this, will read just enough of the most emphatically required material to think they know how to do the assignments, but there are exceptions, and anecdotally even some of that majority come back to the material later, and benefit from pointers.

**: On the one hand, CMU (now) has plenty of well-attended classes on neural networks and deep learning, so what would one more add? On the other, my admittedly cranky opinion is that we have no idea why the new crop works better than the 1990s version, and it's not always clear that they do work better than good old-fashioned machine learning, so there.