PyOD is an outlier detection package developed with a comprehensive API to support multiple techniques. This post showcases Part 1 of an overview of techniques for analyzing anomalies in data.
As we close in on its two-year anniversary, Spark NLP is proving itself a viable option for enterprise use.
In July 2016, I broached the idea for an NLP library aimed at Apache Spark users to my friend David Talby. A little over a year later, Talby and his collaborators announced the release of Spark NLP. They described the motivation behind the project in their announcement post and in an accompanying podcast Talby and I recorded, as well as in a recent post comparing popular open source NLP libraries. [Full disclosure: I’m an advisor to Databricks, the startup founded by the team that originated Apache Spark.]
As we close in on the two-year anniversary of the project, I asked Talby where interest in the project has come from, and he graciously shared geo-demographic data of visitors to the project’s homepage:
Of the thousands of visitors to the site, 44% are from the Americas, 24% from Asia-Pacific, and 22% are based in the EMEA region.
Many of these site visitors are turning into users of the project. In our recent survey AI Adoption in the Enterprise, quite a few respondents signalled that they were giving Spark NLP a try. The project also garnered top prize—based on a tally of votes cast by Strata Data Conference attendees—in the open source category at the Strata Data awards in March.
There are many other excellent open source NLP libraries with significant numbers of users—spaCy, OpenNLP, Stanford CoreNLP, NLTK—but at the time the project started, there seemed to be an opportunity for a library that appealed to users who already had Spark clusters (and needed a scalable solution). While the project started out targeting Apache Spark users, it has evolved to provide simple APIs that get things done in a few lines of code and fully hide Spark under the hood. The library’s Python API now has the most users. Installing Spark NLP is a one-liner using pip (pip install spark-nlp) or conda for Python, or a single package pull on Java or Scala using maven, sbt, or spark-packages. The library’s documentation has also grown, and there are public online examples for common tasks like sentiment analysis, named entity recognition, and spell checking. Improvements in documentation, ease of use, and its production-ready implementation of key deep learning models, combined with speed, scalability, and accuracy, have made Spark NLP a viable option for enterprises needing an NLP library.
For more on Spark NLP, join Talby and his fellow instructors for a three-hour tutorial, Natural language understanding at scale with Spark NLP, at the Strata Data Conference in New York City, September 23-26, 2019.
Four short links (27 June 2019): Security Mnemonics, Evidence Might Work, Misinformation Inoculation, and Spoofing Presidential Alerts
• Studies on the Software Testing Profession
• Astra Version 1.0: Evaluating Translations from Alloy to SMT-LIB
• Generation of Pseudo Code from the Python Source Code using Rule-Based Machine Translation
• Cognitive Knowledge Graph Reasoning for One-shot Relational Learning
• A Computational Analysis of Natural Languages to Build a Sentence Structure Aware Artificial Neural Network
• Individualized Group Learning
• Blockchain Games: A Survey
• Topic Modeling via Full Dependence Mixtures
• A Computationally Efficient Method for Defending Adversarial Deep Learning Attacks
• Training Neural Networks for and by Interpolation
• Post-Processing of High-Dimensional Data
• KCAT: A Knowledge-Constraint Typing Annotation Tool
• Unsupervised Neural Single-Document Summarization of Reviews via Learning Latent Discourse Structure and its Ranking
• Improved Sentiment Detection via Label Transfer from Monolingual to Synthetic Code-Switched Text
• A JIT Compiler for Neural Network Inference
• Unsupervised Image Noise Modeling with Self-Consistent GAN
• Improving Prediction Accuracy in Building Performance Models Using Generative Adversarial Networks (GANs)
• Deep Reinforcement Learning for Cyber Security
• Sub-policy Adaptation for Hierarchical Reinforcement Learning
• Contrastive Multiview Coding
• Reweighted Expectation Maximization
• Semantics to Space(S2S): Embedding semantics into spatial space for zero-shot verb-object query inferencing
• Learning to Forget for Meta-Learning
• Landslide Geohazard Assessment With Convolutional Neural Networks Using Sentinel-2 Imagery Data
• Solving Large-Scale 0-1 Knapsack Problems and its Application to Point Cloud Resampling
• On Horadam-Lucas sequence
• Inverse Problems, Regularization and Applications
• Deep Learning-Based Decoding of Constrained Sequence Codes
• Robust and interpretable blind image denoising via bias-free convolutional neural networks
• S3: A Spectral-Spatial Structure Loss for Pan-Sharpening Networks
• Measuring the Gain of a Micro-Channel Plate/Phosphor Assembly Using a Convolutional Neural Network
• Enriching Neural Models with Targeted Features for Dementia Detection
• On Longest Common Property Preserved Substring Queries
• Utilizing Edge Features in Graph Neural Networks via Variational Information Maximization
• Interpretable ICD Code Embeddings with Self- and Mutual-Attention Mechanisms
• Time scales in stock markets
• An image-driven machine learning approach to kinetic modeling of a discontinuous precipitation reaction
• Deep Network Approximation Characterized by Number of Neurons
• Understanding Human Context in 3D Scenes by Learning Spatial Affordances with Virtual Skeleton Models
• Localization in Gaussian disordered systems at low temperature
• Fractional cocoloring of graphs
• Scalable Community Detection over Geo-Social Network
• Character n-gram Embeddings to Improve RNN Language Models
• A Meta Approach to Defend Noisy Labels by the Manifold Regularizer PSDR
• Binomial edge ideals of cographs
• MIMA: MAPPER-Induced Manifold Alignment for Semi-Supervised Fusion of Optical Image and Polarimetric SAR Data
• Self-organized avalanches in globally-coupled phase oscillators
• Meta-heuristic for non-homogeneous peak density spaces and implementation on 2 real-world parameter learning/tuning applications
• Know What You Don’t Know: Modeling a Pragmatic Speaker that Refers to Objects of Unknown Categories
• Signed Hultman Numbers and Signed Generalized Commuting Probability in Finite Groups
• New constructions of asymptotically optimal codebooks via character sums over a local ring
• Illuminant Chromaticity Estimation from Interreflections
• Zeroth-Order Stochastic Block Coordinate Type Methods for Nonconvex Optimization
• Deep Variational Networks with Exponential Weighting for Learning Computed Tomography
• Sparse Approximate Factor Estimation for High-Dimensional Covariance Matrices
• Identifying Illicit Accounts in Large Scale E-payment Networks — A Graph Representation Learning Approach
• Lattice Transformer for Speech Translation
• Mir-BFT: High-Throughput BFT for Blockchains
• Associated Learning: Decomposing End-to-end Backpropagation based on Auto-encoders and Target Propagation
• Game-Theoretic Mixed $H_2/H_{\infty}$ Control with Sparsity Constraint for Multi-agent Networked Control Systems
• A Turing Kernelization Dichotomy for Structural Parameterizations of $\mathcal{F}$-Minor-Free Deletion
• Hypotheses testing and posterior concentration rates for semi-Markov processes
• Rate Balancing for Multiuser MIMO Systems
• Hypercontractivity for global functions and sharp thresholds
• Learning Spatio-Temporal Representation with Local and Global Diffusion
• Proactive Human-Machine Conversation with Explicit Conversation Goals
• The Consensus Number of a Cryptocurrency (Extended Version)
• Direct Sampling of Bayesian Thin-Plate Splines for Spatial Smoothing
• Spaceland Embedding of Sparse Stochastic Graphs
• Cut Selection For Benders Decomposition
• Information capacity of a network of spiking neurons
• Amur Tiger Re-identification in the Wild
• On discrete idempotent paths
• Variance Estimation For Online Regression via Spectrum Thresholding
• Counting integer points of flow polytopes
• On the first fall degree of summation polynomials
• $c^+$GAN: Complementary Fashion Item Recommendation
• On Edge-Partitioning of Complete Geometric Graphs into Plane Trees
• Noether theorem for action-dependent Lagrangian functions: conservation laws for non-conservative systems
• A review of available software for adaptive clinical trial design
• On Convex Graphs Having Plane Spanning Subgraph of Certain Type
• Non-convex optimization via strongly convex majorization-minimization
• Densities for piecewise deterministic Markov processes with boundary
• Vertex properties of maximum scattered linear sets of $\mathrm{PG}(1,q^n)$
• Antonym-Synonym Classification Based on New Sub-space Embeddings
• An Asymmetric Random Rado Theorem: 1-statement
• Decentralised Multi-Demic Evolutionary Approach to the Dynamic Multi-Agent Travelling Salesman Problem
• Dense Deformation Network for High Resolution Tissue Cleared Image Registration
• On the Complexity of an Augmented Lagrangian Method for Nonconvex Optimization
• Self-organized critical balanced networks: a unified framework
• Strategic customer behavior in a queueing system with alternating information structure
• Quasi-Stationary Distributions and Resilience: What to get from a sample?
• On the 4-color theorem for signed graphs
• Nearly all cacti are edge intersection hypergraphs of 3-uniform hypergraphs
• Use of Emergency Departments by Frail Elderly Patients: Temporal Patterns and Case Complexity
• A stabilized DG cut cell method for discretizing the linear transport equation
• Comparative Analysis of Switching Dynamics in Different Memristor Models
• A Semi-strong Perfect Digraph Theorem
• Grid R-CNN Plus: Faster and Better
• Rate of change of frequency under line contingencies in high voltage electric power networks with uncertainties
• Smooth digraphs modulo primitive positive constructability
• Lower a posteriori error estimates on anisotropic meshes
• Modeling and Verifying Cyber-Physical Systems with Hybrid Active Objects
• Slim DensePose: Thrifty Learning from Sparse Annotations and Motion Cues
• 2D Attentional Irregular Scene Text Recognizer
• Hilbert Space Fragmentation and Many-Body Localization
• Iterative subtraction method for Feature Ranking
• Curriculum Learning for Cumulative Return Maximization
• Dynamic Control of Functional Splits for Energy Harvesting Virtual Small Cells: a Distributed Reinforcement Learning Approach
• Querying a Matrix through Matrix-Vector Products
• Generating and Exploiting Probabilistic Monocular Depth Estimates
• Information-theoretic measures for non-linear causality detection: application to social media sentiment and cryptocurrency prices
• Distributed High-dimensional Regression Under a Quantile Loss Function
• Contrastive Bidirectional Transformer for Temporal Representation Learning
• Memory-Efficient Group-by Aggregates over Multi-Way Joins
• Nonlinear System Identification via Tensor Completion
• Modeling the Dynamics of PDE Systems with Physics-Constrained Deep Auto-Regressive Networks
• The iMaterialist Fashion Attribute Dataset
• Anderson localisation in stationary ensembles of quasiperiodic operators
• Graphs of bounded depth-$2$ rank-brittleness
• The rank of sparse random matrices
• Efficient calibration for high-dimensional computer model output using basis methods
• Semantic Change and Semantic Stability: Variation is Key
• Microscopic and macroscopic perspectives on stationary nonequilibrium states
• Hypersimplicial subdivisions
• Anti dependency distance minimization in short sequences. A graph theoretic approach
• Advance gender prediction tool of first names and its use in analysing gender disparity in Computer Science in the UK, Malaysia and China
• Knock Intensity Distribution and a Stochastic Control Framework for Knock Control
• Deep Unfolding for Communications Systems: A Survey and Some New Directions
• Training Image Estimators without Image Ground-Truth
• Machine Learning Based Analysis and Quantification of Potential Power Gain from Passive Device Installation
• Characteristic Power Series of Graph Limits
• A Low-Power Domino Logic Architecture for Memristor-Based Neuromorphic Computing
• UCAM Biomedical translation at WMT19: Transfer learning multi-domain ensembles
• On the Walks and Bipartite Double Coverings of Graphs with the same Main Eigenspace
• Modeling and Control of Combustion Phasing in Dual-Fuel Compression Ignition Engines
• Extending Eigentrust with the Max-Plus Algebra
• Egocentric affordance detection with the one-shot geometry-driven Interaction Tensor
• Topological Data Analysis for Arrhythmia Detection through Modular Neural Networks
• The Replica Dataset: A Digital Replica of Indoor Spaces
• Modeling and Interpreting Real-world Human Risk Decision Making with Inverse Reinforcement Learning
• Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index
• On bulk deviations for the local behavior of random interlacements
• Lower Bounds for Adversarially Robust PAC Learning
• On mean-field theories of dynamics in supercooled liquids
• Robust Regression for Safe Exploration in Control
• Telephonetic: Making Neural Language Models Robust to ASR and Semantic Noise
• Solution of the Unconditional Extremal Problem for a Linear-Fractional Integral Functional Depending on the Parameter
• Kernel and Deep Regimes in Overparametrized Models
• Overcoming Mean-Field Approximations in Recurrent Gaussian Process Models
• Robust linear domain decomposition schemes for reduced non-linear fracture flow models
• The Communication Complexity of Optimization
• On co-minimal pairs in abelian groups
• Goal-conditioned Imitation Learning
• Fractional Local Dimension
• Deep Reinforcement Learning for Industrial Insertion Tasks with Visual Inputs and Natural Rewards
• Concentration estimates for algebraic intersections
• Mask2Lesion: Mask-Constrained Adversarial Skin Lesion Image Synthesis
• Turing complete mechanical processor via automated nonlinear system design
• Multivariate polynomials for generalized permutohedra
• Spectra and eigenspaces from regular partitions of Cayley (di)graphs of permutation groups
• Detecting Photoshopped Faces by Scripting Photoshop
• Show, Match and Segment: Joint Learning of Semantic Matching and Object Co-segmentation
• Joint Concept Matching-Space Projection Learning for Zero-Shot Recognition
• Report-Sensitive Spot-Checking in Peer-Grading Systems
• Can generalised relative pose estimation solve sparse 3D registration?
• IntrinSeqNet: Learning to Estimate the Reflectance from Varying Illumination
• Learning Instance Occlusion for Panoptic Segmentation
• Dynamic PET cardiac and parametric image reconstruction: a fixed-point proximity gradient approach using patch-based DCT and tensor SVD regularization
• Hallucinating Bag-of-Words and Fisher Vector IDT terms for CNN-based Action Recognition
• Identify treatment effect patterns for personalised decisions
• Distributionally Robust Counterfactual Risk Minimization
“The hype about the possibilities and possible applications of artificial intelligence (AI) seems currently unlimited. The AI procedures and solutions are praised as true panaceas. However, when viewed soberly, they are just another tool in the toolbox of IT experts.” Marc Botha (21.01.2019 18:07)
Unrolled neural networks emerged recently as an effective model for learning inverse maps appearing in image restoration tasks. However, their generalization risk (i.e., test mean-squared-error) and its link to network design and train sample size remain mysterious. Leveraging Stein’s Unbiased Risk Estimator (SURE), this paper analyzes the generalization risk with its bias and variance components for recurrent unrolled networks. We particularly investigate the degrees-of-freedom (DOF) component of SURE, the trace of the end-to-end network Jacobian, to quantify the prediction variance. We prove that DOF is well-approximated by the weighted path sparsity of the network under incoherence conditions on the trained weights. Empirically, we examine the SURE components as a function of train sample size for both recurrent and non-recurrent (with many more parameters) unrolled networks. Our key observations indicate that: 1) DOF increases with train sample size and converges to the generalization risk for both recurrent and non-recurrent schemes; 2) the recurrent network converges significantly faster (with fewer train samples) than the non-recurrent scheme; hence, recurrence serves as a regularization for low-sample-size regimes.
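The DOF term above is the trace of the end-to-end network Jacobian, which is never formed explicitly in practice; a standard way to approximate such a trace is a Monte Carlo probe with finite differences. Below is a generic sketch in R, not code from the paper; f stands in for the trained network applied to a numeric vector y:

dof_estimate <- function(f, y, sigma = 1e-3, n_probes = 10) {
  mean(replicate(n_probes, {
    eps <- rnorm(length(y))
    # eps' J eps, via a finite-difference Jacobian-vector product, estimates trace(J)
    sum(eps * (f(y + sigma * eps) - f(y))) / sigma
  }))
}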
Leaner Style Sheets (rless)
Converts LESS to CSS. It uses the V8 engine to run the LESS parser. Functions for converting LESS text, files, or folders are provided.
Orchestrate and Exchange Data with ‘EtherCalc’ Instances (ethercalc)
The ‘EtherCalc’ (https://ethercalc.net) web application is a multi-user, collaborative spreadsheet t …
Dimensionality Assessment Using Minimum Rank Factor Analysis (EFA.MRFA)
Performs parallel analysis (Timmerman & Lorenzo-Seva, 2011 <doi:10.1037/a0023353>) and hull method (Lorenzo-Seva, Timmerman, & Kiers, 201 …
Finds the Archetypal Analysis of a Data Frame (archetypal)
Performs archetypal analysis by using Convex Hull approximation under a full control of all algorithmic parameters. It contains functions useful for fi …
Evolutionary algorithms (EAs) are population-based metaheuristics, originally inspired by aspects of natural evolution. Modern varieties incorporate a broad mixture of search mechanisms, and tend to blend inspiration from nature with pragmatic engineering concerns; however, all EAs essentially operate by maintaining a population of potential solutions and in some way artificially ‘evolving’ that population over time. Particularly well-known categories of EAs include genetic algorithms (GAs), Genetic Programming (GP), and Evolution Strategies (ES). EAs have proven very successful in practical applications, particularly those requiring solutions to combinatorial problems. EAs are highly flexible and can be configured to address any optimization task, without the requirements for reformulation and/or simplification that would be needed for other techniques. However, this flexibility goes hand in hand with a cost: the tailoring of an EA’s configuration and parameters, so as to provide robust performance for a given class of tasks, is often a complex and time-consuming process. This tailoring process is one of the many ongoing research areas associated with EAs. Evolutionary Algorithms
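Since the paragraph above describes the generic EA loop (maintain a population, select, recombine, mutate), here is a minimal genetic-algorithm sketch in R; it is only a toy illustration of that loop on a one-max fitness function, not code from any particular EA paper:

fitness <- function(bits) sum(bits)  # toy objective: count of 1-bits (one-max)
ga <- function(n_pop = 50, n_bits = 32, n_gen = 100, p_mut = 1/n_bits) {
  pop <- matrix(sample(0:1, n_pop * n_bits, replace = TRUE), n_pop, n_bits)
  for (g in seq_len(n_gen)) {
    fit <- apply(pop, 1, fitness)
    # tournament selection: the fitter of two random individuals becomes a parent
    pick <- function() { i <- sample(n_pop, 2); pop[i[which.max(fit[i])], ] }
    pop <- t(replicate(n_pop, {
      p1 <- pick(); p2 <- pick()
      cut <- sample(n_bits - 1, 1)                 # one-point crossover
      child <- c(p1[1:cut], p2[(cut + 1):n_bits])
      flip <- runif(n_bits) < p_mut                # bit-flip mutation
      child[flip] <- 1 - child[flip]
      child
    }))
  }
  pop[which.max(apply(pop, 1, fitness)), ]         # best individual found
}
best <- ga()  # returns the fittest bit string after 100 generations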
Data Engineering – Basics of Apache Airflow – Build Your First Pipeline
Understanding different Loss Functions for Neural Networks
How to build a custom Dataset for Tensorflow
How Data Stories Help You Solve Analytics Challenges and Drive Impact – by Design
How Google Uses Reinforcement Learning to Train AI Agents in the Most Popular Sport in the World
An Introduction to Bayesian Inference
Correlation and Causation – How alcohol affects life expectancy
Understanding Value Of Correlations In Data Science Projects
Demystifying Tensorflow Time Series: Local Linear Trend
Open-domain question answering with DeepPavlov
Applications of MCMC for Cryptography and Optimization
End-to-end learning, the (almost) every purpose ML method
Blackman-Tukey Spectral Estimator in R!
There are two definitions of the power spectral density (PSD). Both definitions are mathematically nearly identical and define a function that describes the distribution of power over the frequency components in our data set. The periodogram PSD estimator is based on the first definition of the PSD (see periodogram post). The Blackman-Tukey spectral estimator (BTSE) is based on the second definition. The second definition says, find the PSD by calculating the Fourier transform of the autocorrelation sequence (ACS). In R this definition is written as
PSD <- function(rxx) {
fft(rxx)
}
where fft is the R implementation of the fast Fourier transform, rxx is the autocorrelation sequence (ACS), the k’th element of the ACS is rxx[k] = E[x[0]x[k]] for k from -infinity to +infinity, and E is the expectation operator. The xx in rxx[k] is a reminder that r is a correlation between x and itself. The rxx[k]s are sometimes called lags. The ACS has the property that rxx[-k]=rxx[k]*, where * is the complex conjugate. In this post we will only use real numbers, so I’ll drop the * from here forward.
So, to find the PSD we just calculate rxx and take its fft! Unfortunately, in practice, we cannot do this. Calculating the expected value requires the probability density function (PDF) of x, which we don’t know, and an infinite amount of data, which we don’t have. So, we can’t calculate the PSD: we’re doomed!
No, we are not doomed. We can’t calculate the PSD, but we can estimate it! We can derive an estimator for the PSD from its definition. First, we replace rxx with an estimate of rxx: the expected value, which gives the exact rxx, is replaced with an average, which gives us an estimate of rxx. The E[x[0]x[k]] is replaced with (1/N)(x[1]x[1+k]+x[2]x[2+k]+…+x[N-1-k]x[N-1]+x[N-k]x[N]), where N is the number of data samples. For example, if k=0, then rxx[k]=(1/N)*sum(x*x). In R code the estimate is written as
lagEstimate <- function(x, k, N = length(x)) {
  # biased estimate of the lag-k autocorrelation: average of x[i]*x[i+k]
  (1/N) * sum(x[1:(N-k)] * x[(k+1):N])
}
If we had an infinite amount of data, N = infinity, we could use lagEstimate to estimate the entire infinite ACS. Unfortunately, we don’t have an infinite amount of data, and even if we did, it wouldn’t fit into a computer. So, we can only estimate a finite number of ACS elements. The function below calculates lags 0 to kMax.
Lags <- function(x, kMax) {
  sapply(0:kMax, lagEstimate, x = x)
}
Before we can try these functions out we need data. In this case the data came from a random process with the PSD plotted in the figure below. The x-axis is normalized frequency (frequency divided by the sampling rate). So, if the sampling rate was 1000 Hz, you could multiply the normalized frequency by 1000 Hz and the frequency axis would read 0 Hz to 1000 Hz. The y-axis is in dB (10log10(amplitude)). You can see six large sharp peaks in the plot and a gradual dip towards 0 Hz and then back up. Some of the peaks are close together and will be hard to resolve.
The data produced by the random process is plotted below. This is the data we will use throughout this post.
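The post doesn’t include the code that generated x, so to follow along you can synthesize a stand-in signal. The six frequencies below are hypothetical, chosen only so that there are two closely spaced pairs (near 0.4 and 0.6) like the ones described above:

set.seed(1)
N <- 1024
n <- 0:(N - 1)
freqs <- c(0.10, 0.18, 0.38, 0.42, 0.58, 0.62)  # normalized frequencies (assumed)
# six sinusoids with random phases, plus white noise
x <- rowSums(sapply(freqs, function(f) cos(2*pi*f*n + runif(1, 0, 2*pi)))) + rnorm(N, sd = 0.5)

With this stand-in, your lag values will differ from the ones printed below, but the shape of the estimates should be similar.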
Let’s calculate the ACS up to the 5th lag using the data.
Lags(x,kMax=5)
## [1] 6.095786 -1.368336 3.341608 1.738122 -1.737459 3.651765
A kMax of 5 gives us 6 lags: {r[0], r[1], r[2], r[3], r[4], r[5]}. These 6 lags are not an ACS, but are part of an ACS.
We used Lags to estimate the positive lags up to kMax, but the ACS is an even sequence: r[-k]=r[k] for all k. So, let’s write a function to make a sequence consisting of lags from r[-kMax] to r[kMax]. This is a windowed ACS: values outside +/- kMax are replaced with 0. Where it won’t cause confusion, I’ll refer to the windowed ACS as the ACS.
acsWindowed <- function(x, kMax, Nzero = 0) {
  # estimate lags 0..kMax (optionally zero padded), then mirror to get -kMax..-1
  rHalf <- c(Lags(x, kMax), rep(0, Nzero))
  c(rev(rHalf[2:length(rHalf)]), rHalf)
}
Let’s try this function out.
acsW <- acsWindowed(x,9)
In the figure below you can see the r[0] lag, the maximum, is plotted in the middle of the plot.
The ACS in the figure above is how the ACS is usually plotted in textbooks. In textbooks the sum in the Fourier transform ranges from -N/2 to (N-1)/2, so the r[0] lag should be in the center of the plot. In R the sum in the Fourier transform ranges from 1 to N, so the 0’th lag has to be first. We could just make the sequence in R form, but it is often handy to start in textbook form and switch to R form. We can write a function to make switching from textbook form to R form easy.
Textbook2R <- function(x,N=length(x),foldN=ceiling(N/2)) {
c(x[foldN:N],x[1:(foldN-1)])
}
Notice in the figure below the maximum lag r[0], is plotted at the beginning.
Let’s imagine we have an infinite amount of data and used it to estimate an infinite number of ACS lags. Let’s call that sequence rAll. We make a windowed ACS by setting rW=rAll*W, where W=1 for our 9 lags and 0 everywhere else. W is called the rectangular window because, as you can see in the plot below, its plot looks like a rectangle. By default, when we estimate a finite number of lags we are using a rectangular window.
W <- c(rep(0,9),rep(1,9),rep(0,9))
The reason we cannot use a rectangular window is that its Fourier transform is not always positive. As you can see in the plot below, there are several values below zero, indicated with the dotted line. The Re() function removes some small imaginary numbers due to numerical error, some imaginary dust we have to sweep up.
FFT_W <- Re(fft(Textbook2R(W)))
Even though the fft of the ACS rAll is positive, the product of rAll and a rectangular window might not be positive! The Bartlett window is a simple window whose fft is positive.
BartlettWindow <- function(N, n = seq(0, N - 1)) {
  1 - abs((n - (N-1)/2) / ((N-1)/2))
}
Wb <- BartlettWindow(19)
As you can see in the plot below the Fourier transform of the Bartlett window is positive.
WbFft <- Re(fft(Textbook2R(Wb)))
Now that we can estimate the ACS and window our estimate, we are ready to estimate the PSD of our data. The BTSE is written as
Btse <- function(rHat,Wb) {
Re(fft(rHat*Wb))
}
Note the Re() is correcting for numerical error.
In the first example we use a 19 point ACS lag sequence.
rHat <- Textbook2R(acsWindowed(x,kMax=9))
Wb <- Textbook2R(BartlettWindow(length(rHat)))
Pbtse9 <- Btse(rHat,Wb)
In the figure below is the BTSE calculated with a maximum lag of 9. The dotted lines indicate the locations of the peaks in the PSD we are trying to estimate. With a maximum lag of only 9, the estimate is poor.
We calculate a new estimate with a maximum lag of 18.
rHat <- Textbook2R(acsWindowed(x,kMax=18))
Wb <- Textbook2R(BartlettWindow(length(rHat)))
Pbtse18 <- Btse(rHat,Wb)
This estimate, made with a maximum lag of 18, is better, but the peaks around 0.4 and 0.6 are still not resolved. We need to increase the maximum lag further.
Finally we increase the maximum lag to 65 and recalculate the estimate.
rHat <- Textbook2R(acsWindowed(x,kMax=65))
Wb <- Textbook2R(BartlettWindow(length(rHat)))
Pbtse65 <- Btse(rHat,Wb)
This final estimate is very good. All six peaks are resolved, and the locations of our estimated peaks are very close to the true peak locations.
Could we use 500 lags in the BTSE? In this case we could, since we have a lot of data, but the higher lags get estimated with less data and therefore have more variance. Using the high variance lags will produce a higher variance estimate.
Are there other ways to improve the BTSE other than using more lags? Yes! There are a few other ways. For instance, we could zero pad the lags. Basically add zeros to the end of our lag sequence. This will make the fft, in the BTSE estimator, evaluate the estimate at more frequencies and we will be able to see more details in the estimated PSD.
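Here is one way to do that, reusing the Nzero argument already built into acsWindowed above; the Bartlett window is padded with the same number of zeros so it still lines up with the 131 nonzero lags (a sketch of the idea, not code from the original post):

rHatPad <- Textbook2R(acsWindowed(x, kMax = 65, Nzero = 64))
WbPad <- Textbook2R(c(rep(0, 64), BartlettWindow(131), rep(0, 64)))
Pbtse65pad <- Btse(rHatPad, WbPad)  # same 131 lags, evaluated at 259 frequencies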
Also keep in mind there are other PSD estimation methods that do better on other PSD features. For instance, if you were more interested in finding deep nulls rather than peaks, a moving-average PSD estimator would be better.
Gap-Measure Tests with Applications to Data Integrity Verification
Learning Representations by Maximizing Mutual Information Across Views
Learning Interpretable Shapelets for Time Series Classification through Adversarial Regularization
Big-Data Clustering: K-Means or K-Indicators?
NodeDrop: A Condition for Reducing Network Size without Effect on Output
A Hybrid RNN-HMM Approach for Weakly Supervised Temporal Action Segmentation
Correctness Verification of Neural Networks
Transforming Complex Sentences into a Semantic Hierarchy
Neural networks grown and self-organized by noise
Episodic Memory in Lifelong Language Learning
Random Path Selection for Incremental Learning
A Case for Backward Compatibility for Human-AI Teams
Towards Fair and Decentralized Privacy-Preserving Deep Learning with Blockchain
Back Attention Knowledge Transfer for Low-resource Named Entity Recognition
Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs
Robust Mean Estimation with the Bayesian Median of Means
Attributed Graph Clustering via Adaptive Graph Convolution
Progressive Self-Supervised Attention Learning for Aspect-Level Sentiment Analysis
Toward Building Conversational Recommender Systems: A Contextual Bandit Approach
An Efficient Graph Convolutional Network Technique for the Travelling Salesman Problem
An interpretable machine learning framework for modelling human decision behavior
Universal Boosting Variational Inference
Kinetic Market Model: An Evolutionary Algorithm
A Novel Hyperparameter-free Approach to Decision Tree Construction that Avoids Overfitting by Design
The Extended Dawid-Skene Model: Fusing Information from Multiple Data Schemas
Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts
Privacy-preserving Crowd-guided AI Decision-making in Ethical Dilemmas
CCMI : Classifier based Conditional Mutual Information Estimation
The world is witnessing an unprecedented growth of cyber-physical systems (CPS), which are foreseen to revolutionize our world via creating new services and applications in a variety of sectors such as environmental monitoring, mobile-health systems, intelligent transportation systems and so on. The information and communication technology (ICT) sector is experiencing a significant growth in data traffic, driven by the widespread usage of smartphones, tablets and video streaming, along with the significant growth of sensor deployments that are anticipated in the near future. This is expected to outstandingly increase the growth rate of raw sensed data. In this paper, we present the CPS taxonomy via providing a broad overview of data collection, storage, access, processing and analysis. Compared with other survey papers, this is the first panoramic survey on big data for CPS, where our objective is to provide a panoramic summary of different CPS aspects. Furthermore, CPS require cybersecurity to protect them against malicious attacks and unauthorized intrusion, which becomes a challenge with the enormous amount of data that is continuously being generated in the network. Thus, we also provide an overview of the different security solutions proposed for CPS big data storage, access and analytics. We also discuss big data meeting green challenges in the contexts of CPS. Big Data Meet Cyber-Physical Systems: A Panoramic Survey
• There is no general AI: Why Turing machines cannot pass the Turing test
• Sionnx: Automatic Unit Test Generator for ONNX Conformance
• Boosting Few-Shot Visual Learning with Self-Supervision
• Task Agnostic Continual Learning via Meta Learning
• Warping Resilient Time Series Embeddings
• Explore, Propose, and Assemble: An Interpretable Model for Multi-Hop Reading Comprehension
• Functional Singular Spectrum Analysis
• Better Code, Better Sharing: On the Need of Analyzing Jupyter Notebooks
• Representation Learning for Words and Entities
• GluonTS: Probabilistic Time Series Models in Python
• Learning Curves for Deep Neural Networks: A Gaussian Field Theory Perspective
• COMET: Commonsense Transformers for Automatic Knowledge Graph Construction
• Pairwise Fairness for Ranking and Regression
• Tensor Canonical Correlation Analysis
• Dynamic Time Scan Forecasting
• Linear Distillation Learning
• Random Tessellation Forests
• A Brief Introduction to Manifold Optimization
• Factorized Mutual Information Maximization
• Reinforcement Learning of Spatio-Temporal Point Processes
• EKT: Exercise-aware Knowledge Tracing for Student Performance Prediction
• Automatically Evaluating Balance: A Machine Learning Approach
• Support Vector Machine-Based Fire Outbreak Detection System
• Tackling Climate Change with Machine Learning
• Deep Two-path Semi-supervised Learning for Fake News Detection
• Generating Long and Informative Reviews with Aspect-Aware Coarse-to-Fine Decoding
• Traffic signal control optimization under severe incident conditions using Genetic Algorithm
• A Focus on Neural Machine Translation for African Languages
• Calibration, Entropy Rates, and Memory in Language Models
• Towards Resilient UAV: Escape Time in GPS Denied Environment with Sensor Drift
• Focal Loss based Residual Convolutional Neural Network for Speech Emotion Recognition
• Deep Learning based Emotion Recognition System Using Speech Features and Transcriptions
• CUED@WMT19: EWC&LMs
• Translating Translationese: A Two-Step Approach to Unsupervised Machine Translation
• Temporally-Biased Sampling Schemes for Online Model Management
• The NMF problem and lattice-subspaces
• Understanding artificial intelligence ethics and safety
• Privacy-Preserving Deep Visual Recognition: An Adversarial Learning Framework and A New Dataset
• The complexity of the vertex-minor problem
• Hysteresis, neural avalanches and critical behaviour near a first-order transition of a spiking neural network
• Parameterized Structured Pruning for Deep Neural Networks
• Detection and Correction of Cardiac MR Motion Artefacts during Reconstruction from K-space
• Model Order Reduction by Proper Orthogonal Decomposition
• Global optimization using Sobol indices
• Vispi: Automatic Visual Perception and Interpretation of Chest X-rays
• On the joint distribution of cyclic valleys and excedances over conjugacy classes of $\mathfrak{S}_{n}$
• Voronoi conjecture for five-dimensional parallelohedra
• Tackling Partial Domain Adaptation with Self-Supervision
• Manifold Graph with Learned Prototypes for Semi-Supervised Image Classification
• Towards Real-Time Head Pose Estimation: Exploring Parameter-Reduced Residual Networks on In-the-wild Datasets
• Model-Free Practical Cooperative Control for Diffusively Coupled Systems
• Sorted Top-k in Rounds
• Unsupervised Monocular Depth and Ego-motion Learning with Structure and Semantics
• Is Deep Learning an RG Flow?
• A Multiscale Visualization of Attention in the Transformer Model
• Monotonic Infinite Lookback Attention for Simultaneous Machine Translation
• Choosing agile or plan-driven enterprise resource planning (ERP) implementations — A study on 21 implementations from 20 companies
• A Model to Search for Synthesizable Molecules
• Estimation of the Shapley value by ergodic sampling
• Continual and Multi-Task Architecture Search
• Flying far and fast: the distribution of distant hypervelocity star candidates from Gaia DR2 data
• Handwritten Text Segmentation via End-to-End Learning of Convolutional Neural Network
• Nonparametric Identification and Estimation with Independent, Discrete Instruments
• Reinforcement Knowledge Graph Reasoning for Explainable Recommendation
• Understanding Vulnerability of Communities in Complex Networks
• When to use parametric models in reinforcement learning?
• A Bayesian Hierarchical Model for Evaluating Forensic Footwear Evidence
• Tensor train optimization for mathematical model of social networks
• Bootstrapping Upper Confidence Bound
• Multitask Learning for Network Traffic Classification
• Search on the Replay Buffer: Bridging Planning and Reinforcement Learning
• A Simple Text Mining Approach for Ranking Pairwise Associations in Biomedical Applications
• Higher extensions for gentle algebras
• Image-Adaptive GAN based Reconstruction
• LAEO-Net: revisiting people Looking At Each Other in videos
• Multicolor Ramsey numbers of cycles in Gallai colorings
• The Tandem Duplication Distance is NP-hard
• Visual Wake Words Dataset
• Developing an improved Crystal Graph Convolutional Neural Network framework for accelerated materials discovery
• Differential Imaging Forensics
• Artificial Intelligence Enabled Material Behavior Prediction
• Does Learning Require Memorization? A Short Tale about a Long Tail
• Presence-Only Geographical Priors for Fine-Grained Image Classification
• Critical Point Finding with Newton-MR by Analogy to Computing Square Roots
• Efficient Exploration via State Marginal Matching
• Keeping Notes: Conditional Natural Language Generation with a Scratchpad Mechanism
• Matrix Mittag–Leffler distributions and modeling heavy-tailed risks
• MOPED: Efficient priors for scalable variational inference in Bayesian deep neural networks
• Neural Network Models for Stock Selection Based on Fundamental Analysis
• Equality and difference of quenched and averaged large deviation rate functions for random walks in random environments without ballisticity
• Sub-Goal Trees — a Framework for Goal-Directed Trajectory Prediction and Optimization
• HPLFlowNet: Hierarchical Permutohedral Lattice FlowNet for Scene Flow Estimation on Large-scale Point Clouds
• Identifying and Predicting Parkinson’s Disease Subtypes through Trajectory Clustering via Bipartite Networks
• The effect of dead time on randomly sampled power spectral estimates
• Reduction of noise and bias in randomly sampled power spectra
• Optimizing Redundancy Levels in Master-Worker Compute Clusters for Straggler Mitigation
• Optimal low rank tensor recovery
• Permutation-based uncertainty quantification about a mixing distribution
• Data Conversion in Area-Constrained Applications: the Wireless Network-on-Chip Case
• Uncovering Dominant Social Class in Neighborhoods through Building Footprints: A Case Study of Residential Zones in Massachusetts using Computer Vision
• Conditional Monte Carlo for Reaction Networks
• GANPOP: Generative Adversarial Network Prediction of Optical Properties from Single Snapshot Wide-field Images
• Opportunistic Beamforming in Wireless Network-on-Chip
• Competing Bandits in Matching Markets
• Work Design and Job Rotation in Software Engineering: Results from an Industrial Study
• Linking geospatial data with Geo-L — analysis and experiments of big data readiness of common technologies
• A steady-state stability analysis of uniform synchronous power grid topologies
• Brouwer’s conjecture holds asymptotically almost surely
• Modeling functional resting-state brain networks through neural message passing on the human connectome
• Neural Graph Evolution: Towards Efficient Automatic Robot Design
• The Herbarium Challenge 2019 Dataset
• E3: Entailment-driven Extracting and Editing for Conversational Machine Reading
• Meta-Learning via Learned Loss
• Eye Contact Correction using Deep Neural Networks
• Compositional generalization through meta sequence-to-sequence learning
• Loop Programming Practices that Simplify Quicksort Implementations
• Assisted Excitation of Activations: A Learning Technique to Improve Object Detectors
• Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian
• Neural Arabic Question Answering
• Nonintrusive proper generalised decomposition for parametrised incompressible flow problems in OpenFOAM
• Topology-Preserving Deep Image Segmentation
• Analyzing the Limitations of Cross-lingual Word Embedding Mappings
• A Countrywide Traffic Accident Dataset
• A Joint Graph Based Coding Scheme for the Unsourced Random Access Gaussian Channel
• Flexible Modeling of Diversity with Strongly Log-Concave Distributions
• Fast, reliable and unrestricted iterative computation of Gauss–Hermite and Gauss–Laguerre quadratures
• Synthetic QA Corpora Generation with Roundtrip Consistency
• Efficient Evaluation-Time Uncertainty Estimation by Improved Distillation
• From asymptotic properties of general point processes to the ranking of financial agents
• Memory Augmented Neural Network Adaptive Controller for Strict Feedback Nonlinear Systems
• Lower Bounds for the Happy Coloring Problems
• Copulas as High-Dimensional Generative Models: Vine Copula Autoencoders
• Folding Bilateral Backstepping Output-Feedback Control Design For an Unstable Parabolic PDE
• Jacobian Policy Optimizations
• CoopSubNet: Cooperating Subnetwork for Data-Driven Regularization of Deep Networks under Limited Training Budgets
• Factors for the Generalisation of Identity Relations by Neural Networks
• N-dimensional Heisenberg’s uncertainty principle for fractional Fourier transform
• Coordinated Path Following Control of Fixed-wing Unmanned Aerial Vehicles
• Money Cannot Buy Everything: Trading Infinite Location Data Streams with Bounded Individual Privacy Loss
• Fixed-Parameter Tractability of Graph Deletion Problems over Data Streams
• Efficiency of maximum likelihood estimation for a multinomial distribution with known probability sums
• Near-Optimal Glimpse Sequences for Improved Hard Attention Neural Network Training
• Combinatorially equivalent hyperplane arrangements
• Figurative Usage Detection of Symptom Words to Improve Personal Health Mention Detection
• A Comparison of Word-based and Context-based Representations for Classification Problems in Health Informatics
• On Feasibility and Flexibility Operating Regions of Virtual Power Plants and TSO/DSO interfaces
• Selective prediction-set models with coverage guarantees
• Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets
On June 6th, our team hosted a live webinar—Managing the Complete Machine Learning Lifecycle: What’s new with MLflow—with Clemens Mewald, Director of Product Management at Databricks.
Machine learning development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models.
To solve for these challenges, last June, we unveiled MLflow, an open source platform to manage the complete machine learning lifecycle. Most recently, we announced the General Availability of Managed MLflow on Databricks and the MLflow 1.0 Release.
In this webinar, we reviewed new and existing MLflow capabilities spanning experiment tracking, project packaging, and model deployment.
We demonstrated these concepts using notebooks and tutorials from our public documentation so that you can practice at your own pace. If you’d like free access to the Databricks Unified Analytics Platform to try our notebooks on it, you can access a free trial here.
Toward the end, we held a Q&A and below are the questions and answers.
Q: Apart from the trouble of all the set-up, are there any missing features/disadvantages of using MLflow on-premises rather than in the cloud on Databricks?
Databricks is very committed to the open source community. Our founders are the original creators of Apache Spark™ – a widely adopted open source unified analytics engine – and our company still actively maintains and contributes to the open source Spark code. Similarly, for both Delta Lake and MLflow, we’re equally committed to helping the open source community benefit from these products, as well as providing an out-of-the-box managed version of these products.
When we think about features to provide on the open source or the managed version of Delta Lake or MLflow, we don’t think about whether we should hold back a feature on a version or another. We think about what additional features we can provide that only make sense in a hosted and managed version for enterprise users. Therefore, all the benefits you get from managed MLflow on Databricks are that you don’t need to worry about the setup, managing the servers, and all these integrations with the Databricks Unified Analytics Platform that makes it seamlessly work with the rest of the workflow. Visit http://databricks.com/mlflow to learn more.
Q: Does MLflow 1.0 support Windows?
Yes, we added support to run the MLflow client on Windows. Please see our release notes here.
Q: Does MLflow complement or compete with TensorFlow?
It’s a perfect complement. You can train TensorFlow models and log the metrics and models with MLflow.
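For readers working in R rather than Python, the open source mlflow package exposes the same tracking calls; here is a minimal sketch (the parameter and metric names are placeholders, and the webinar demos themselves used Python notebooks):

library(mlflow)
mlflow_start_run()
mlflow_log_param("optimizer", "adam")    # hypothetical hyperparameter
# ... train your TensorFlow/Keras model here ...
mlflow_log_metric("val_accuracy", 0.93)  # hypothetical metric value
mlflow_end_run()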
Q: How many different metrics can we track using MLflow? Are there any restrictions imposed on it?
MLflow doesn’t impose any limits on the number of metrics you can track. The only limitations are in the backend that is used to store those metrics.
Q: How do you parallelize model training with MLflow?
MLflow is agnostic to the ML framework you use to train the model. If you use TensorFlow or PyTorch, you can distribute your training jobs with, for example, HorovodRunner, and use MLflow to log your experiments, runs, and models.
Q: Is there a way to bulk extract the MLflow info to perform operational analytics (e.g., how many training runs there were in the last quarter, how many people are training models, etc.)?
We are working on a way to more easily extract the MLflow tracking metadata into a format that you can do data science with, e.g., a pandas DataFrame.
Q: Is it possible to train and build an MLflow model using one platform (e.g., Databricks using TensorFlow with PySpark) and then reuse that MLflow model in another platform (for example, in R using RStudio) to score any input?
The MLflow Model format and abstraction allow using any MLflow model from anywhere you can load it. E.g., you can use the Python function flavor to call the model from any Python library, or the R function flavor to call it as an R function. MLflow doesn’t rewrite the models into a new format, but you can always expose an MLflow model as a REST endpoint and then call it in a language-agnostic way.
Q: To serve a model, what are the options to deploy outside of Databricks, e.g., SageMaker? Do you have any plans to deploy as AWS Lambdas?
We provide several ways you can deploy MLflow models, including Amazon SageMaker, Microsoft Azure ML, Docker containers, Spark UDFs, and more; see this page for a list. To give one example of how to use MLflow models with AWS Lambda, you can use the Python function flavor, which enables you to call the model from anywhere you can call a Python function.
Q: Can MLflow be used with Python programs outside of Databricks?
Yes, MLflow is an open source product and can be found on GitHub and PyPI.
Q: What is the pricing model for Databricks?
Please see https://databricks.com/product/pricing
Q: Hi, how do you see MLflow evolving in relation to Airflow?
We are looking into ways to support multi-step workflows. One way we could do this is by using Airflow. We haven’t made these decisions yet.
Q: Any suggestions for deploying multi-step models, for example an ensemble of several base models?
Right now you can deploy those as MLflow models by writing code to ensemble other models, e.g., similar to how the multi-step workflow example is implemented.
Q: Does MLflow provide a framework to do feature engineering on data?
Not specifically, but you can use any other framework together with MLflow.
To get started with MLflow, follow the instructions at mlflow.org or check out the release code on GitHub. We’ve also recently created a Slack channel for MLflow for real-time questions, and you can follow @MLflowOrg on Twitter. We are excited to hear your feedback!
Also: Data Science Jobs Report 2019; Harvard CS109 #DataScience Course, Resources #Free and Online; Google launches TensorFlow; Mastering SQL for Data Science
zoNNscan
The training of deep neural network classifiers results in decision boundaries whose geometry is still not well understood. This is in direct relation with classification problems such as so-called adversarial examples. We introduce zoNNscan, an index that is intended to inform on the boundary uncertainty (in terms of the presence of other classes) around one given input datapoint. It is based on confidence entropy, and is implemented through sampling in the multidimensional ball surrounding that input. We detail the zoNNscan index, give an algorithm for approximating it, and finally illustrate its benefits on four applications, including two important problems for the adoption of deep networks in critical systems: adversarial examples and corner case inputs. We highlight that zoNNscan exhibits significantly higher values in those two problem classes than for standard inputs. …
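Based on the description above (confidence entropy averaged over samples drawn from a ball around the input), a back-of-the-envelope version is easy to write. This sketch in R is our reading of the idea, not the authors’ implementation; predict_proba stands in for whatever function returns your classifier’s class probabilities:

zonnscan <- function(predict_proba, x0, radius = 0.1, n_samples = 1000) {
  d <- length(x0)
  mean(replicate(n_samples, {
    u <- rnorm(d); u <- u / sqrt(sum(u^2))   # random direction on the sphere
    r <- radius * runif(1)^(1/d)             # uniform radius within the d-ball
    p <- pmax(predict_proba(x0 + r * u), 1e-12)
    -sum(p * log(p)) / log(length(p))        # normalized confidence entropy
  }))
}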
Compositional Coding for Collaborative Filtering
Efficiency is crucial to online recommender systems. Representing users and items as binary vectors for Collaborative Filtering (CF) can achieve fast user-item affinity computation in the Hamming space, and in recent years we have witnessed an emerging research effort in exploiting binary hashing techniques for CF methods. However, CF with binary codes naturally suffers from low accuracy due to limited representation capability in each bit, which impedes it from modeling the complex structure of the data. In this work, we attempt to improve the efficiency without hurting the model performance by utilizing both the accuracy of real-valued vectors and the efficiency of binary codes to represent users/items. In particular, we propose the Compositional Coding for Collaborative Filtering (CCCF) framework, which not only gains better recommendation efficiency than the state-of-the-art binarized CF approaches but also achieves even higher accuracy than the real-valued CF method. Specifically, CCCF innovatively represents each user/item with a set of binary vectors, which are associated with a sparse real-valued weight vector. Each value of the weight vector encodes the importance of the corresponding binary vector to the user/item. The continuous weight vectors greatly enhance the representation capability of binary codes, and their sparsity guarantees the processing speed. Furthermore, an integer weight approximation scheme is proposed to further accelerate the speed. Based on the CCCF framework, we design an efficient discrete optimization algorithm to learn its parameters. Extensive experiments on three real-world datasets show that our method outperforms the state-of-the-art binarized CF methods (even achieves better performance than the real-valued CF method) by a large margin in terms of both recommendation accuracy and efficiency. …
Coarse-to-Fine Network (C2F Net)
Deep neural networks have seen tremendous success for different modalities of data including images, videos, and speech. This success has led to their deployment in mobile and embedded systems for real-time applications. However, making repeated inferences using deep networks on embedded systems poses significant challenges due to constrained resources (e.g., energy and computing power). To address these challenges, we develop a principled co-design approach. Building on prior work, we develop a formalism referred to as Coarse-to-Fine Networks (C2F Nets) that allow us to employ classifiers of varying complexity to make predictions. We propose a principled optimization algorithm to automatically configure C2F Nets for a specified trade-off between accuracy and energy consumption for inference. The key idea is to select a classifier on-the-fly whose complexity is proportional to the hardness of the input example: simple classifiers for easy inputs and complex classifiers for hard inputs. We perform comprehensive experimental evaluation using four different C2F Net architectures on multiple real-world image classification tasks. Our results show that optimized C2F Net can reduce the Energy Delay Product (EDP) by 27 to 60 percent with no loss in accuracy when compared to the baseline solution, where all predictions are made using the most complex classifier in C2F Net. …
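The key idea quoted above (pick a classifier on-the-fly whose complexity matches the input’s hardness) is essentially an early-exit cascade. Here is a generic two-stage sketch in R, with cheap and costly standing in for any pair of classifiers returning class probabilities; it illustrates the selection rule only, not the paper’s optimization-based configuration of C2F Nets:

c2f_predict <- function(x, cheap, costly, threshold = 0.9) {
  p <- cheap(x)
  # exit early when the simple classifier is confident; otherwise escalate
  if (max(p) >= threshold) which.max(p) else which.max(costly(x))
}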
Warping Resilient Time Series Embedding
Time series are ubiquitous in real world problems and computing distance between two time series is often required in several learning tasks. Computing similarity between time series by ignoring variations in speed or warping is often encountered and dynamic time warping (DTW) is the state of the art. However, DTW is not applicable in algorithms which require kernels or vectors. In this paper, we propose a mechanism named WaRTEm to generate vector embeddings of time series such that distance measures in the embedding space exhibit resilience to warping. Therefore, WaRTEm is more widely applicable than DTW. WaRTEm is based on a twin auto-encoder architecture and a training strategy involving warping operators for generating warping resilient embeddings for time series datasets. We evaluated the performance of WaRTEm and observed more than 20% improvement over DTW in multiple real-world datasets. …
This is a book review. It is by Phil Price. It is not by Andrew.
The book is Good To Go: What the athlete in all of us can learn from the strange science of recovery. By Christie Aschwanden, published by W.W. Norton and Company. The publisher offered a copy to Andrew to review, and Andrew offered it to me as this blog’s unofficial sports correspondent.
tldr: This book argues persuasively that when it comes to optimizing the recovery portion of the exercise-recover-exercise cycle, nobody knows nuthin’ and most people who claim to know sumthin’ are wrong. It’s easy to read and has some nice anecdotes. Worth reading if you have a special interest in the subject, otherwise not. Full review follows.
The book is about ‘recovery’. In the context of the book, recovery is what you do between bouts of exercise; or, if you prefer, exercise is what you do between periods of recovery. The book has great blurbs. “A tour de force of great science journalism”, writes Nate Silver (!). “…a definitive tour through a bewildering jungle of scientific and pseudoscientific claims…”, writes David Epstein. “…Aschwanden makes the mind-boggling world of sports recovery a hilarious adventure”, says Olympic gold medal skier Jessie Diggins. With blurbs like these I was expecting a lot…although once I realized Aschwanden works at FiveThirtyEight, I downweighted the Silver blurb appropriately. Even so, I expected too much: the book is fine but ultimately rather unsatisfying. It is fairly interesting and sometimes amusing, but there’s only so much any author can do with the subject given the current state of knowledge, which is this: other than getting enough sleep and eating enough calories, nobody knows for sure what helps athletes recover between events or training sessions better than just living a normal life. The book is mostly just 300 pages of elucidating and amplifying that disappointing state of knowledge.
The author, Aschwanden, went to a lot of trouble, conducting hundreds of interviews, reading hundreds of scientific or quasi-scientific or pseudo-scientific papers, and in some cases subjecting herself to treatments in the interest of journalism (a sensory deprivation tank! Tom Brady’s magic pajamas! A cryogenic chamber!…) If the subject of athletic recovery is especially interesting to you then hey, it’s a fine book, plenty of good stuff in there, $30 well spent for two or three hours of information and amusement.
For readers of this blog — and maybe for everybody — the first couple of chapters are the best ones, because they provide some insights that can apply to many areas of science and statistical analysis. The first chapter explains what happened when Aschwanden became interested in whether beer is good, bad, or indifferent as a ‘recovery drink.’ She has a friend who was a researcher at a lab that studies human performance, and when she brought the question to him he was enthusiastic about studying this issue, so they did. They designed and performed a study that is typical (all too typical) of studies that address this kind of issue: only 10 participants, with tests spanning a couple of days. Do some hard exercise, then drink regular beer or non-alcoholic beer. The next day “run to exhaustion” (following a standard protocol) and afterwards drink whichever beverage you didn’t drink the previous day. The next day, run to exhaustion again. Quantify the time to run to exhaustion at the specified level of effort. The study found no ‘statistically significant’ difference between real beer and fake beer for the participants as a whole, or for male participants, but for women there was a statistically significant difference, with performance better after real beer! And for men there was a difference large enough to be substantively important if true, but not statistically significant. Fortunately, Aschwanden is no dummy. She doesn’t mention the ‘garden of forking paths’, but does recognize some other major methodological problems with the study. As she puts it: “There was only one problem: I didn’t believe it. Trust me — I wanted our study to show that beer was great for runners, really, I did. Yet my experience as a participant… left me feeling skeptical of our result, and the episode helped me understand and recognize some pitfalls that I’ve found to be common among sports performance studies.” And then she gives a few paragraphs that do a great job of illustrating why it is really hard to get objective measures of human performance for a study like this, and why it matters. The upshot is that in this study the researchers are fitting noise. And the problems that came up in this study are common, indeed nearly ubiquitous, in this sort of research. Disappointingly, even this chapter doesn’t show any data or any hard numbers. There’s not a plot or table in the book.
The second chapter discusses hydration (and over-hydration), starting off with a discussion of the creation and marketing of Gatorade and going on from there. As with every chapter, Aschwanden mixes anecdotes, history, and results from scientific studies, and pulls everything together with her own evaluation. It’s a good formula and makes for a readable book. The hydration chapter is typical in that it illustrates the extent to which marketing and a smattering of scientific research led to a widespread perception among athletes that later turned out either not to be true or to be more nuanced than was first thought. In fact, according to Aschwanden and backed up by many studies she cites, in contrast to what many athletes and coaches have believed over the past thirty years or so, our bodies can tolerate moderate dehydration with very little problem, and optimal hydration for many athletes and many activities turns out to involve a lot less drinking than most people (including most athletes and coaches) thought for decades. And it’s probably better to be rather dehydrated than to be rather over-hydrated.
I can’t resist adding my own little hydration story. A couple of years ago, on a very hot day I rode my bike on a hilly route to our local mountain (Mount Diablo), rode up it and back down, stopped at the bottom for food, and then rode back home. The ride was about 100 miles and the temperature was in the high nineties. Each time I stopped for water, I filled and chugged one of my water bottles, then filled both of them and continued on, draining both bottles by the time I got to the next water stop. Knowing the capacity of my bottles and the number of times I stopped, it’s easy to count how much I drank that day. I also had a large milkshake and a coke at my lunch stop, as well as something like a pound of food. On that day I drank 17 pounds of fluid. I weighed myself when I got home and found that I had lost 8 pounds. I had not urinated during the day, and didn’t do so for several hours after I got home. What’s the point of telling you this? I dunno; I just think it’s really interesting. In one long day I sweated or exhaled more than 25 pounds of water! I still find it hard to believe… although it does jibe with one of Gatorade’s early marketing campaigns, which promoted the idea that athletes should drink 40 ounces per hour, and not necessarily on a brutally hot day. But Aschwanden has both anecdotes and studies in which successful athletes drank much less, and stories about some athletes getting into bad medical trouble by drinking too much. The point isn’t that endurance athletes shouldn’t drink, it’s that they shouldn’t obsess about drinking as long as they don’t get too thirsty. Aschwanden says it has long been conventional wisdom that in an athletic event you should drink before you’re thirsty, and drink enough that you never become thirsty, but there’s actually no evidence that that leads to better performance than simply drinking when you feel like it.
Another chapter covers the current fad for ice baths, cryogenic chambers, ice-water compression boots, and so on. No real evidence they help, no real evidence they hurt.
Another chapter covers the current fad for infrared treatments (heat baths, saunas, ‘infrared’ saunas, Tom Brady’s magic thermal underwear, etc.) No real evidence they help, no real evidence they hurt. Oh, and not only have the claims about thermal underwear not been evaluated by the Food and Drug Administration, they’ve apparently never been evaluated by a physicist either, because they’re ridiculous. If you buy the underwear you deserve to be mocked, and you should be. If no one else will do it for you, send me an email and I’ll do the mocking.
Massage? No real evidence it helps, no real evidence it hurts. That said, I intend to continue to get occasional massages from my next door neighbor, Cyrus Poitier, who is an elite sports masseur. He travels with the men’s national wrestling team and the women’s swim team, and is one of the US Olympic Team’s masseurs. Like most of Cyrus’s clients, I don’t go to Cyrus for feel-good massages — in fact they are usually quite painful — but instead I go when I have some soreness or tightness that I haven’t been able to get rid of on my own, and I do think his massages help. But do they really, in the sense of helping me perform better athletically, and, if so, how much? According to Aschwanden there’s no evidence, or only weak evidence, that they help at all. But I would swear they help me! And he has many elite athletes as clients. So are all of us wrong? Well, maybe we are, or maybe we’re right that the massages help but the effect is rather small. Or maybe they help the performance of those of us with some musculo-skeletal issues but harm the performance of people with other issues. The right way to answer this is with data, and according to Aschwanden the existing data aren’t adequate to the task.
Every ‘recovery modality’ in the book has a bunch of proponents, including some elite athletes who swear by it. Every one of the modalities has a bunch of individuals or companies promoting it and telling people it works, usually buttressed by questionable studies like Aschwanden’s beer study. And just about every one of the recovery methods or substances has some skeptics who think it’s all hype.
And ultimately that’s the problem with Aschwanden’s book, though it’s not her fault: at the moment it’s impossible to know what works, and how well. She says this herself, towards the end of the book: “After exploring a seemingly endless array of recovery aids, I’ve come to think of them as existing on a sort of evidence continuum. At one end you’ve got sleep — the most potent recovery tool ever recovered (and one that money can’t buy). At the other end lies a pile of faddish products like hydrogen water and oxygen inhalers, which an ounce of common sense can tell you are mostly useless… Most things, however, lie somewhere in the vast middle — promising but unproven.” For someone like me, that’s a good reason to ignore just about all of the unproven stuff: even if something would improve my performance fairly substantially — let’s say a 5% increase in speed on my hardest bike rides — that wouldn’t change my life in a noticeable way. But for a competitive athlete, even 0.5% could be the difference between a gold medal and being off the podium, or being a pro vs an amateur who never quite breaks through. So there are always going to be people promoting this stuff, and there will always be athletes willing to give it a try.
Although firm conclusions about effectiveness are hard to come by, there’s plenty of interesting stuff in the book. For example, one of the many anecdotes concerns sprinter Usain Bolt. At the 2008 Olympics in Beijing, Bolt wasn’t happy with any of the unfamiliar food available to him at the athletes’ cafeteria, so he went to McDonalds and ate Chicken McNuggets. Every day. For lunch and dinner. (He also ate a small amount of greens drenched in salad dressing). According to Bolt’s memoir, he ate about 100 nuggets every 24 hours, adding up to about 1000 chicken nuggets over the course of the ten days he competed in the 100m, 200m, and 4x100m relay (with multiple heats in each, plus the finals). He won gold medals in all of them. As Aschwanden says, “Those chicken nuggets were adequate, if not ideal, fuel to power him through his nine heats, and to help him recover his energy in between them. Feeling satiated and not worrying about gastrointestinal issues are surely worth a lot to an athlete preparing for his most important events of the season. Would Bolt have performed better eating some other recovery foods? Maybe. The better question is: How much difference would it make?”
By the way, a popular saying among the kind of people who read this blog is “The plural of ‘anecdote’ is not ‘data.'” I liked that saying too, the first time I heard it, but the more I think about it the less I agree with it. Of course it’s literally true that ‘data’ is not the plural of ‘anecdote’, since the plural of ‘anecdote’ is ‘anecdotes.’ But each (true) anecdote does provide a data point of sorts. A sprinter won three gold medals on a diet consisting almost entirely of Chicken McNuggets, and his 100m time was a world record even though he didn’t run all the way through the tape. That really does set an upper limit on how deleterious a week of Chicken McNugget consumption is, at least to Usain Bolt. As far as data go, that anecdote is probably more informative than the quantitative results of Aschwanden’s 10-participant beer study, no matter how carefully the study was conducted.
One of the good things about Aschwanden’s book is that she puts the pieces together for us. She’s smart, she’s a former elite athlete herself (a professional cross-country skier), she talked to hundreds of people, she read lots of scientific studies, and she formed well-informed beliefs about everything she writes about. Only a tiny portion of those interviews and studies can fit in the book, but I trust her judgment well enough to think I’d probably reach most of the same conclusions she did, so I appreciate the fact that she does summarize her beliefs. A few key ones are: (1) ‘recovery’ involves both mind and body, and stress of all kinds — physical, mental, and emotional — hurts recovery of both mind and body. (2) Sleep is especially important to recovery; relaxation is too. If an athlete’s recovery routine is itself a source of stress, it’s counterproductive. (3) Under-eating is bad, and is worse than eating non-optimally. (4) The timing of food intake is unimportant unless you have a short break between events. If you finish an event and you have another one in a few hours, eating the right thing at the right time is critical. But if you aren’t competing again for 24 hours or more, there is no ‘nutrition window’, there’s a nutrition ‘barn door’, in the words of one researcher she quotes. (5) Other than getting enough sleep and enough relaxation, and eating enough to replenish glycogen supplies and calories in time for your next event, nearly nothing else is definitively known to be beneficial compared to just living an ordinary life between events. (6) Overtraining is a real thing, with both physical and mental components, and overtraining can be worse than undertraining. (7) With regard to specific ‘recovery modalities’: Massage might or might not help; ice baths might or might not help (and in fact might harm recovery a little); various food supplements might or might not help; heat in various forms might or might not help; ibuprofen and other anti-inflammatories probably do a little physical harm in most people, but most athletes refuse to believe it; stretching probably doesn’t help most people. (8) Different things work differently for different people, so following the same recovery routine as your sports idol might not work for you. (9) Some recovery methods, maybe a lot of them, really do help some people simply due to the ‘placebo effect’, and there’s nothing wrong with that: if it helps, it helps.
If any of these points seem odd or wrong or questionable to you, then I suggest reading the book, because Aschwanden explains why she has adopted her viewpoint. If you agree with all of them but want support for them, that’s another reason to read the book. If you agree with them all, shrug, and say “yeah, that’s pretty much what I figured” then you can skip the book unless you are interested in stories like the one about Bolt.
At Databricks, we build platforms to enable data teams to solve the world’s toughest problems and we couldn’t do that without our wonderful Databricks Team. “Teamwork makes the dreamwork” is not only a central function of our product but it is also central to our culture. Learn more about Alexandra Cong, one of our Software Engineers, and what drew her to Databricks!
Tell us a little bit about yourself.
I’m a software engineer on the Identity and Access Management team. I joined Databricks almost 3 years ago after graduating from Caltech, and I’ve been here ever since!
What were you looking for in your next opportunity, and why did you choose Databricks?
Coming out of college, I was looking for a smaller company where I could not only learn and grow, but make an impact. As a math major, I didn’t have all of the software engineering basics, but interviewing at Databricks reassured me that as long as I was willing and excited to learn from the unknown, I could be successful. Being able to help solve a wide scope of challenges sounded really exciting, as opposed to being at a more established company, where they may have already solved a lot of their big problems. Finally, every person I met during my interviews at Databricks was not only extremely smart, but more importantly, humble and nice – which made me really excited to join the team!
What gets you excited to come to work every day?
It’s really important to me to be always learning and developing new skills. At Databricks, each team owns their services end-to-end and covers such a wide breadth that this is always the case. It’s an additional bonus that any feature you work on is mission-critical and will have a big impact – we don’t have the bandwidth to work on anything that isn’t!
One of our core values at Databricks is to be an owner. What is your most memorable experience at Databricks when you owned it?
I’m part of our diversity committee because I’m passionate about creating an inclusive and welcoming environment for everyone here. We recently sponsored an organization at UC Berkeley that runs a hackathon for under-resourced high school students. Databricks provided mentorship, sponsored prizes, and I got to teach students how to use Databricks to do their data analysis. It was really rewarding to give back to the community, see high school students get excited about coding and data, and be able to encourage even just a handful of students to study Computer Science.
What has been the biggest challenge you’ve faced, and what is a lesson you learned from it?
The biggest challenge I’ve faced so far has been overcoming the mental hurdles of growing into a senior software engineer role. Upon first understanding the expectations, I felt overwhelmed and the challenges seemed insurmountable, to the point where I became unmotivated and unhappy. Slowly I came to terms with the fact that I would have to take on uncomfortable tasks that would challenge me, and that I would inevitably make mistakes in the process. However, it was a necessary part of my growth and I would just have to tackle these challenges one at a time. This was difficult for me because I hate failing and would rather only do things when I know I will be successful. However, through this process, I’ve learned that I’ll grow so much more if I’m willing to make mistakes and learn from them.
Databricks has grown tremendously in the last few years. How do you see the future of Databricks evolving and what are you most excited to see us accomplish?
I see Databricks being used more and more by companies across many different domains. In an ideal world, Databricks will become the standard for doing data analysis. It might even be a qualification that data analysts list on their resumes! Of course, we have a lot of work to do if we want to get to that point, but I think the market opportunity is huge and I hope that we’ll be able to execute well enough to see that become a reality.
What advice would you give to women in tech who are starting their careers?
Advocate for yourself. This comes in various forms – negotiations, promotions, mentorship, leading projects, or even just talking with your manager about furthering your career growth. At times, I fell into the trap of assuming that my work would speak for itself, and that I didn’t need to do anything on top of that. I’ve since learned that even if it feels outside my comfort zone, I need to actively ask for more if and when I think I deserve it, because no one will be a better advocate for me than myself.
Want to work with Alexandra? Check out our Careers Page.
--
Try Databricks for free. Get started today.
The post Brickster Spotlight: Meet Alexandra appeared first on Databricks.
Octoparse is the ultimate tool for data extraction (web crawling, data crawling and data scraping), which lets you turn the whole internet into a structured format. The newly launched Web Scraping Template makes it very easy even for people with no technical training.
Even the most sophisticated data science organizations struggle to keep track of their data science projects. Data science leaders want to know, at any given moment, not just how many data science projects are in flight but what the latest updates and roadblocks are when it comes to model development and what projects need their immediate attention.
But while there is a legion of tools for individual data scientists, the needs of data science leaders have not been well served. For example, a VP of Analytics at a wealth management company recently told us he had to walk around the office, pen and notepad in hand, going from person to person, in order to get an actual count of projects in flight, because their traditional task tracking tools didn’t quite align with the workflow used by data science teams. It turned out that the final count was way off from the initial estimate provided to the CEO.
Data science leaders face a common set of challenges around visibility and governance:
Given the potential repercussions of inaccurate information (from mis-set expectations and funding mismatches to project delays), it didn’t surprise us that data science leaders packed the room at the Rev 2 Data Science Leaders Summit in New York for a live demo of our new “Control Center” functionalities designed specifically for them.
P.S. If you missed Rev this year, session presentations and recordings can be found here.
Last fall, we delivered the Domino Control Center, which gave IT stakeholders visibility into compute usage and spend. Today we are announcing a significant expansion of the Control Center with new features for data science leaders in Domino 3.5.
Domino 3.5 allows data science leaders to define their own data science project life cycle. A new addition to the Control Center, Projects Portfolio Dashboard, allows data science leaders to easily track and manage projects with a holistic understanding of the latest developments. It also surfaces projects that need immediate attention in real time by showing the projects that are blocked.
A data science leader can start their day in the Project Portfolio Dashboard, which shows a summary of in-flight projects broken down by configurable life cycle stages with immediate status update of all projects.
Every organization has its own data science life cycle that meets its business needs. In Domino 3.5, we enable data science leaders and managers to define their own project life cycle and implement it within their teams.
Data scientists can update their project stages as they progress through the lifecycle, which notifies their collaborators via email.
Project owners and contributors can use the project stage menu to flag a project as blocked with a description of the blocker. Once resolved, the project can be unblocked. On the flip side, when data scientists mark a project as complete with a description of the project conclusion, Domino also captures this metadata for project tracking and future reference. All of this captured metadata can be useful for organizational learning, for organizing projects, and for helping to avoid similar issues in the future.
All of this information powers Domino’s new Projects Portfolio Dashboard. Data science leads can click through to gain more context on any of the in-flight projects and discover blocked projects that need attention.
In the hypothetical project below, our Chief Data Scientist Josh sees that one of the blocked projects is Avinash and Niole’s Customer Churn project. Although he doesn’t recall the details of this project, he can see that it is in the R&D phase and has a hard stop in a few weeks. Diving into the project, he can see that the remaining goal is to get a classification model with AUC above 0.8.
Josh can turn to the Activity Feed to get details on the blocker and its causes, and suggest a course of action. In this example, he will ask the Customer Churn team to try a deep neural net. He can tag Sushmitha, a deep learning expert working on another team, and ask her to mentor this effort.
Managing projects, tracking production assets, and monitoring organizational health require new tools. These unique features were custom-built for data science leaders, and at Domino we are excited to see the benefits they bring as you use them with your teams.
All of this is just some of what’s new in Domino; we also have a few other enhancements to our existing features in the 3.5 release. For example, the Activity Feed has been enhanced to show a preview of the files that are being commented on. It also shows project stage updates and whether any blockers have been raised by collaborators. Users can also filter by the type of activity. This, combined with email notifications, will ensure situational awareness of the projects at all times.
Domino 3.5 offers the option for users to create large Dataset Snapshots directly from data sitting on their computers. The upload limits on the CLI have been increased to 50 GB and up to 50,000 files. With the same upload limits, users can also upload files directly through the browser. The CLI and browser uploads offer a seamless way to migrate data on your laptop into a single place for data science work. Teams can leverage shared, curated data, eliminate potentially redundant data wrangling work, and ensure fully reproducible experiments.
To complement the new Control Center features for data science leaders, we are also launching user activity analysis enhancements that facilitate license reporting and compliance. They offer a detailed view of the level of Domino activity for each team member so that data science and IT leaders can manage their allocation of Domino licenses and have visibility and predictability for their costs. Domino administrators can quickly identify active and inactive users and decide whether they should be allocated a license. The ability to track user activity and growth during budget planning and contract renewal makes it much easier to plan for future spending.
In addition to the exciting new features for data science leaders, we are also launching a new Trial Environment to make Domino more accessible. It’s perfect for those who want to try Domino out and evaluate whether it would be useful to their work. The new features in this latest release will be in our trial environment too! This is a quick and easy way to get access to Domino and start experiencing the secret sauce that companies like Dealer Tire and Redhat leverage in their data science organizations.
More detailed release notes for Domino 3.5 can be found here. Domino 3.5 is currently generally available – be sure to check out our 3.5 release webinar or try Domino to see the latest platform capabilities.
Learn how to apply Python data science libraries to develop a simple optimization problem based on a Nobel-prize winning economic theory for maximizing investment profits while minimizing risk.
Seeking an individual passionate about undertaking research in an area of Information Technology (IT), namely one of Software Engineering and Cybersecurity, Machine Learning and Artificial Intelligence, or Human-Centred Computing, as well as multidisciplinary research through the application of IT to problems, who will be responsible for conducting research in this area of IT.
In the biggest crossover event of the century, Tom Lum used the Wikipedia API to chart the number of views for every reference in Billy Joel’s We Didn’t Start the Fire. Yes. [via @waxpancake]
Tags: Billy Joel, humor, Wikipedia
The MIT Technology Review has named Facebook AI Research Scientist Noam Brown one of this year’s Innovators Under 35 for his research on AI and games. The award recognizes talented young innovators by region for their potential to transform the world through their contributions to science and technology. Previous winners have included Mark Zuckerberg in 2007 and Google cofounders Larry Page and Sergey Brin in 2002.
Noam Brown is best known for his work on the poker-playing AI system Libratus, which he developed at Carnegie Mellon University with his PhD adviser Tuomas Sandholm in 2017. Libratus was the first AI to defeat top poker players in two-player no-limit Texas Hold’em.
Brown took a moment to share thoughts about his past, current, and future research on game-playing bots, as well as the possible applications of his research. He also shared advice to anyone looking to pursue research in AI and game theory.
Q: Describe some key highlights from your research work. What projects are you proudest of?
Noam Brown: The project I am proudest of is definitely Libratus, which beat top human poker professionals in two-player no-limit poker. This was a challenging problem that had existed for decades. Poker involves hidden information, which makes it resistant to other AI techniques that were successful in games like chess and Go. The first four years of my PhD were focused on figuring out how to crack this problem by building off of decades of research from previous researchers in the field. Eventually it became clear to me what the path to victory was, but it still took another year of implementation to get it all working and to actually beat top humans in the game. It was very rewarding to see all the pieces come together into an actual agent that could beat top humans.
Q: What led you to focus on AI and game theory?
NB: My original goal was to get a PhD in economics, but after spending a couple of years working in the economics research field, I realized I also wanted to build things. Economics doesn’t provide that opportunity as much as computer science, and especially AI, does. I had always been interested in both AI and game theory, and the economics angle of game theory seemed like a natural fit for me given my background.
Q: What does research in this area look like five years from now? What problems still need to be solved?
NB: There has been tremendous progress in recent years on AI for purely adversarial zero-sum games like checkers, chess, Go, poker, Starcraft, and Dota 2. But the real world isn’t zero-sum. Researchers still don’t know how to tackle AI for partly cooperative and partly adversarial settings, like negotiations. The state of the art in this area is way behind human performance. I think this will be a major area of AI research in the next five years, and it is an area that can have tremendous real-world impact.
Q: How does research on game-playing bots tie to real-world applications? How does it tie to other fields of AI research?
NB: While Libratus plays poker, the techniques are not limited to poker. Poker is just a benchmark that allows us to compare the performance of these techniques with the peak of human ability. That’s true for other AI milestones in games as well. The research I’m doing is really about developing AI techniques that can handle strategic reasoning and hidden information in multi-agent settings. This is very important because most real-world strategic interactions involve some amount of hidden information. If an AI agent is to act and to help people in the real world, it must be able to cope with hidden information.
Q: What surprised you most about how the research in this field has evolved? What’s been harder or easier than you might have expected?
NB: As with most research, it was very hard to predict what the “magic ingredient” would be that would lead to superhuman performance. My early PhD research was focused on techniques that seemed like good ideas at the time but ultimately didn’t make a huge difference in performance. But if you keep making good shots, eventually you’ll score a goal.
Q: What would you say to other AI researchers or students who are considering focusing on game theory and AI?
NB: This is a very exciting time to be in this research area. Research on imperfect-information games was historically a bit outside of the mainstream of AI, but recent results have shown convincingly that it holds answers to questions that have vexed AI researchers for decades. This is an underexplored field with a lot left to be done. But most important, I think the key to doing good research is loving what you do.
The post Q&A with 2019 Innovator Under 35 Noam Brown appeared first on Facebook Research.
Distributed learning and random projections are the most common techniques in large scale nonparametric statistical learning. In this paper, we study the generalization properties of kernel ridge regression using both distributed methods and random features. Theoretical analysis shows the combination remarkably reduces computational cost while preserving the optimal generalization accuracy under standard assumptions. In a benign case, $O(\sqrt{N})$ partitions and $O(\sqrt{N})$ random features are sufficient to achieve an $O(1/N)$ learning rate, where $N$ is the labeled sample size. Further, we derive more refined results by using additional unlabeled data to enlarge the number of partitions and by generating features in a data-dependent way to reduce the number of random features.
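For intuition, here is a minimal self-contained sketch (ours, not the paper's code) of the divide-and-conquer combination the abstract describes: kernel ridge regression solved per partition on random Fourier features, with the per-partition coefficient vectors averaged. The data, partition count, feature count, and regularization constant are illustrative assumptions:

import numpy as np

def rff(X, W, b):
    # random Fourier features approximating a Gaussian kernel
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
N, d = 10_000, 5
m = D = int(np.sqrt(N))      # ~sqrt(N) partitions and ~sqrt(N) random features
lam = 1e-3                   # ridge regularization (illustrative)
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

W = rng.normal(size=(d, D))  # feature map shared across all partitions
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

coefs = []
for Xp, yp in zip(np.array_split(X, m), np.array_split(y, m)):
    Z = rff(Xp, W, b)        # features for this partition
    coefs.append(np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ yp))
w_bar = np.mean(coefs, axis=0)  # average the per-partition ridge solutions

print(np.mean((rff(X, W, b) @ w_bar - y) ** 2))  # in-sample error of the averaged model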
There is interest in converting relational query languages (that work both over SQL databases and on local data) into data.table commands, to take advantage of data.table's superior performance. Obviously, if one wants to use data.table it is best to learn data.table. But if we want code that can run in multiple places, a translation layer may be in order.
In this note we look at how this translation is commonly done.
The dtplyr developers recently announced they are making changes to dtplyr to support two operation modes:
Note that there are two ways to use dtplyr:
- Eagerly [WIP]. When you use a dplyr verb directly on a data.table object, it eagerly converts the dplyr code to data.table code, runs it, and returns a new data.table. This is not very efficient because it can't take advantage of many of data.table's best features.
- Lazily. In this form, triggered by using lazy_dt(), no computation is performed until you explicitly request it with as.data.table(), as.data.frame() or as_tibble(). This allows dtplyr to inspect the full sequence of operations to figure out the best translation.
(reference, and recently completely deleted)
This is a bit confusing, but we can unroll it a bit.
dplyr (and later dtplyr) has always eagerly converted dplyr pipelines into data.table realizations; this has been dplyr's strategy since the first released version of dplyr (version 0.1.1, 2014-01-29). The announced lazy mode instead collects the full sequence of operations before translating them into data.table. Our own rqdatatable package has been calling data.table this way for over a year (ref). It is very odd that dplyr didn't use this good strategy for the data.table adaptor, as it is the strategy dplyr uses in its SQL adaptor.
Let’s take a look at the current published version of dtplyr (0.0.3) and how its eager evaluation works. Consider the following four trivial functions, each of which adds one to a data.frame column multiple times.
library(dplyr)       # for mutate()
library(data.table)  # for as.data.table()
nstep <- 1000        # number of repeated column updates, as used in the timings below

base_r_fn <- function(df) {
dt <- df
for(i in seq_len(nstep)) {
dt$x1 <- dt$x1 + 1
}
dt
}
dplyr_fn <- function(df) {
dt <- df
for(i in seq_len(nstep)) {
dt <- mutate(dt, x1 = x1 + 1)
}
dt
}
dtplyr_fn <- function(df) {
dt <- as.data.table(df)
for(i in seq_len(nstep)) {
dt <- mutate(dt, x1 = x1 + 1)
}
dt
}
data.table_fn <- function(df) {
dt <- as.data.table(df)
for(i in seq_len(nstep)) {
dt[, x1 := x1 + 1]
}
dt[]
}
base_r_fn() is idiomatic R code, dplyr_fn() is idiomatic dplyr code, dtplyr_fn() is idiomatic dplyr code operating over a data.table object (hence using dtplyr), and data.table_fn() is idiomatic data.table code.
When we time all of these functions operating on a 100,000 row by 100 column data frame for 1,000 steps, we see the following mean completion times:
       method mean_seconds
1:     base_r    0.8367011
2: data.table    1.5592681
3:      dplyr    2.6420171
4:     dtplyr  151.0217646
The “eager” dtplyr system is about 100 times slower than data.table. This trivial task is one of the few times that data.table isn’t by far the fastest implementation (in tasks involving grouped summaries, joins, and other non-trivial operations, data.table typically has a large performance advantage, ref).
Here is the same data presented graphically.
This is why we don’t consider “eager” the proper way to call data.table: it artificially makes data.table appear slow. This is the negative impression of data.table that the dplyr/dtplyr adaptors have been falsely giving dplyr users for the last five years. dplyr users either felt they were getting the performance of data.table through dplyr (if they didn’t check timings), or got a (false) negative impression of data.table (if they did check timings).
Details of the timings can be found here.
As we have said: the “don’t force so many extra copies” methodology has been in rqdatatable for quite some time, and in fact works well. Some timings on a similar problem are shared here.
Notice the two rqdatatable timings have some translation overhead. This is why using data.table directly is, in general, going to be a superior methodology.
I have been meaning to write this for a while, but with the dplyr vs data.table feud rising to new levels on Twitter over the last couple of days, it all of a sudden seems more relevant. For those who don’t know what I am talking about: there are different ways of doing data science. There are the two major languages, R and Python, each with its own implementations for analysing data. Then within R there are the different flavours of using the base language or applying the functions of the tidyverse. Within the tidyverse there is the dplyr package for data wrangling, whose functionality greatly overlaps with that of the data.table package. Each of these choices is wildly contested by users of both of the options. Oftentimes, these debates are first presented as objective comparisons between two options in which the favoured option clearly stands out. This then evokes fierce responses from the other camp, and before you know it we are down to the point where we call each other’s choices not elegant or even ugly.
I hope to convince you that these debates are useless by looking into some of the underlying psychological principles that make us vulnerable to this type of quarreling.
The main reason I think the discussion of the merits of different implementations is fruitless is that the two camps can never fully understand each other. Sure, objective comparisons can be made in computation speed and, to some extent, functionality. We can read from the authors and maintainers what the motivation is for implementing something in a certain way. But the proof of the pudding remains in the eating, that is, enabling the user to put it to use effectively. Before you can effectively and joyously apply a complex system (which all these implementations are), you need to spend countless hours sweating and swearing, reading documents and googling error messages. Day after day, hour after hour, you will fight yourself into mastery of one of the systems. To be most effective and consistent, almost everybody will have a go-to system in which day-to-day tasks will be done. You don’t roll a die in the morning to decide if this is going to be an R or a Python day. You don’t switch from data.table to dplyr in the middle of an analysis without a very good reason. You stick with what you know, because it will give you the answer you are after the quickest. There is a major path dependency here: you initially start with one of the systems, and each time you use it your understanding and appreciation grow. Because of this you will keep using the system, and so your love affair begins. Before you know it, you have a cognitive lock-in on your weapon of choice.
A big part of the appreciation for a system comes from your understanding of it. This understanding relieves a little of the large cognitive strain of doing data science. In the phenomenal Thinking, Fast and Slow by Daniel Kahneman, a full chapter is devoted to the topic of cognitive ease. It is shown that you get a good feeling from things that are relatively easy for you. There is a very good evolutionary explanation for this: your brain consumes massive amounts of energy, so parsimonious use of it is rewarded by feeling great. It is your body telling you: keep doing this not-so-hard thing, you are going to last a long time this way. Because of the time you have spent developing skills in your favourite system, looking at code from that system will give you a lot more cognitive ease than looking at code from a system you are far less familiar with. Understanding code from the unfamiliar system at the same level as the familiar one requires spending a lot of cognitive resources. This will be accompanied by negative emotions such as frustration, tiredness, and even anger. It is then a very understandable but also very silly mistake to take these emotions as an indication that the unfamiliar system is poorly implemented or even ugly. It is ignorance, not the software being bad, that caused these emotions.
From cognitive psychology we turn to social psychology. In the seventies, Henri Tajfel developed the theory of social identity. Part of the way we view ourselves is determined by the groups we are part of. If a group we belong to does well by some measure, we personally start to feel better about ourselves, even when we had no part in the achievement. Just look at the crowds that celebrate at a victory parade of the local sports team: they did not spend a minute on the field, they might not even have been at the stadium supporting the players, and still they experience the victory as theirs as well. Using a software system for a good part of your waking hours, day after day, will inevitably lead to that system becoming part of your social identity. Gradually you are not just using the software, you are becoming the software as well.
On its own this is not a bad thing; as humans we need this sense of belonging. However, it will also change your behaviour, and oftentimes not for the good. In order to boost your self-esteem, you want your “team” to be on the winning end. Each year when Stack Overflow shares the rises and falls in the use of software languages, users of both R and Python jubilantly share the results (both are growing year after year). While this is an objective development in which you had no part other than being one of the users, discussions about the merits of software systems have active user involvement. Probably the easiest way to make your team look better is to make the other team look worse. Just think about the tiring debates between fans of different sports teams over which is the best, or the endless mud-throwing of political races. Flame wars are no different: by mocking the other system we celebrate our own, and we are part of the winning team.
Now, here is the good news. In sports and politics, the different parties are objectively in a competition. They play a zero-sum game, in which one’s victory must mean the other’s defeat. We as a data science community are not in a zero-sum game, and we often seem to forget it. Even when the ‘competing’ system is on the rise, you can do your job effectively and with joy in the one you prefer. Instead of mocking each other we should be thankful for the wealth of options we have to do our jobs. When our primary system does not offer the functionality we are looking for, we might find it in the other. The different systems can also influence each other positively: functionality in other systems might inspire authors to make theirs more complete.
I have looked into two psychological mechanisms that I think stir up flame wars. I hope it will make you think again before posting a comment on Twitter or starting a heated discussion with a colleague that leads nowhere. We have several options for doing our daily jobs, and each of them has proven itself in practice. Each is used by at least tens of thousands of analysts and programmers, who use them to bring real value to real organisations. Mocking one of them is not only harmful, it is disrespectful. Only the brightest and most determined of our peers could create systems that are so complex, complete and fault-free. They have committed thousands of hours, often unpaid, to serve the community because they care. Mocking their labour because you don’t properly understand the system they designed, or because you want to feel better about yourself, is ignorant, and you should refrain from it.
Seeking an individual passionate about undertaking research in an area of Artificial Intelligence (AI), as well as multidisciplinary research through the application of AI to problems, who will be responsible for conducting research in areas of AI.
In the past decade, the amount of available genomic data has exploded as the price of genome sequencing has dropped. Researchers are now able to scan for associations between genetic variation and diseases across cohorts of hundreds of thousands of individuals from projects such as the UK Biobank. These analyses will lead to a deeper understanding of the root causes of disease that will lead to treatments for some of today’s most important health problems. However, the tools to analyze these data sets have not kept pace with the growth in data.
Many users are accustomed to using command line tools like plink or single-node Python and R scripts to work with genomic data. However, single-node tools will not suffice at terabyte scale and beyond. The Hail project from the Broad Institute builds on top of Spark to distribute computation to multiple nodes, but it requires users to learn a new API in addition to Spark and encourages data to be stored in a Hail-specific file format. Since genomic data holds value not in isolation but as one input to analyses that combine disparate sources such as medical records, insurance claims, and medical images, a separate system can cause serious complications.
We believe that Spark SQL, which has become the de facto standard for working with massive datasets of all different flavors, represents the most direct path to simple, scalable genomic workflows. Spark SQL is used for extracting, transforming, and loading (ETL) big data in a distributed fashion. ETL is 90% of the effort involved in bioinformatics, from extracting mutations, annotating them with external data sources, to preparing them for downstream statistical and machine learning analysis. Spark SQL contains high-level APIs in languages such as Python or R that are simple to learn and result in code that is easier to read and maintain than more traditional bioinformatics approaches. In this post, we will introduce the readers and writers that provide a robust, flexible connection between genomic data and Spark SQL.
Our readers are implemented as Spark SQL data sources, so VCF and BGEN can be read into a Spark DataFrame as simply as any other file type. In Python, reading a directory of VCF files looks like this:
spark.read \
  .format("com.databricks.vcf") \
  .option("includeSampleIds", True) \
  .option("flattenInfoFields", True) \
  .load("/databricks-datasets/genomics/1kg-vcfs")
The data types defined in the VCF header are translated to a schema for the output DataFrame. The VCF files in this example contain a number of annotations that become queryable fields:
The contents of a VCF file in a Spark SQL DataFrame
Fields that apply to each sample in a cohort—like the called genotype—are stored in an array, which enables fast aggregation for all samples at each site.
The array of per-sample genotype fields
As those who work with VCF files know all too well, the VCF specification leaves room for ambiguity in data formatting that can cause tools to fail in unexpected ways. We aimed to create a robust solution that accepts malformed records by default and then allows our users to choose filtering criteria. For instance, one of our customers used our reader to ingest problematic files where some probability values were stored as "nan" instead of "NaN", which most Java-based tools require. Handling these simple issues automatically allows our users to focus on understanding what their data mean, not whether they are properly formatted. To verify the robustness of our reader, we have tested it against VCF files generated by common tools such as GATK and Edico Genomics as well as files from data sharing initiatives.
BGEN files such as those distributed by the UK Biobank initiative can be handled similarly. The code to read a BGEN file looks nearly identical to our VCF example:
spark.read.format("com.databricks.bgen").load(bgen_path)
These file readers produce compatible schemas that allow users to write pipelines that work for different sources of variation data and enable merging of different genomic datasets. For instance, the VCF reader can take a directory of files with differing INFO fields and return a single DataFrame that contains the common fields. The following commands read in data from BGEN and VCF files and merge them to create a single dataset:
vcf_df = spark.read.format("com.databricks.vcf").load(vcf_path)
bgen_df = spark.read.format("com.databricks.bgen") \
  .schema(vcf_df.schema).load(bgen_path)
big_df = vcf_df.union(bgen_df)  # All my genotypes!!
Since our file readers return vanilla Spark SQL DataFrames, you can ingest variant data using any of the programming languages supported by Spark, like Python, R, Scala, Java, or pure SQL. Specialized frontend APIs such as Koalas, which implements the pandas dataframe API on Apache Spark, and sparklyr work seamlessly as well.
Since each variant-level annotation (the INFO fields in a VCF) corresponds to a DataFrame column, queries can easily access these values. For example, we can count the number of biallelic variants with minor allele frequency less than 0.05:
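The query itself appears as a screenshot in the original post; a PySpark sketch of it might look like the following, where the alternateAlleles column name follows the flattened schema shown above and should be treated as an assumption:

from pyspark.sql import functions as fx

# biallelic sites have exactly one alternate allele; INFO_AF holds allele frequencies
df.where((fx.size("alternateAlleles") == 1) & fx.expr("INFO_AF[0] < 0.05")).count()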
Spark 2.4 introduced higher-order functions that simplify queries over array data. We can take advantage of this feature to manipulate the array of genotypes. To filter the genotypes array so that it only contains samples with at least one variant allele, we can write a query like this:
Manipulating the genotypes array with higher order functions
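The original post shows this query as an image; a sketch of such a filter using Spark 2.4's filter and exists higher-order functions might look like this (the genotypes column and its calls field follow the schema shown above and are assumptions):

# keep only genotype entries with at least one non-reference (non-zero) allele call
df.selectExpr(
    "filter(genotypes, g -> exists(g.calls, c -> c > 0)) AS genotypes"
)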
If you have tabix indexes for your VCF files, our data source will push filters on genomic locus to the index and minimize I/O costs. Even as datasets grow beyond the size that a single machine can support, simple queries still complete at interactive speeds.
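For illustration, a locus filter like the following sketch could benefit from that pushdown; the contig name and coordinates here are made up:

from pyspark.sql import functions as fx

# restrict to a small region of chromosome 22; such predicates can be pushed
# down to the tabix index so that only the matching file blocks are read
df.where(
    (fx.col("contigName") == "22") &
    (fx.col("start") >= 16_000_000) &
    (fx.col("start") < 16_100_000)
)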
As we mentioned when we discussed ingesting variation data, any language that Spark supports can be used to write queries. The above statements can be combined into a single SQL query:
Querying a VCF file with SQL
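A sketch of what that combined query could look like, assuming the DataFrame has first been registered as a temporary view (the view name vcf, like the column names above, is an assumption):

# assumes: df.createOrReplaceTempView("vcf")
spark.sql("""
    SELECT count(*)
    FROM vcf
    WHERE size(alternateAlleles) = 1
      AND INFO_AF[0] < 0.05
""").show()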
We believe that in the near future, organizations will store and manage their genomic data just as they do with other data types, using technologies like Delta Lake. However, we understand that it’s important to have backward compatibility with familiar file formats for sharing with collaborators or working with legacy tools.
We can build on our filtering example to create a block gzipped VCF file that contains all variants with allele frequency less than 5%:
from pyspark.sql import functions as fx  # fx is assumed to be pyspark.sql.functions

df.where(fx.expr("INFO_AF[0] < 0.05")) \
  .orderBy("contigName", "start") \
  .write.format("com.databricks.bigvcf") \
  .save("output.vcf.bgz")
This command sorts, serializes, and uploads each segment of the output VCF in parallel, so you can safely output cohort-scale VCFs. It’s also possible to export one VCF per chromosome or at even finer granularity.
Saving the same data to a BGEN file requires only one small modification to the code:
df.where(fx.expr("INFO_AF[0] < 0.05")) \
  .orderBy("contigName", "start") \
  .write.format("com.databricks.bigbgen") \
  .save("output.bgen")
Ingesting data into Spark is the first step of most big data pipelines, but it’s hardly the end of the journey. In the next few weeks, we’ll have more blog posts that demonstrate how features built on top of these readers and writers can scale and simplify genomic workloads. Stay tuned!
Our Spark SQL readers make it easy to ingest large variation datasets with a small amount of code (Azure | AWS). Learn more about our genomics solutions in the Databricks Unified Analytics for Genomics and try out a preview today.
--
Try Databricks for free. Get started today.
The post Scaling Genomic Workflows with Spark SQL BGEN and VCF Readers appeared first on Databricks.
Gradient descent is an optimization algorithm used for minimizing the cost function in various ML algorithms. Here are some common gradient descent optimization algorithms used in popular deep learning frameworks such as TensorFlow and Keras.
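As a quick refresher, here is a minimal sketch of plain gradient descent on a one-parameter squared-error cost; everything in it is illustrative rather than taken from the linked post:

# minimize f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3)
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)   # gradient of the cost at the current w
    w -= lr * grad       # step against the gradient
print(w)  # converges towards the minimizer w = 3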
In this article, you learn how to do linear algebra for Machine Learning and Deep Learning in R. In particular, I will discuss: Matrix Multiplication, Solve System of Linear Equations, Identity Matrix, Matrix Inverse, Solve System of Linear Equations Revisited, Finding the Determinant, Matrix Norm, Frobenius Norm, Special Matrices and Vectors, Eigendecomposition, Singular Value Decomposition, Moore-Penrose Pseudoinverse, and Matrix Trace.
Linear algebra is a branch of mathematics that is widely used throughout data science. Yet because linear algebra is a form of continuous rather than discrete mathematics, many data scientists have little experience with it. A good understanding of linear algebra is essential for understanding and working with machine learning and deep learning algorithms. This article is particularly aimed at linear algebra for these two disciplines. Let us dive into the world of linear algebra for machine learning and deep learning with R:
Let us start with matrix multiplication in R:
A <- matrix(data = 1:36, nrow = 6)
A
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    7   13   19   25   31
[2,]    2    8   14   20   26   32
[3,]    3    9   15   21   27   33
[4,]    4   10   16   22   28   34
[5,]    5   11   17   23   29   35
[6,]    6   12   18   24   30   36

B <- matrix(data = 1:30, nrow = 6)
B
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    7   13   19   25
[2,]    2    8   14   20   26
[3,]    3    9   15   21   27
[4,]    4   10   16   22   28
[5,]    5   11   17   23   29
[6,]    6   12   18   24   30

A %*% B
     [,1] [,2] [,3] [,4] [,5]
[1,]  441 1017 1593 2169 2745
[2,]  462 1074 1686 2298 2910
[3,]  483 1131 1779 2427 3075
[4,]  504 1188 1872 2556 3240
[5,]  525 1245 1965 2685 3405
[6,]  546 1302 2058 2814 3570
Let us now try Hadamard (element-wise) multiplication in R:
A <- matrix(data = 1:36, nrow = 6)
A
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    7   13   19   25   31
[2,]    2    8   14   20   26   32
[3,]    3    9   15   21   27   33
[4,]    4   10   16   22   28   34
[5,]    5   11   17   23   29   35
[6,]    6   12   18   24   30   36

B <- matrix(data = 11:46, nrow = 6)
B
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   11   17   23   29   35   41
[2,]   12   18   24   30   36   42
[3,]   13   19   25   31   37   43
[4,]   14   20   26   32   38   44
[5,]   15   21   27   33   39   45
[6,]   16   22   28   34   40   46

A * B
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   11  119  299  551  875 1271
[2,]   24  144  336  600  936 1344
[3,]   39  171  375  651  999 1419
[4,]   56  200  416  704 1064 1496
[5,]   75  231  459  759 1131 1575
[6,]   96  264  504  816 1200 1656
Let us now compute a dot product in R:
X <- matrix(data = 1:10, nrow = 10)
X
      [,1]
 [1,]    1
 [2,]    2
 [3,]    3
 [4,]    4
 [5,]    5
 [6,]    6
 [7,]    7
 [8,]    8
 [9,]    9
[10,]   10

Y <- matrix(data = 11:20, nrow = 10)
Y
      [,1]
 [1,]   11
 [2,]   12
 [3,]   13
 [4,]   14
 [5,]   15
 [6,]   16
 [7,]   17
 [8,]   18
 [9,]   19
[10,]   20

Let us now create a dot product function in R:

dotProduct <- function(X, Y) {
  as.vector(t(X) %*% Y)
}
dotProduct(X, Y)
[1] 935
Let us look at Properties of Matrix Multiplication in R:
#1 Matrix Property: matrix multiplication is distributive
A <- matrix(data = 1:25, nrow = 5)
B <- matrix(data = 26:50, nrow = 5)
C <- matrix(data = 51:75, nrow = 5)

A %*% (B + C)
     [,1] [,2] [,3] [,4] [,5]
[1,] 4555 5105 5655 6205 6755
[2,] 4960 5560 6160 6760 7360
[3,] 5365 6015 6665 7315 7965
[4,] 5770 6470 7170 7870 8570
[5,] 6175 6925 7675 8425 9175

A %*% B + A %*% C
     [,1] [,2] [,3] [,4] [,5]
[1,] 4555 5105 5655 6205 6755
[2,] 4960 5560 6160 6760 7360
[3,] 5365 6015 6665 7315 7965
[4,] 5770 6470 7170 7870 8570
[5,] 6175 6925 7675 8425 9175
#2 Matrix Property: matrix multiplication is associative
A <- matrix(data = 1:25, nrow = 5)
B <- matrix(data = 26:50, nrow = 5)
C <- matrix(data = 51:75, nrow = 5)

(A %*% B) %*% C
       [,1]   [,2]   [,3]   [,4]    [,5]
[1,] 569850 623350 676850 730350  783850
[2,] 620450 678700 736950 795200  853450
[3,] 671050 734050 797050 860050  923050
[4,] 721650 789400 857150 924900  992650
[5,] 772250 844750 917250 989750 1062250

A %*% (B %*% C)
       [,1]   [,2]   [,3]   [,4]    [,5]
[1,] 569850 623350 676850 730350  783850
[2,] 620450 678700 736950 795200  853450
[3,] 671050 734050 797050 860050  923050
[4,] 721650 789400 857150 924900  992650
[5,] 772250 844750 917250 989750 1062250
#3 Matrix Property: matrix multiplication is not commutative
A <- matrix(data = 1:25, nrow = 5)
B <- matrix(data = 26:50, nrow = 5)

A %*% B
     [,1] [,2] [,3] [,4] [,5]
[1,] 1590 1865 2140 2415 2690
[2,] 1730 2030 2330 2630 2930
[3,] 1870 2195 2520 2845 3170
[4,] 2010 2360 2710 3060 3410
[5,] 2150 2525 2900 3275 3650

B %*% A
     [,1] [,2] [,3] [,4] [,5]
[1,]  590 1490 2390 3290 4190
[2,]  605 1530 2455 3380 4305
[3,]  620 1570 2520 3470 4420
[4,]  635 1610 2585 3560 4535
[5,]  650 1650 2650 3650 4650
Let us look at Matrix Transpose in R:
A <- matrix(data = 1:25, nrow = 5, ncol = 5, byrow = TRUE)
A
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20
[5,]   21   22   23   24   25

t(A)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
Let us look at Matrix Transpose Property in R:
A <- matrix(data = 1:25, nrow = 5)
B <- matrix(data = 25:49, nrow = 5)

t(A %*% B)
     [,1] [,2] [,3] [,4] [,5]
[1,] 1535 1670 1805 1940 2075
[2,] 1810 1970 2130 2290 2450
[3,] 2085 2270 2455 2640 2825
[4,] 2360 2570 2780 2990 3200
[5,] 2635 2870 3105 3340 3575

t(B) %*% t(A)
     [,1] [,2] [,3] [,4] [,5]
[1,] 1535 1670 1805 1940 2075
[2,] 1810 1970 2130 2290 2450
[3,] 2085 2270 2455 2640 2825
[4,] 2360 2570 2780 2990 3200
[5,] 2635 2870 3105 3340 3575
Now let us solve a system of linear equations in R:
Ax = B
A <- matrix(data = c(1, 3, 2, 4, 2, 4, 3, 5, 1, 6, 7, 2, 1, 5, 6, 7), nrow = 4, byrow = TRUE) A [,1] [,2] [,3] [,4] [1,] 1 3 2 4 [2,] 2 4 3 5 [3,] 1 6 7 2 [4,] 1 5 6 7
B <- matrix(data = c(1, 2, 3, 4), nrow = 4) B [,1] [1,] 1 [2,] 2 [3,] 3 [4,] 4
solve(a = A, b = B) [,1] [1,] 0.6153846 [2,] -0.8461538 [3,] 1.0000000 [4,] 0.2307692
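We can verify the solution by multiplying it back into A (a quick sanity check):

x <- solve(a = A, b = B)
A %*% x # reproduces B: 1, 2, 3, 4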
Now let us look at the Identity Matrix in R:
I <- diag(x = 1, nrow = 5, ncol = 5) I [,1] [,2] [,3] [,4] [,5] [1,] 1 0 0 0 0 [2,] 0 1 0 0 0 [3,] 0 0 1 0 0 [4,] 0 0 0 1 0 [5,] 0 0 0 0 1
A <- matrix(data = 1:25, nrow = 5)
A %*% I [,1] [,2] [,3] [,4] [,5] [1,] 1 6 11 16 21 [2,] 2 7 12 17 22 [3,] 3 8 13 18 23 [4,] 4 9 14 19 24 [5,] 5 10 15 20 25
I %*% A [,1] [,2] [,3] [,4] [,5] [1,] 1 6 11 16 21 [2,] 2 7 12 17 22 [3,] 3 8 13 18 23 [4,] 4 9 14 19 24 [5,] 5 10 15 20 25
Now let us compute the Matrix Inverse in R:
A <- matrix(data = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 3), nrow = 5)
A [,1] [,2] [,3] [,4] [,5] [1,] 1 3 3 8 4 [2,] 2 4 4 9 5 [3,] 3 5 5 1 6 [4,] 1 6 6 2 7 [5,] 2 2 7 3 3
library(MASS) ginv(A) [,1] [,2] [,3] [,4] [,5] [1,] -0.3333333 0.3333333 0.3333333 -3.333333e-01 1.040834e-16 [2,] -4.0888889 3.6444444 -1.2222222 8.666667e-01 -2.000000e-01 [3,] -0.3555556 0.2444444 -0.2222222 1.333333e-01 2.000000e-01 [4,] -0.1111111 0.2222222 -0.1111111 -6.938894e-18 2.602085e-18 [5,] 3.8888889 -3.4444444 1.2222222 -6.666667e-01 -2.664535e-15
ginv(A) %*% A [,1] [,2] [,3] [,4] [,5] [1,] 1.000000e+00 -6.800116e-16 -1.595946e-16 9.020562e-17 -1.020017e-15 [2,] 8.881784e-16 1.000000e+00 -5.329071e-15 -1.287859e-14 -2.464695e-14 [3,] -1.665335e-16 -1.387779e-15 1.000000e+00 -1.332268e-15 -1.998401e-15 [4,] -2.237793e-16 -8.135853e-16 -8.005749e-16 1.000000e+00 -1.262011e-15 [5,] 0.000000e+00 1.953993e-14 6.217249e-15 1.265654e-14 1.000000e+00
A %*% ginv(A) [,1] [,2] [,3] [,4] [,5] [1,] 1.000000e+00 1.776357e-15 -1.776357e-15 2.220446e-15 -1.200429e-15 [2,] -7.105427e-15 1.000000e+00 -1.776357e-15 1.776357e-15 -5.316927e-16 [3,] -3.552714e-15 0.000000e+00 1.000000e+00 1.776357e-15 1.136244e-16 [4,] 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 5.204170e-18 [5,] -5.329071e-15 5.329071e-15 -8.881784e-16 1.998401e-15 1.000000e+00
Let us now revisit Solving a System of Linear Equations in R, this time using the pseudoinverse:
A <- matrix(data = c(1, 3, 2, 4, 2, 4, 3, 5, 1, 6, 7, 2, 1, 5, 6, 7), nrow = 4, byrow = TRUE)
A [,1] [,2] [,3] [,4] [1,] 1 3 2 4 [2,] 2 4 3 5 [3,] 1 6 7 2 [4,] 1 5 6 7
B <- matrix(data = c(1, 2, 3, 4), nrow = 4) B [,1] [1,] 1 [2,] 2 [3,] 3 [4,] 4
library(MASS) X <- ginv(A) %*% B X [,1] [1,] 0.6153846 [2,] -0.8461538 [3,] 1.0000000 [4,] 0.2307692
Let us find the Determinant of a Matrix in R:
A <- matrix(data = c(1, 3, 2, 4, 2, 4, 3, 5, 1, 6, 7, 2, 1, 5, 6, 7), nrow = 4, byrow = TRUE)
A [,1] [,2] [,3] [,4] [1,] 1 3 2 4 [2,] 2 4 3 5 [3,] 1 6 7 2 [4,] 1 5 6 7
det(A) [1] -39
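Two familiar determinant identities can also be checked numerically (a short sketch, not part of the original walkthrough; D is a hypothetical second matrix introduced for the product rule):

all.equal(det(t(A)), det(A)) # TRUE: the determinant is transpose-invariant
D <- diag(x = 2, nrow = 4)
all.equal(det(A %*% D), det(A) * det(D)) # TRUE: the determinant is multiplicative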
Let us now look at the Matrix Norm:
lpNorm <- function(A, p) {
  if (p >= 1 && dim(A)[[2]] == 1 && is.infinite(p) == FALSE) {
    # Lp norm: (sum of |a_i|^p)^(1/p)
    sum((apply(X = A, MARGIN = 1, FUN = abs)) ** p) ** (1 / p)
  } else if (p >= 1 && dim(A)[[2]] == 1 && is.infinite(p)) {
    # Max Norm: max |a_i|
    max(apply(X = A, MARGIN = 1, FUN = abs))
  } else {
    invisible(NULL)
  }
}
lpNorm(A = matrix(data = 1:10), p = 1) [1] 55
lpNorm(A = matrix(data = 1:10), p = 2) #Euclidean Distance [1] 19.62142
lpNorm(A = matrix(data = 1:10), p = 3) [1] 14.46245
lpNorm(A = matrix(data = -100:10), p = Inf) [1] 100
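For column vectors, these values can be cross-checked against base R's norm() (a sanity check, not part of the original walkthrough; type = "O" is the one norm, "F" the Frobenius/Euclidean norm, and "I" the infinity norm):

norm(matrix(data = 1:10), type = "O")    # 55
norm(matrix(data = 1:10), type = "F")    # 19.62142
norm(matrix(data = -100:10), type = "I") # 100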
Let us now verify some defining Properties of the Matrix Norm in R (zero only at the zero vector, the triangle inequality, and absolute homogeneity):
lpNorm(A = matrix(data = rep(0, 10)), p = 1) == 0 [1] TRUE
lpNorm(A = matrix(data = 1:10) + matrix(data = 11:20), p = 1) <= lpNorm(A = matrix(data = 1:10), p = 1) + lpNorm(A = matrix(data = 11:20), p = 1) [1] TRUE
tempFunc <- function(i) { lpNorm(A = i * matrix(data = 1:10), p = 1) == abs(i) * lpNorm(A = matrix(data = 1:10), p = 1) }
all(sapply(X = -10:10, FUN = tempFunc)) [1] TRUE
The Frobenius norm, sometimes also called the Euclidean norm (a term unfortunately also used for the vector L^2 norm), is the matrix norm of an m×n matrix A defined as the square root of the sum of the absolute squares of its elements. Of the three matrix norms above, the Frobenius norm is the only one that is unitarily invariant, i.e., it is conserved or invariant under a unitary transformation.
Let us compute the Frobenius Norm in R:
frobeniusNorm <- function(A) { (sum((as.numeric(A)) ** 2)) ** (1 / 2) }
frobeniusNorm(A = matrix(data = 1:25, nrow = 5)) [1] 74.33034
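This matches base R's built-in Frobenius norm computation:

norm(matrix(data = 1:25, nrow = 5), type = "F") # 74.33034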
Let us look at Special Matrices and Vectors in R:
#1 Special Matrix: The Diagonal Matrix
A <- diag(x = c(1:5, 6, 1, 2, 3, 4), nrow = 10) A [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 1 0 0 0 0 0 0 0 0 0 [2,] 0 2 0 0 0 0 0 0 0 0 [3,] 0 0 3 0 0 0 0 0 0 0 [4,] 0 0 0 4 0 0 0 0 0 0 [5,] 0 0 0 0 5 0 0 0 0 0 [6,] 0 0 0 0 0 6 0 0 0 0 [7,] 0 0 0 0 0 0 1 0 0 0 [8,] 0 0 0 0 0 0 0 2 0 0 [9,] 0 0 0 0 0 0 0 0 3 0 [10,] 0 0 0 0 0 0 0 0 0 4
X <- matrix(data = 21:30) X [,1] [1,] 21 [2,] 22 [3,] 23 [4,] 24 [5,] 25 [6,] 26 [7,] 27 [8,] 28 [9,] 29 [10,] 30
A %*% X [,1] [1,] 21 [2,] 44 [3,] 69 [4,] 96 [5,] 125 [6,] 156 [7,] 27 [8,] 56 [9,] 87 [10,] 120
library(MASS) ginv(A) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 1 0.0 0.0000000 0.00 0.0 0.0000000 0 0.0 0.0000000 0.00 [2,] 0 0.5 0.0000000 0.00 0.0 0.0000000 0 0.0 0.0000000 0.00 [3,] 0 0.0 0.3333333 0.00 0.0 0.0000000 0 0.0 0.0000000 0.00 [4,] 0 0.0 0.0000000 0.25 0.0 0.0000000 0 0.0 0.0000000 0.00 [5,] 0 0.0 0.0000000 0.00 0.2 0.0000000 0 0.0 0.0000000 0.00 [6,] 0 0.0 0.0000000 0.00 0.0 0.1666667 0 0.0 0.0000000 0.00 [7,] 0 0.0 0.0000000 0.00 0.0 0.0000000 1 0.0 0.0000000 0.00 [8,] 0 0.0 0.0000000 0.00 0.0 0.0000000 0 0.5 0.0000000 0.00 [9,] 0 0.0 0.0000000 0.00 0.0 0.0000000 0 0.0 0.3333333 0.00 [10,] 0 0.0 0.0000000 0.00 0.0 0.0000000 0 0.0 0.0000000 0.25
#2 Special Matrix: The Symmetric Matrix
A <- matrix(data = c(1, 2, 2, 1), nrow = 2) A [,1] [,2] [1,] 1 2 [2,] 2 1
all(A == t(A)) [1] TRUE
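Base R also ships isSymmetric(), which performs the same check:

isSymmetric(A) # TRUE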
#3 Special Vector: The Unit Vector
lpNorm(A = matrix(data = c(1, 0, 0, 0)), p = 2) [1] 1
#4 Special Vectors: Orthogonal Vectors
X <- matrix(data = c(11, 0, 0, 0)) Y <- matrix(data = c(0, 11, 0, 0))
all(t(X) %*% Y == 0) [1] TRUE
#5 Special Vectors: Orthonormal Vectors
X <- matrix(data = c(1, 0, 0, 0)) Y <- matrix(data = c(0, 1, 0, 0))
lpNorm(A = X, p = 2) == 1 [1] TRUE
lpNorm(A = Y, p = 2) == 1 [1] TRUE
all(t(X) %*% Y == 0) [1] TRUE
#6 Special Matrix: The Orthogonal Matrix
A <- matrix(data = c(1, 0, 0, 0, 1, 0, 0, 0, 1), nrow = 3, byrow = TRUE)
A [,1] [,2] [,3] [1,] 1 0 0 [2,] 0 1 0 [3,] 0 0 1
all(t(A) %*% A == A %*% t(A)) [1] TRUE
all(t(A) %*% A == diag(x = 1, nrow = 3)) [1] TRUE
library(MASS) all(t(A) == ginv(A)) [1] TRUE
Let us now look at Eigendecomposition in R:
A <- matrix(data = 1:25, nrow = 5, byrow = TRUE) A [,1] [,2] [,3] [,4] [,5] [1,] 1 2 3 4 5 [2,] 6 7 8 9 10 [3,] 11 12 13 14 15 [4,] 16 17 18 19 20 [5,] 21 22 23 24 25
y <- eigen(x = A) library(MASS) all.equal(y$vectors %*% diag(y$values) %*% ginv(y$vectors), A) [1] TRUE
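For a symmetric matrix, the eigenvectors are orthonormal, so the eigenvector matrix can be inverted by transposition and no ginv() is needed (a short sketch using the symmetric matrix A + t(A)):

S <- A + t(A) # symmetric by construction
ys <- eigen(x = S)
all.equal(ys$vectors %*% diag(ys$values) %*% t(ys$vectors), S) # TRUE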
Now let us look at Singular Value Decomposition in R:
A <- matrix(data = 1:36, nrow = 6, byrow = TRUE) A [,1] [,2] [,3] [,4] [,5] [,6] [1,] 1 2 3 4 5 6 [2,] 7 8 9 10 11 12 [3,] 13 14 15 16 17 18 [4,] 19 20 21 22 23 24 [5,] 25 26 27 28 29 30 [6,] 31 32 33 34 35 36
y <- svd(x = A) y $d [1] 1.272064e+02 4.952580e+00 1.068280e-14 3.258502e-15 9.240498e-16 [6] 6.865073e-16 $u [,1] [,2] [,3] [,4] [,5] [1,] -0.06954892 -0.72039744 0.6716423 -0.11924367 0.08965916 [2,] -0.18479698 -0.51096788 -0.6087484 -0.06762569 0.44007566 [3,] -0.30004504 -0.30153832 -0.3722328 -0.09266448 -0.41109295 [4,] -0.41529310 -0.09210875 0.0313011 0.21692481 -0.67511264 [5,] -0.53054116 0.11732081 0.1308779 0.71086492 0.37490569 [6,] -0.64578922 0.32675037 0.1471598 -0.64825589 0.18156508 [,6] [1,] -0.05319067 [2,] 0.36871061 [3,] -0.70915885 [4,] 0.56145739 [5,] -0.20432734 [6,] 0.03650885 $v [,1] [,2] [,3] [,4] [,5] [,6] [1,] -0.3650545 0.62493577 0.54215504 0.08199306 -0.1033873 -0.4060131 [2,] -0.3819249 0.38648609 -0.23874067 -0.40371901 0.5758949 0.3913066 [3,] -0.3987952 0.14803642 -0.75665994 0.29137287 -0.2722608 -0.2957858 [4,] -0.4156655 -0.09041326 0.14938782 -0.18121587 -0.6652579 0.5668542 [5,] -0.4325358 -0.32886294 0.21539167 0.69322385 0.3606549 0.2184882 [6,] -0.4494062 -0.56731262 0.08846608 -0.48165491 0.1043562 -0.4748500
all.equal(y$u %*% diag(y$d) %*% t(y$v), A) [1] TRUE
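Note that only the first two singular values in y$d are non-negligible, so A has rank 2 and a truncated rank-2 SVD reconstructs it up to floating-point error (a short sketch):

k <- 2
all.equal(y$u[, 1:k] %*% diag(y$d[1:k]) %*% t(y$v[, 1:k]), A) # TRUE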
Now let us look at Moore-Penrose Pseudoinverse in R:
A <- matrix(data = 1:25, nrow = 5) A [,1] [,2] [,3] [,4] [,5] [1,] 1 6 11 16 21 [2,] 2 7 12 17 22 [3,] 3 8 13 18 23 [4,] 4 9 14 19 24 [5,] 5 10 15 20 25
B <- ginv(A) B [,1] [,2] [,3] [,4] [,5] [1,] -0.152 -0.08 -8.00000e-03 0.064 0.136 [2,] -0.096 -0.05 -4.00000e-03 0.042 0.088 [3,] -0.040 -0.02 -9.97466e-18 0.020 0.040 [4,] 0.016 0.01 4.00000e-03 -0.002 -0.008 [5,] 0.072 0.04 8.00000e-03 -0.024 -0.056
y <- svd(A) all.equal(y$v %*% ginv(diag(y$d)) %*% t(y$u), B) [1] TRUE
Lastly, let us look at the Matrix Trace in R:
A <- diag(x = 1:10) A [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 1 0 0 0 0 0 0 0 0 0 [2,] 0 2 0 0 0 0 0 0 0 0 [3,] 0 0 3 0 0 0 0 0 0 0 [4,] 0 0 0 4 0 0 0 0 0 0 [5,] 0 0 0 0 5 0 0 0 0 0 [6,] 0 0 0 0 0 6 0 0 0 0 [7,] 0 0 0 0 0 0 7 0 0 0 [8,] 0 0 0 0 0 0 0 8 0 0 [9,] 0 0 0 0 0 0 0 0 9 0 [10,] 0 0 0 0 0 0 0 0 0 10
library(psych) tr(A) [1] 55
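If you prefer not to load psych, the trace is simply the sum of the diagonal in base R:

sum(diag(A)) # 55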
We can also code a function that calculates the Frobenius Norm via the Trace, using the identity that the Frobenius norm of A equals sqrt(tr(t(A) %*% A)):
alternativeFrobeniusNorm <- function(A) { sqrt(tr(t(A) %*% A)) } alternativeFrobeniusNorm(A) [1] 19.62142
frobeniusNorm(A) [1] 19.62142
all.equal(tr(A), tr(t(A))) [1] TRUE
To demonstrate that the trace is invariant under cyclic permutations of a product, let us first create a diagonal matrix A with diagonal 1:5:
A <- diag(x = 1:5) A [,1] [,2] [,3] [,4] [,5] [1,] 1 0 0 0 0 [2,] 0 2 0 0 0 [3,] 0 0 3 0 0 [4,] 0 0 0 4 0 [5,] 0 0 0 0 5
Next, a diagonal matrix B with diagonal 6:10:
B <- diag(x = 6:10) B [,1] [,2] [,3] [,4] [,5] [1,] 6 0 0 0 0 [2,] 0 7 0 0 0 [3,] 0 0 8 0 0 [4,] 0 0 0 9 0 [5,] 0 0 0 0 10
And a diagonal matrix C with diagonal 11:15:
C <- diag(x = 11:15) C [,1] [,2] [,3] [,4] [,5] [1,] 11 0 0 0 0 [2,] 0 12 0 0 0 [3,] 0 0 13 0 0 [4,] 0 0 0 14 0 [5,] 0 0 0 0 15
Let us now verify that the trace is unchanged under cyclic permutations of the product:
all.equal(tr(A %*% B %*% C), tr(C %*% A %*% B)) [1] TRUE
all.equal(tr(C %*% A %*% B), tr(B %*% C %*% A)) [1] TRUE
Bob writes, to someone who is doing work on the Stan language:
The basic execution structure of Stan is in the JSS paper (by Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell) and in the reference manual. The details of autodiff are in the arXiv paper (by Bob Carpenter, Matt Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt). These are sort of background for what we’re trying to do.
If you haven’t read Maria Gorinova’s MS thesis and POPL paper (with Andrew Gordon and Charles Sutton), you should probably start there.
Radford Neal’s intro to HMC is nice, as is the one in David MacKay’s book. Michael Betancourt’s papers are the thing to read to understand HMC deeply—he just wrote another brain bender on geometric autodiff (all on arXiv). Starting with the one on hierarchical models would be good as it explains the necessity of reparameterizations.
Also, I recommend our JEBS paper (with Daniel Lee and Jiqiang Guo) as it presents Stan from a user’s rather than a developer’s perspective.
And, for more general background on Bayesian data analysis, we recommend Statistical Rethinking by Richard McElreath and BDA3.
Despite being on holiday I’m getting in a bit of non-work R coding since the fam has a greater ability to sleep late than I do. Apart from other things I’ve been working on a PR into {lutz}, a package by @andyteucher that turns lat/lng pairs into timezone strings.
The package is super neat and has two modes: “fast” (originally based on a {V8}-backed version of @darkskyapp’s tzlookup javascript module) and “accurate” using R’s amazing spatial ops.
I ported the javascript algorithm to C++/Rcpp and have been tweaking the bit of package helper code that fetches this:
and extracts the embedded string tree and corresponding timezones array and turns both into something C++ can use.
Originally I just made a header file with the same long lines:
but that’s icky and fairly bad form, especially given that C++ will combine adjacent string literals for you.
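To make that concrete, here is a minimal sketch (with hypothetical placeholder chunk contents) showing the compiler joining adjacent C++ string literals into a single constant:

library(Rcpp)
cppFunction('
std::string joined_literal() {
  // Adjacent string literals are concatenated at compile time, so a long
  // constant can be wrapped across source lines at no runtime cost.
  return "first-chunk"
         "-second-chunk";
}
')
joined_literal() # "first-chunk-second-chunk"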
The stringi::stri_wrap() function can easily take care of wrapping the time zone array elements for us:
but I also needed the ability to hard-wrap the encoded string tree at a fixed width. There are lots of ways to do that; here are three of them:
library(Rcpp)
library(stringi)
library(tidyverse)
library(microbenchmark)
library(hrbrthemes) # provides scale_y_comma(), scale_fill_ft(), theme_ft_rc() used below
sourceCpp(code = "
#include <Rcpp.h>

// [[Rcpp::export]]
std::vector< std::string > fold_cpp(const std::string& input, int width) {
  int sz = input.length() / width;
  std::vector< std::string > out;
  out.reserve(sz); // shld make this more efficient
  for (unsigned long idx = 0; idx < sz; idx++) {
    out.push_back(input.substr(idx * width, width));
  }
  if ((input.length() % width) != 0) {
    out.push_back(input.substr(width * sz));
  }
  return(out);
}
")

fold_base <- function(input, width) {
  vapply(
    seq(1, nchar(input), width),
    function(idx) substr(input, idx, idx + width - 1),
    FUN.VALUE = character(1)
  )
}

fold_tidy <- function(input, width) {
  map_chr(
    seq(1, nchar(input), width),
    ~stri_sub(input, .x, length = width)
  )
}
(If you know of a package that has this type of function def leave a note in the comments).
Each one does the same thing: move n sequences of width characters into a new slot in a character vector. Let’s see what they do with this toy long string example:
(src <- paste0(c(rep("a", 30), rep("b", 30), rep("c", 4)), collapse = ""))
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccc"
for (n in c(1, 7, 30, 40)) {
print(fold_base(src, n))
print(fold_tidy(src, n))
print(fold_cpp(src, n))
cat("\n")
}
## [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
## [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
## [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b"
## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c"
##
## [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
## [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"
## [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
## [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"
## [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb"
## [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c"
##
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
## [3] "cccc"
##
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"
## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb"
## [2] "bbbbbbbbbbbbbbbbbbbbcccc"
So, we know they all work, which means we can take a look at which one is faster. Let’s compare folding at various widths:
map_df(c(1, 3, 5, 7, 10, 20, 30, 40, 70), ~{
microbenchmark(
base = fold_base(src, .x),
tidy = fold_tidy(src, .x),
cpp = fold_cpp(src, .x)
) %>%
mutate(width = .x) %>%
as_tibble()
}) %>%
mutate(
width = factor(width,
levels = sort(unique(width)),
ordered = TRUE)
) -> bench_df
ggplot(bench_df, aes(expr, time)) +
ggbeeswarm::geom_quasirandom(
aes(group = width, fill = width),
groupOnX = TRUE, shape = 21, color = "white", size = 3, stroke = 0.125, alpha = 1/4
) +
scale_y_comma(trans = "log10", position = "right") +
coord_flip() +
guides(
fill = guide_legend(override.aes = list(alpha = 1))
) +
labs(
x = NULL, y = "Time (nanoseconds)",
fill = "Split width:",
title = "Performance comparison between 'fold' implementations"
) +
theme_ft_rc(grid="X") +
theme(legend.position = "top")
ggplot(bench_df, aes(width, time)) +
ggbeeswarm::geom_quasirandom(
aes(group = expr, fill = expr),
groupOnX = TRUE, shape = 21, color = "white", size = 3, stroke = 0.125, alpha = 1/4
) +
scale_x_discrete(
labels = c(1, 3, 5, 7, 10, 20, 30, 40, "Split/fold width: 70")
) +
scale_y_comma(trans = "log10", position = "right") +
scale_fill_ft() +
coord_flip() +
guides(
fill = guide_legend(override.aes = list(alpha = 1))
) +
labs(
x = NULL, y = "Time (nanoseconds)",
fill = NULL,
title = "Performance comparison between 'fold' implementations"
) +
theme_ft_rc(grid="X") +
theme(legend.position = "top")
The Rcpp version is both faster and more consistent than the other two implementations (though they all get faster as the number of string subsetting operations decreases); but they’re all pretty fast. For an infrequently run process, it might be better to use the base R version purely for simplicity. Despite that, I used the Rcpp version to turn the string tree long line into:
FIN
If you have need to “fold” like this how do you currently implement your solution? Found a bug or better way after looking at the code? Drop a note in the comments so you can help others find an optimal solution to their own ‘fold’ing problems.
I’m not saying this is a good idea, but bear with me.
A recent question on Stack Overflow [r] asked why a random forest model was not working as expected. The questioner was working with data from an experiment in which yeast was grown under conditions where (a) the growth rate could be controlled and (b) one of 6 nutrients was limited. Their dataset consisted of 6 rows – one per nutrient – and several thousand columns, with values representing the activity (expression) of yeast genes. Could the expression values be used to predict the limiting nutrient?
The random forest was not working as expected: not one of the nutrients was correctly classified. I pointed out that with only one case for each outcome, this was to be expected – as the random forest algorithm samples a proportion of the rows, no correct predictions are likely in this case. As sometimes happens the question was promptly deleted, which was unfortunate as we could have further explored the problem.
A little web searching revealed that the dataset in question is quite well-known. It’s published in an article titled Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast and has been transformed into a “tidy” format for use in tutorials, here and here.
As it turns out, there are 6 cases (rows) for each outcome (limiting nutrient), as experiments were performed at 6 different growth rates. Whilst random forests are good for “large p, small n” problems, a dataset of ~5,500 variables by 36 observations may be pushing the limit somewhat. But you know – let’s just try it anyway.
As ever, code and a report for this blog post can be found at Github.
First, we obtain the tidied Brauer dataset. But in fact we want to “untidy” it again (make wide from long) because for random forest we want observations (n) as rows and variables (p – both predicted and predictors) in columns.
library(tidyverse) library(randomForest) library(randomForestExplainer) brauer2007_tidy <- read_csv("https://4va.github.io/biodatasci/data/brauer2007_tidy.csv")
I’ll show the code for running a random forest without much explanation. I’ll assume you have a basic understanding of how the process works. That is: rows are sampled from the data and used to build many decision trees where variables predict either a continuous outcome variable (regression) or a categorical outcome (classification). The trees are then averaged (for regression values) or the majority vote is taken (for classification) to generate predictions. Individual predictors have an “importance” which is essentially some measure of how much worse the model would be were they not included.
Here we go then. A couple of notes. First, setting a seed for random sampling would not usually be used; it’s for reproducibility here. Second, unless you are specifically using the model to predict outcomes on unseen data, there’s no real need for splitting the data into test and training sets – the procedure is already performing a bootstrap by virtue of the out-of-bag error estimation.
brauer2007_tidy_rf1 <- brauer2007_tidy %>% mutate(systematic_name = gsub("-", "minus", systematic_name), nutrient = factor(nutrient)) %>% select(systematic_name, nutrient, rate, expression) %>% spread(systematic_name, expression, fill = 0) %>% randomForest(nutrient ~ ., data = ., localImp = TRUE, importance = TRUE)
The model seems to have performed quite well:
Call: randomForest(formula = nutrient ~ ., data = ., localImp = TRUE, importance = TRUE) Type of random forest: classification Number of trees: 500 No. of variables tried at each split: 74 OOB estimate of error rate: 5.56% Confusion matrix: Ammonia Glucose Leucine Phosphate Sulfate Uracil class.error Ammonia 6 0 0 0 0 0 0.0000000 Glucose 0 6 0 0 0 0 0.0000000 Leucine 0 1 5 0 0 0 0.1666667 Phosphate 0 0 0 6 0 0 0.0000000 Sulfate 0 0 0 0 6 0 0.0000000 Uracil 0 1 0 0 0 5 0.1666667
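Before plotting expression profiles, we can also glance at the importance rankings directly; the randomForest package provides varImpPlot() for this (a quick look, not in the original analysis):

varImpPlot(brauer2007_tidy_rf1, n.var = 20, main = "Top 20 genes by importance")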
Let’s look at the expression of the top 20 genes (by variable importance), with respect to growth rate and limiting nutrient.
brauer2007_tidy %>% filter(systematic_name %in% important_variables(brauer2007_tidy_rf1, k = 20)) %>% ggplot(aes(rate, expression)) + geom_line(aes(color = nutrient)) + facet_wrap(~systematic_name, ncol = 5) + scale_color_brewer(palette = "Set2")
Several of those look promising. Let’s select a gene where expression seems to be affected by each of the 6 limiting nutrients and search online resources such as the Saccharomyces Genome Database to see what’s known.
systematic_name | bp | nutrient | search_results |
---|---|---|---|
YHR208W | branched chain family amino acid biosynthesis* | leucine | Pathways – leucine biosynthesis |
YKL216W | ‘de novo’ pyrimidine base biosynthesis | uracil | URA1 – null mutant requires uracil |
YLL055W | biological process unknown | sulfate | Cysteine transporter; null mutant absent utilization of sulfur source |
YLR108C | biological process unknown | phosphate | |
YOR348C | proline catabolism* | ammonia | Proline permease; repressed in ammonia-grown cells |
YOR374W | ethanol metabolism | glucose | Aldehyde dehydrogenase; expression is glucose-repressed |
I’d say the Brauer data “makes sense” for five of those genes. Little is known about the sixth (YLR108C, affected by phosphate limitation).
In summary
Normally, a study like this would start with the genes – identify those that are differentially expressed and then think about the conditions under which differential expression was observed. Here, the process is reversed in a sense: we view the experimental condition as an outcome rather than a parameter, and ask whether it can be “predicted” by other observations.
So whilst not my first choice of method for this kind of study, and despite limited outcome data, random forest does seem to be generating some insights into which genes are affected by nutrient limitation. And at the end of the day: if a method provides insights, isn’t that what data science is for?
Today, several platforms are available in the industry that help software developers, data scientists, and even non-specialists develop and deploy machine learning models in very little time.
Ethics and OKRs, Rewriting Binaries, Diversity of Implementation, and Uber's Metrics Systems
Continue reading Four short links: 26 June 2019.
Happy summer! This week on KDnuggets: Understanding Cloud Data Services; How to select rows and columns in Pandas using [ ], .loc, iloc, .at and .iat; 7 Steps to Mastering Data Preparation for Machine Learning with Python; Examining the Transformer Architecture: The OpenAI GPT-2 Controversy; Data Literacy: Using the Socratic Method; and much more!
To successfully integrate AI and machine learning technologies, companies need to take a more holistic approach toward training their workforce.
In our recent surveys AI Adoption in the Enterprise and Machine Learning Adoption in the Enterprise, we found growing interest in AI technologies among companies across a variety of industries and geographic locations. Our findings align with other surveys and studies—in fact, a recent study by the World Intellectual Property Organization (WIPO) found that the surge in research in AI and machine learning (ML) has been accompanied by an even stronger growth in AI-related patent applications. Patents are one sign that companies are beginning to take these technologies very seriously.
When we asked what held back their adoption of AI technologies, respondents cited a few reasons, including some that pertained to culture, organization, and skills:
Implementing and incorporating AI and machine learning technologies will require retraining across an organization, not just technical teams. Recall that the rise of big data and data science necessitated a certain amount of retraining across an entire organization: technologists and analysts needed to familiarize themselves with new tools and architectures, but business experts and managers also needed to reorient their workflows to adjust to data-driven processes and data-intensive systems. AI and machine learning will require a similar holistic approach to training. Here are a few reasons why:
At our upcoming Artificial Intelligence conferences in San Jose and London, we have assembled a roster of two-day training sessions, tutorials, and presentations to help individuals (across job roles and functions) sharpen their skills and understanding of AI and machine learning. We return to San Jose with a two-day Business Summit designed specifically for executives, business leaders, and strategists. This Business Summit includes a popular two-day training—AI for Managers—and tutorials—Bringing AI into the enterprise and Design Thinking for AI—along with 12 executive briefings designed to provide in-depth overviews of important topics in AI. We are also debuting a new half-day tutorial taught by Ira Cohen (Product management in the Machine Learning era), which, given the growing importance of AI and ML, is one that every manager should consider attending.
We will also have our usual strong slate of technical training, tutorials, and talks. Here are some two-day training sessions and tutorials that I am excited about:
AI and ML are going to impact and permeate most aspects of a company’s operations, products, and services. To succeed in implementing and incorporating AI and machine learning technologies, companies need to take a more holistic approach toward retraining their workforces. This will be an ongoing endeavor as research results continue to be translated into practical systems that companies can use. Individuals will need to continue to learn new skills as technologies continue to evolve and because many areas of AI and ML are increasingly becoming democratized.
Related training and tutorial links:
Continue reading AI and machine learning will require retraining your entire organization.
• Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning
• Innovating HR Using an Expert System for Recruiting IT Specialists — ESRIT
• Deep Learning for Spatio-Temporal Data Mining: A Survey
• From Fully Supervised to Zero Shot Settings for Twitter Hashtag Recommendation
• Recurrent U-Net for Resource-Constrained Segmentation
• Power Gradient Descent
• Table-Based Neural Units: Fully Quantizing Networks for Multiply-Free Inference
• The König Graph Process
• Position-aware Graph Neural Networks
• Medium-Term Load Forecasting Using Support Vector Regression, Feature Selection, and Symbiotic Organism Search Optimization
• ADASS: Adaptive Sample Selection for Training Acceleration
• UnLimited TRAnsfers for Multi-Modal Route Planning: An Efficient Solution
• Weighted, Bipartite, or Directed Stream Graphs for the Modeling of Temporal Networks
• A Closer Look at the Optimization Landscapes of Generative Adversarial Networks
• Reinforcement Learning for Integer Programming: Learning to Cut
• On regularization for a convolutional kernel in neural networks
• Communication-Efficient Accurate Statistical Estimation
• Model Testing for Generalized Scalar-on-Function Linear Models
• Does BLEU Score Work for Code Migration?
• Non-Parametric Calibration for Classification
• Relative Hausdorff Distance for Network Analysis
• Pay Attention to Convolution Filters: Towards Fast and Accurate Fine-Grained Transfer Learning
• Neural Variational Inference For Estimating Uncertainty in Knowledge Graph Embeddings
• Real-time Attention Based Look-alike Model for Recommender System
• Secure Federated Matrix Factorization
• Kaskade: Graph Views for Efficient Graph Analytics
• Polynomial root clustering and explicit deflation
• Trip Table Estimation and Prediction for Dynamic Traffic Assignment Applications
• Generalized Langevin equations for systems with local interactions
• Efficient Graph Rewriting
• Phase-field material point method for dynamic brittle fracture with isotropic and anisotropic surface energy
• Cooper pairing of incoherent electrons: an electron-phonon version of the Sachdev-Ye-Kitaev model
• Joint 3D Localization and Classification of Space Debris using a Multispectral Rotating Point Spread Function
• Unsupervised Discovery of Gendered Language through Latent-Variable Modeling
• PerspectroScope: A Window to the World of Diverse Perspectives
• Deep 2FBSDEs for Systems with Control Multiplicative Noise
• Estimating the Number of Fatal Victims of the Peruvian Internal Armed Conflict, 1980-2000: an application of modern multi-list Capture-Recapture techniques
• The Prolog debugger and declarative programming
• Enumerating linear systems on graphs
• Deep Forward-Backward SDEs for Min-max Control
• A Systematic Comparison of English Noun Compound Representations
• Issues with post-hoc counterfactual explanations: a discussion
• Distribution-Free Multisample Test Based on Optimal Matching with Applications to Single Cell Genomics
• Generating Pareto optimal dose distributions for radiation therapy treatment planning
• Path Cohomology of Locally Finite Digraphs, Hodge’s Theorem and the $p$-Lazy Random Walk
• Second-best Beam-Alignment via Bayesian Multi-Armed Bandits
• Stability of Graph Scattering Transforms
• Stein’s method and the distribution of the product of zero mean correlated normal random variables
• Mathematical and numerical analysis of a nonlocal Drude model in nanoplasmonics
• Optimum LoRaWAN Configuration Under Wi-SUN Interference
• Active Distribution Grids offering Ancillary Services in Islanded and Grid-connected Mode
• A note on extensions of multilinear maps defined on multilinear varieties
• Suppressing Model Overfitting for Image Super-Resolution Networks
• Lyapunov Differential Equation Hierarchy and Polynomial Lyapunov Functions for Switched Linear Systems
• The EAS approach for graphical selection consistency in vector autoregression models
• Towards Inverse Reinforcement Learning for Limit Order Book Dynamics
• Generalized Beta Prime Distribution: Stochastic Model of Economic Exchange and Properties of Inequality Indices
• Weakly-supervised Compositional Feature Aggregation for Few-shot Recognition
• Relaxed random walks at scale
• Unmasking Bias in News
• Toward Best Practices for Explainable B2B Machine Learning
• Edge-Direct Visual Odometry
• Similarity Problems in High Dimensions
• A monotone data augmentation algorithm for longitudinal data analysis via multivariate skew-t, skew-normal or t distributions
• Discrepancy, Coresets, and Sketches in Machine Learning
• Task-Aware Deep Sampling for Feature Generation
• Modified log-Sobolev inequality for a compact PJMP with degenerate jumps
• Fast Trajectory Optimization via Successive Convexification for Spacecraft Rendezvous with Integer Constraints
• Homological Connectivity in Čech Complexes
• Statistical guarantees for local graph clustering
• Semi-flat minima and saddle points by embedding neural networks to overparameterization
• An ultraweak formulation of the Reissner-Mindlin plate bending model and DPG approximation
• Nearly Finitary Matroids
• Spectral Ratio for Positive Matrices
• Visual Relationships as Functions: Enabling Few-Shot Scene Graph Prediction
• Analytic-geometric methods for finite Markov chains with applications to quasi-stationarity
• Gambler’s ruin estimates on finite inner uniform domains
• Multiple instance learning with graph neural networks
• Run-Time Efficient RNN Compression for Inference on Edge Devices
• Using Small Proxy Datasets to Accelerate Hyperparameter Search
• Adaptive Navigation Scheme for Optimal Deep-Sea Localization Using Multimodal Perception Cues
• High-resolution Markov state models for the dynamics of Trp-cage miniprotein constructed over slow folding modes identified by state-free reversible VAMPnets
• Compressive Hyperspherical Energy Minimization
• Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks
• Coresets for Gaussian Mixture Models of Any Shape
• Prophet Inequalities on the Intersection of a Matroid and a Graph
• Improving Importance Weighted Auto-Encoders with Annealed Importance Sampling
• SPoC: Search-based Pseudocode to Code
• All-Weather Deep Outdoor Lighting Estimation
• Bilateral Boundary Control Design for a Cascaded Diffusion-ODE System Coupled at an Arbitrary Interior Point
• Lifestate: Event-Driven Protocols and Callback Control Flow
• Transferrable Operative Difficulty Assessment in Robot-assisted Teleoperation: A Domain Adaptation Approach
• Partial Or Complete, That’s The Question
• CogCompTime: A Tool for Understanding Time in Natural Language Text
• Joint Reasoning for Temporal and Causal Relations
• A Structured Learning Approach to Temporal Relation Extraction
• Semi-Supervised Exploration in Image Retrieval
• A Stratified Approach to Robustness for Randomly Smoothed Classifiers
• Hand Orientation Estimation in Probability Density Form
• Towards Geocoding Spatial Expressions
• Synthesizing Diverse Lung Nodules Wherever Massively: 3D Multi-Conditional GAN-based CT Image Augmentation for Object Detection
• Good Stabilizer Codes from Quasi-Cyclic Codes over $\mathbb{F}_4$ and $\mathbb{F}_9$
• Spectral Bounds for Quasi-Twisted Codes
• A hierarchical Lyapunov-based cascade adaptive control scheme for lower-limb exoskeleton
• Polynomially growing harmonic functions on connected groups
• CDPM: Convolutional Deformable Part Models for Person Re-identification
• DeepSquare: Boosting the Learning Power of Deep Convolutional Neural Networks with Elementwise Square Operators
• Unsupervised Question Answering by Cloze Translation
• Indoor image representation by high-level semantic features
• Incremental Learning from Scratch for Task-Oriented Dialogue Systems
• Structure learning of Bayesian networks involving cyclic structures
• Energy Efficient Massive MIMO Array Configurations
• Convergence of partial sum processes to stable processes with application for aggregation of branching processes
• Adversarial Learning of Privacy-Preserving Text Representations for De-Identification of Medical Records
• On Universal Codes for Integers: Wallace Tree, Elias Omega and Variations
• Approximating the Orthogonality Dimension of Graphs and Hypergraphs
• Adaptive Resource Management for a Virtualized Computing Platform within Edge Computing
• BiSET: Bi-directional Selective Encoding with Template for Abstractive Summarization
• Assuring the Evolvability of Microservices: Insights into Industry Practices and Challenges
• Structure-adaptive manifold estimation
• Deep Reinforcement Learning for Unmanned Aerial Vehicle-Assisted Vehicular Networks
• Two-stage Stochastic Lot-sizing Problem with Chance-constrained Condition in the Second Stage
• Graph Embedding on Biomedical Networks: Methods, Applications, and Evaluations
• Checkpoint/restart approaches for a thread-based MPI runtime
• Towards Big data processing in IoT: Path Planning and Resource Management of UAV Base Stations in Mobile-Edge Computing System
• Who Will Win It? An In-game Win Probability Model for Football
• Fast Task Inference with Variational Intrinsic Successor Features
• Decoupling Gating from Linearity
• Desingularization of matrix equations employing hypersingular integrals in boundary element methods using double nodes
• Activated Random Walks on $\mathbb{Z}^d$
• Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets
• Concept Discovery through Information Extraction in Restaurant Domain
• On the WalkerMaker-WalkerBreaker games
• Torus computed tomography
• Selecting stock pairs for pairs trading while incorporating lead-lag relationship
• Higher-Order Ranking and Link Prediction: From Closing Triangles to Closing Higher-Order Motifs
• Probing Multilingual Sentence Representations With X-Probe
• Unified Semantic Parsing with Weak Supervision
• Power-law Verification for Event Detection at Multi-spatial Scales from Geo-tagged Tweet Streams
• Deep Smoothing of the Implied Volatility Surface
• Polynomial-time Updates of Epistemic States in a Fragment of Probabilistic Epistemic Argumentation (Technical Report)
• Fault-Tolerant Path-Embedding of Twisted Hypercube-Like Networks THLNs
• De Finetti’s control problem with Parisian ruin for spectrally negative Lévy processes
• Leveraging Labeled and Unlabeled Data for Consistent Fair Binary Classification
• Reinforcement-Learning-based Adaptive Optimal Control for Arbitrary Reference Tracking
• Applying economic measures to lapse risk management with machine learning approaches
• Broadcasts on Paths and Cycles
• Migrating large codebases to C++ Modules
• Optimizing city-scale traffic flows through modeling isolated observations of vehicle movements
• Knowledge Gradient for Selection with Covariates: Consistency and Computation
• Odd cycles in subgraphs of sparse pseudorandom graphs
• On the Universal Near-Shortest Simple Paths Problem
• Pose from Shape: Deep Pose Estimation for Arbitrary 3D Objects
• Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function
• Model Predictive Control, Cost Controllability, and Homogeneity
• A Survey of Autonomous Driving: Common Practices and Emerging Technologies
• Convergence of second-order, entropy stable methods for multi-dimensional conservation laws
• LED2Net: Deep Illumination-aware Dehazing with Low-light and Detail Enhancement
• Biased random k-SAT
• High Accuracy Classification of White Blood Cells using TSLDA Classifier and Covariance Features
• Handel: Practical Multi-Signature Aggregation for Large Byzantine Committees
• Markov-modulated continuous-time Markov chains to identify site- and branch-specific evolutionary variation
• Next-to$^k$ leading log expansions by chord diagrams
• A second order analysis of McKean-Vlasov semigroups
• Sharp thresholds for nonlinear Hamiltonian cycles in hypergraphs
• A decentralized trust-aware collaborative filtering recommender system based on weighted items for social tagging systems
• Recognizing Manipulation Actions from State-Transformations
• Putting words in context: LSTM language models and lexical ambiguity
• Exploring Bayesian approaches to eQTL mapping through probabilistic programming
• Collaborative Broadcast in O(log log n) Rounds
• Asymptotic approach for backward stochastic differential equation with singular terminal condition
• Evaluation of Dataflow through layers of Deep Neural Networks in Classification and Regression Problems
• Learning High-Dimensional Gaussian Graphical Models under Total Positivity without Tuning Parameters
• General Video Game Rule Generation
• Stereoscopic Omnidirectional Image Quality Assessment Based on Predictive Coding Theory
• Attention-based Multi-Input Deep Learning Architecture for Biological Activity Prediction: An Application in EGFR Inhibitors
• A Passivity Enforcement Technique for Forced Oscillation Source Location
• DCEF: Deep Collaborative Encoder Framework for Unsupervised Clustering
• UAV Swarms as Amplify-and-Forward MIMO Relays
• Empowering Quality Diversity in Dungeon Design with Interactive Constrained MAP-Elites
• Small-Support Uncertainty Principles on $\mathbb{Z}/p$ over Finite Fields
AI technology has a long history and is actively and constantly changing and growing. It focuses on intelligent agents: devices that perceive their environment and take actions to maximize the chance of achieving a goal. In this paper, we explain the basics of modern AI and various representative applications of AI. In the context of the modern digitalized world, AI is the property of machines, computer programs, and systems to perform the intellectual and creative functions of a person, independently find ways to solve problems, draw conclusions, and make decisions. Most artificial intelligence systems have the ability to learn, which allows them to improve their performance over time. Recent research on AI tools, including machine learning, deep learning, and predictive analysis, is aimed at increasing planning, learning, reasoning, thinking, and action-taking ability. On this basis, the proposed research explores how human intelligence differs from artificial intelligence. Moreover, we critically analyze what today’s AI is capable of doing, why it still cannot reach human intelligence, and what open challenges stand in front of AI before it can reach and outperform the human level of intelligence. Furthermore, it explores future predictions for artificial intelligence, on the basis of which potential solutions will be recommended for the next decades. Artificial Intelligence and its Role in Near Future
“You should embrace the Bayesian approach.” Kamil Bartocha ( 26. Apr 2015 )
“The uninspected (inevitably) deteriorates.” Dwight David Eisenhower
** Nuit Blanche is now on Twitter: @NuitBlog **
Recent advances in neural architecture search (NAS) demand tremendous computational resources, which makes it difficult to reproduce experiments and imposes a barrier-to-entry to researchers without access to large-scale computation. We aim to ameliorate these problems by introducing NAS-Bench-101, the first public architecture dataset for NAS research. To build NAS-Bench-101, we carefully constructed a compact, yet expressive, search space, exploiting graph isomorphisms to identify 423k unique convolutional architectures. We trained and evaluated all of these architectures multiple times on CIFAR-10 and compiled the results into a large dataset of over 5 million trained models. This allows researchers to evaluate the quality of a diverse range of models in milliseconds by querying the pre-computed dataset. We demonstrate its utility by analyzing the dataset as a whole and by benchmarking a range of architecture optimization algorithms.
Human-computer Interaction (HCI) is an interdisciplinary research field involving multiple disciplines, such as computer science, psychology, social science and design. It studies the interaction between users and computer in order to better design technologies and solve real-life problems. This position paper characterizes HCI research in China by comparing it with international HCI research traditions. We discuss the current streams and methodologies of Chinese HCI research. We then propose future HCI research directions such as including emergent users who have less access to technology and addressing the cultural dimensions in order to provide better technical solutions and support. Characterizing HCI Research in China: Streams, Methodologies and Future Directions
Easy Computation of Marketing Metrics with Different Analysis Axis (mmetrics)
Provides a mechanism for easy computation of marketing metrics. By default in this package, metrics for digital marketing (e.g. CTR (Click Through Rate …
D3 Dynamic Cluster Visualizations (klustR)
Used to create dynamic, interactive ‘D3.js’ based parallel coordinates and principal component plots in ‘R’. The plots make visualizing k-means or othe …
Store and Retrieve Data.frames in a Git Repository (git2rdata)
Make versioning of data.frame easy and efficient using git repositories.
Uplift Model Evaluation with Plots and Metrics (uplifteval)
Provides a variety of plots and metrics to evaluate uplift models including the ‘R uplift’ package’s Qini metric and Qini plot, a port of the ‘python p …
• Sharing of vulnerability information among companies — a survey of Swedish companies
• Evolution of ROOT package management
• Microservices Migration in Industry: Intentions, Strategies, and Challenges
• Hierarchical Taxonomy-Aware and Attentional Graph Capsule RCNNs for Large-Scale Multi-Label Text Classification
• A Document-grounded Matching Network for Response Selection in Retrieval-based Chatbots
• Incremental Classifier Learning Based on PEDCC-Loss and Cosine Distance
• Coupled Variational Recurrent Collaborative Filtering
• Online Learning and Planning in Partially Observable Domains without Prior Knowledge
• Improving Reproducible Deep Learning Workflows with DeepDIVA
• Self-Supervised Learning for Contextualized Extractive Summarization
• Modeling the Past and Future Contexts for Session-based Recommendation
• Unsupervised Minimax: Adversarial Curiosity, Generative Adversarial Networks, and Predictability Minimization
• The snippets taxonomy in web search engines
• Modeling Sentiment Dependencies with Graph Convolutional Networks for Aspect-level Sentiment Classification
• Simultaneously Learning Architectures and Features of Deep Neural Networks
• Learning robust visual representations using data augmentation invariance
• Anomaly Detection in High Performance Computers: A Vicinity Perspective
• On Stabilizing Generative Adversarial Training with Noise
• Faster Algorithms for High-Dimensional Robust Covariance Estimation
• Learning Selection Masks for Deep Neural Networks
• Fast and Accurate Least-Mean-Squares Solvers
• Graph Convolutional Transformer: Learning the Graphical Structure of Electronic Health Records
• Large Scale Structure of Neural Network Loss Landscapes
• What Kind of Language Is Hard to Language-Model?
• eSLAM: An Energy-Efficient Accelerator for Real-Time ORB-SLAM on FPGA Platform
• Characterising hyperbolic hyperplanes of a non-singular quadric in $PG(4,q)$
• Customizing Pareto Simulated Annealing for Multi-objective Optimization of Control Cabinet Layout
• Advertising in an oligopoly with differentiated goods under general demand and cost functions: A differential game approach
• A Combination of Temporal Sequence Learning and Data Description for Anomaly-based NIDS
• 274-GHz CMOS Signal Generator with an On-Chip Patch Antenna in a QFN Package
• The Performance Of Convolutional Coding Based Cooperative Communication: Relay Position And Power Allocation Analysis
• Preferred Design of Hierarchical Distribution Matching
• A Unified Definition and Computation of Laplacian Spectral Distances
• Convergence analysis of a Crank-Nicolson Galerkin method for an inverse source problem for parabolic systems with boundary observations
• DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
• Low Rank Approximation Directed by Leverage Scores and Computed at Sub-linear Cost
• Inferring 3D Shapes from Image Collections using Adversarial Networks
• DeepcomplexMRI: Exploiting deep residual network for fast parallel MR imaging with complex convolution
• Hybrid Function Sparse Representation towards Image Super Resolution
• Representation Learning-Assisted Click-Through Rate Prediction
• Evaluation of Seed Set Selection Approaches and Active Learning Strategies in Predictive Coding
• Beam Learning — Using Machine Learning for Finding Beam Directions
• Maximum Mean Discrepancy Gradient Flow
• Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning
• Recognizing License Plates in Real-Time
• PAN: Projective Adversarial Network for Medical Image Segmentation
• Band Attention Convolutional Networks For Hyperspectral Image Classification
• Window Based BFT Blockchain Consensus
• DoubleTransfer at MEDIQA 2019: Multi-Source Transfer Learning for Natural Language Understanding in the Medical Domain
• Indecomposable $0$-Hecke modules for extended Schur functions
• Different Approaches for Human Activity Recognition: A Survey
• Approximate Gradient Descent Convergence Dynamics for Adaptive Control on Heterogeneous Networks
• Schwinger-Dyson and loop equations for a product of square Ginibre random matrices
• A Method to construct all the Paving Matroids over a Finite Set
• Subspace Attack: Exploiting Promising Subspaces for Query-Efficient Black-box Attacks
• Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks
• Numerical computations of split Bregman method for fourth order total variation flow
• Detection and estimation of parameters in high dimensional multiple change point regression models via $\ell_1/\ell_0$ regularization and discrete optimization
• Probabilistic Forecasting with Temporal Convolutional Neural Network
• An approximate Bayesian approach to regression estimation with many auxiliary variables
• Symmetric multisets of permutations
• Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval
• Classification of EEG Signals using Genetic Programming for Feature Construction
• iProStruct2D: Identifying protein structural classes by deep learning via 2D representations
• Few-Shot Point Cloud Region Annotation with Human in the Loop
• Quantum Random Numbers generated by the Cloud Superconducting Quantum Computer
• Evolutionary Trigger Set Generation for DNN Black-Box Watermarking
• Learning a Matching Model with Co-teaching for Multi-turn Response Selection in Retrieval-based Dialogue Systems
• Vulnerabilities of Power System Operations to Load Forecasting Data Injection Attacks
• Deep learning analysis of cardiac CT angiography for detection of coronary arteries with functionally significant stenosis
• NAS-FCOS: Fast Neural Architecture Search for Object Detection
• Macro-action Multi-timescale Dynamic Programming for Energy Management with Phase Change Materials
• Behavioral Switching Loss Modeling of Inverter Modules
• Upper envelopes of families of Feller semigroups and viscosity solutions to a class of nonlinear Cauchy problems
• On the Vector Space in Photoplethysmography Imaging
• On the Universality of Noiseless Linear Estimation with Respect to the Measurement Matrix
• Metrics for Learning in Topological Persistence
• New loop expansion for the Random Magnetic Field Ising Ferromagnets at zero temperature
• Survey of Artificial Intelligence for Card Games and Its Application to the Swiss Game Jass
• Orthogonal Cocktail BPSK: Exceeding Shannon Capacity of QPSK Input
• A Novel Cost Function for Despeckling using Convolutional Neural Networks
• Single Image Blind Deblurring Using Multi-Scale Latent Structure Prior
• Elephant random walks with delays
• Bag of Color Features For Color Constancy
• Reinforcement Learning of Minimalist Numeral Grammars
• Reinforcement Learning for Channel Coding: Learned Bit-Flipping Decoding
• Almost Optimal Semi-streaming Maximization for k-Extendible Systems
• Quantifying Intrinsic Uncertainty in Classification via Deep Dirichlet Mixture Networks
• Compressed Sensing MRI via a Multi-scale Dilated Residual Convolution Network
• Continual Reinforcement Learning deployed in Real-life using Policy Distillation and Sim2Real Transfer
• Classification of Radio Signals and HF Transmission Modes with Deep Learning
• TW-SMNet: Deep Multitask Learning of Tele-Wide Stereo Matching
• Cross-Modal Relationship Inference for Grounding Referring Expressions
• Rearrangement operations on unrooted phylogenetic networks
• Rate-Splitting Unifying SDMA, OMA, NOMA, and Multicasting in MISO Broadcast Channel: A Simple Two-User Rate Analysis
• Causal Discovery with Reinforcement Learning
• Efficient structure learning with automatic sparsity selection for causal graph processes
• EXmatcher: Combining Features Based on Reference Strings and Segments to Enhance Citation Matching
• Optimizing Pipelined Computation and Communication for Latency-Constrained Edge Learning
• Two-dimensional partial cubes
• Competing (Semi)-Selfish Miners in Bitcoin
• Approximate Variational Inference Based on a Finite Sample of Gaussian Latent Variables
• BasisConv: A method for compressed representation and learning in CNNs
• Asymptotic Guarantees for Learning Generative Models with the Sliced-Wasserstein Distance
• Off-equilibrium computation of the dynamic critical exponent of the three-dimensional Heisenberg model
• A Linear Algorithm for Minimum Dominator Colorings of Orientations of Paths
• LocLets: Localized Graph Wavelets for Processing Frequency Sparse Signals on Graphs
• Three product formulas for ratios of tiling counts of hexagons with collinear holes
• Identities involving Schur functions and their applications to a shuffling theorem
• WikiDataSets : Standardized sub-graphs from WikiData
• Translation hyperovals and $\mathbb{F}_2$-linear sets of pseudoregulus type
• Statistical Species Identification
• A Graph-theoretic Method to Define any Boolean Operation on Partitions
• A refined primal-dual analysis of the implicit bias
• Fast Rates for a kNN Classifier Robust to Unknown Asymmetric Label Noise
• Principled Training of Neural Networks with Direct Feedback Alignment
• Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology
• Challenges in Time-Stamp Aware Anomaly Detection in Traffic Videos
• k-Nearest Neighbor Optimization via Randomized Hyperstructure Convex Hull
• Joint Subspace Recovery and Enhanced Locality Driven Robust Flexible Discriminative Dictionary Learning
• A proof of the mean-field limit for $λ$-convex potentials by $Γ$-Convergence
• Mimic and Fool: A Task Agnostic Adversarial Attack
• Monte Carlo and Quasi-Monte Carlo Density Estimation via Conditioning
• Dynamical Anatomy of NARMA10 Benchmark Task
• Dual-Band Fading Multiple Access Relay Channels
• Adaptive Neural Signal Detection for Massive MIMO
• New dynamic and verifiable multi-secret sharing schemes based on LFSR public key cryptosystem
• Regional economic convergence and spatial quantile regression
• Retrieve, Read, Rerank: Towards End-to-End Multi-Document Reading Comprehension
• Achieving competitive advantage in academia through early career coauthorship with top scientists
• An explicit characterization of arc-transitive circulants
• ROOT I/O compression algorithms and their performance impact within Run 3
• Bias-Aware Inference in Fuzzy Regression Discontinuity Designs
• Scale Invariant Fully Convolutional Network: Detecting Hands Efficiently
• Hierarchical multiscale finite element method for multi-continuum media
• Distance Matrix of a Class of Completely Positive Graphs: Determinant and Inverse
• Creation of User Friendly Datasets: Insights from a Case Study concerning Explanations of Loan Denials
• Learning Symmetries of Classical Integrable Systems
• A Proximal Point Dual Newton Algorithm for Solving Group Graphical Lasso Problems
• Analysis of Optimization Algorithms via Sum-of-Squares
• `Project & Excite’ Modules for Segmentation of Volumetric Medical Scans
• Gated CRF Loss for Weakly Supervised Semantic Image Segmentation
• Journal Name Extraction from Japanese Scientific News Articles
• Deep learning control of artificial avatars in group coordination tasks
• Mesh adaptivity for quasi-static phase-field fractures based on a residual-type a posteriori error estimator
• Residual estimates for post-processors in elliptic problems
• Stable Rank Normalization for Improved Generalization in Neural Networks and GANs
• Two-step Constructive Approaches for Dungeon Generation
• Control contribution identifies top driver nodes in complex networks
• Extracting Interpretable Concept-Based Decision Trees from CNNs
• Characterization and valuation of uncertainty of calibrated parameters in stochastic decision models
• Matricial characterization of tournaments with maximum number of diamonds
• Areas of triangles and SL_2 actions in finite rings
• A Taxonomy of Channel Pruning Signals in CNNs
• StRE: Self Attentive Edit Quality Prediction in Wikipedia
• Data-Driven Model Predictive Control with Stability and Robustness Guarantees
• Efficient Kernel-based Subsequence Search for User Identification from Walking Activity
• A Hybrid Approach Between Adversarial Generative Networks and Actor-Critic Policy Gradient for Low Rate High-Resolution Image Compression
• Stability and Metastability of Traffic Dynamics in Uplink Random Access Networks
• Inter-sentence Relation Extraction with Document-level Graph Convolutional Neural Network
• Generating Summaries with Topic Templates and Structured Convolutional Decoders
• An Improved Analysis of Training Over-parameterized Deep Neural Networks
• Hybrid Nonlinear Observers for Inertial Navigation Using Landmark Measurements
• On Single Source Robustness in Deep Fusion Models
• Rethinking Person Re-Identification with Confidence
• Variance-reduced $Q$-learning is minimax optimal
• Causal Inference in Higher Education: Building Better Curriculums
• HEAD-QA: A Healthcare Dataset for Complex Reasoning
• Membership-based Manoeuvre Negotiation in Autonomous and Safety-critical Vehicular Systems
• Generative adversarial network for segmentation of motion affected neonatal brain MRI
• Using Structured Representation and Data: A Hybrid Model for Negation and Sentiment in Customer Service Conversations
• Communication and Memory Efficient Testing of Discrete Distributions
• ProPublica’s COMPAS Data Revisited
• Automatic brain tissue segmentation in fetal MRI using convolutional neural networks
• 3-D Surface Segmentation Meets Conditional Random Fields
• Asymptotic analysis of exit time for dynamical systems with a single well potential
• The $h^*$-polynomials of locally anti-blocking lattice polytopes and their $γ$-positivity
• Data-Free Quantization through Weight Equalization and Bias Correction
• Clouds of Oriented Gradients for 3D Detection of Objects, Surfaces, and Indoor Scene Layouts
• Shapes and Context: In-the-Wild Image Synthesis & Manipulation
• A Document-grounded Matching Network for Response Selection in Retrieval-based Chatbots
• Incremental Classifier Learning Based on PEDCC-Loss and Cosine Distance
• Coupled Variational Recurrent Collaborative Filtering
• Online Learning and Planning in Partially Observable Domains without Prior Knowledge
• Improving Reproducible Deep Learning Workflows with DeepDIVA
• Self-Supervised Learning for Contextualized Extractive Summarization
• Modeling the Past and Future Contexts for Session-based Recommendation
• The snippets taxonomy in web search engines
• Simultaneously Learning Architectures and Features of Deep Neural Networks
• Learning robust visual representations using data augmentation invariance
• Anomaly Detection in High Performance Computers: A Vicinity Perspective
• On Stabilizing Generative Adversarial Training with Noise
• Faster Algorithms for High-Dimensional Robust Covariance Estimation
• Learning Selection Masks for Deep Neural Networks
• Fast and Accurate Least-Mean-Squares Solvers
• Graph Convolutional Transformer: Learning the Graphical Structure of Electronic Health Records
• Large Scale Structure of Neural Network Loss Landscapes
• What Kind of Language Is Hard to Language-Model?
Emotion dynamics is the study of how emotions change over time. Sometimes our feelings are quite stable, but other times capricious. Measuring and predicting these patterns for different people is somewhat of a Holy Grail for emotion researchers. In particular, some researchers are aspiring to discover mathematical laws that capture the complexity of our inner emotional experiences – much like physicists divining the laws that govern objects in the natural environment. These discoveries would revolutionize our understanding of our everyday feelings and when our emotions can go awry.
This series of blog posts, which I kicked off earlier this month with a simulation of emotions during basketball games, is inspired by researchers like Peter Kuppens and Tom Hollenstein (to name a few) who have collected and analyzed reams of intensive self-reports on people’s feelings from one moment to the next. My approach is to reverse engineer these insights and generate models that simulate emotions evolving over time – like this:
We start with the affective state space – the theoretical landscape on which our conscious feelings roam free. This space is represented as two-dimensional, although we acknowledge that this fails to capture all aspects of conscious feeling. The first dimension, represented along the x-axis, is valence and this refers to how unpleasant vs. pleasant we feel. The second dimension, represented along the y-axis, is arousal. Somewhat less intuitive, arousal refers to how deactivated/sluggish/sleepy vs. activated/energized/awake we feel. At any time, our emotional state can be defined in terms of valence and arousal. So if you’re feeling stressed you would be low in valence and high in arousal. Let’s say you’re serene and calm, then you would be high in valence and low in arousal. Most of the time, we feel moderately high valence and moderate arousal (i.e., content), but if you’re the type of person who is chronically stressed, this would be different.
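To make the two dimensions concrete, here is how a few of the states just mentioned might sit on the 0-to-6 valence/arousal scale used in the simulation below (the exact coordinates are illustrative, not estimates from any dataset):
# Illustrative coordinates on the 0-6 scale used in the simulation
example_states <- data.frame(
  state   = c("stressed", "serene", "content"),
  valence = c(1, 5, 4),  # unpleasant -> pleasant
  arousal = c(5, 1, 3)   # deactivated -> activated
)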
This is all well and good when we think about how we’re feeling right now, but it’s also worth considering how our emotions are changing. On a regular day, our emotions undergo minor fluctuations – sometimes in response to minor hassles or victories, and sometimes for no discernible reason. In this small paragraph, I’ve laid out a number of parameters, all of which vary between different people: each person has a baseline attractor that their emotions drift back toward, an instability that governs how frequently their emotions fluctuate, and a dispersion that governs how large the peaks and troughs are.
We’ll keep all of this in mind for the simulation. We’ll start with a fairly simple simulation with 100 hypothetical people. We’ll need the following packages.
library(psych)
library(tidyverse)
library(sn)
And then we’ll create a function that performs the simulation. Note that each person i has their own attractor, recovery rate, instability, and dispersion. For now we’ll just model random fluctuations in emotions, a sort of Brownian motion. You can imagine our little simulatons (a fun name for the hypothetical people in the simulation) sitting around on an average day doing nothing in particular.
simulate_affect <- function(n = 2, time = 250, negative_event_time = NULL) {
  dt <- data.frame(matrix(nrow = time, ncol = 1))
  colnames(dt) <- "time"
  dt$time <- 1:time
  valence <- data.frame(matrix(nrow = time, ncol = 0))
  arousal <- data.frame(matrix(nrow = time, ncol = 0))
  for (i in 1:n) {
    attractor_v <- rnorm(1, mean = 3.35, sd = .75)
    instability_v <- sample(3:12, 1, replace = TRUE,
                            prob = c(.18, .22, .18, .15, .08, .06, .05, .04, .02, .01))
    # rsn() (from the sn package) simulates draws from a skewed distribution
    dispersion_v <- abs(rsn(1, xi = .15, omega = .02, alpha = -6) * instability_v)
    if (!is.null(negative_event_time)) {
      recovery_rate <- sample(1:50, 1, replace = TRUE) + negative_event_time
      negative_event <- (dt$time %in% negative_event_time:recovery_rate) * seq.int(50, 1, -1)
    } else {
      negative_event <- 0
    }
    # instability is modelled in the bandwidth term of ksmooth: higher instability
    # gives a smaller bandwidth, i.e. less smoothing and greater fluctuation.
    # dispersion scales the simulated (arima) noise series, producing higher
    # peaks and troughs at higher dispersion.
    valence[[i]] <- ksmooth(x = dt$time,
                            y = (negative_event * -.10) +
                              arima.sim(list(order = c(1, 0, 0), ar = .50), n = time),
                            bandwidth = time / instability_v,
                            kernel = "normal")$y * dispersion_v + attractor_v
    # the arousal attractor depends on instability, because high instability is
    # associated with higher-arousal states
    attractor_a <- rnorm(1, mean = .50, sd = .75) + sqrt(instability_v)
    instability_a <- instability_v + sample(-1:1, 1, replace = TRUE)
    dispersion_a <- abs(rsn(1, xi = .15, omega = .02, alpha = -6) * instability_a)
    arousal[[i]] <- ksmooth(x = dt$time,
                            y = (negative_event * .075) +
                              arima.sim(list(order = c(1, 0, 0), ar = .50), n = time),
                            bandwidth = time / instability_a,
                            kernel = "normal")$y * dispersion_a + attractor_a
  }
  # clamp both dimensions to the 0-6 scale
  valence[valence > 6] <- 6
  valence[valence < 0] <- 0
  arousal[arousal > 6] <- 6
  arousal[arousal < 0] <- 0
  colnames(valence) <- paste0("valence_", 1:n)
  colnames(arousal) <- paste0("arousal_", 1:n)
  dt <- cbind(dt, valence, arousal)
  return(dt)
}
set.seed(190625)
emotions <- simulate_affect(n = 100, time = 300)
emotions %>%
select(valence_1, arousal_1) %>%
head()
## valence_1 arousal_1
## 1 1.328024 5.380643
## 2 1.365657 5.385633
## 3 1.401849 5.390470
## 4 1.436284 5.395051
## 5 1.468765 5.399162
## 6 1.499062 5.402752
So we see the first six rows for participant 1’s valence and arousal. But if we want to plot these across multiple simulatons, we need to wrangle the data into long form. We’ll also compute some measures of within-person variability. The Root Mean Square Successive Difference (RMSSD) captures moment-to-moment change while remaining relatively insensitive to gradual shifts in the mean; people who are more emotionally unstable will have a higher RMSSD. Since we have two dimensions (valence and arousal), we’ll combine the valence and arousal RMSSDs into a single total instability score.
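For a single series, RMSSD is just the square root of the mean squared difference between successive observations. Here is a minimal sketch of the computation, which should agree with the rmssd() function from the psych package (up to its handling of missing values):
# Root Mean Square Successive Difference: sqrt(mean((x[t+1] - x[t])^2))
rmssd_manual <- function(x) sqrt(mean(diff(x)^2, na.rm = TRUE))
rmssd_manual(c(1, 2, 1, 2, 1))  # oscillating series: RMSSD = 1
rmssd_manual(rep(3, 5))         # flat series: RMSSD = 0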
emotions_long <- emotions %>%
gather(key, value, -time) %>%
separate(key, into = c("dimension", "person"), sep = "_") %>%
spread(dimension, value) %>%
group_by(person) %>%
mutate(rmssd_v = rmssd(valence),
rmssd_a = rmssd(arousal),
rmssd_total = mean(rmssd_v + rmssd_a)) %>%
ungroup()
Let’s see what this looks like for valence and arousal individually.
emotions_long %>%
ggplot(aes(x = time, y = valence, group = person, color = rmssd_v)) +
geom_line(size = .75, alpha = .75) +
scale_color_gradient2(low = "black", mid = "grey", high = "red", midpoint = median(emotions_long$rmssd_v)) +
labs(x = "Time",
y = "Valence",
color = "Instability",
title = "Simulated Valence Scores over Time for 100 People") +
theme_minimal(base_size = 16)
emotions_long %>%
ggplot(aes(x = time, y = arousal, group = person, color = rmssd_a)) +
geom_line(size = .75, alpha = .75) +
scale_color_gradient2(low = "black", mid = "grey", high = "red", midpoint = median(emotions_long$rmssd_a)) +
labs(x = "Time",
y = "Arousal",
color = "Instability",
title = "Simulated Arousal Scores over Time for 100 People") +
theme_minimal(base_size = 16)
We see that some lines are fairly flat and others fluctuate more widely. More importantly, most people are somewhere in the middle.
We can also get a sense of individual simulated people’s affective state spaces. The goal here is to mimic the kinds of models shown in Kuppens, Oravecz, and Tuerlinckx (2010):
emotions_long %>%
filter(person %in% sample(1:100, 6, replace = FALSE)) %>%
ggplot(aes(x = valence, y = arousal, group = person)) +
geom_path(size = .75) +
scale_x_continuous(limits = c(0, 6)) +
scale_y_continuous(limits = c(0, 6)) +
labs(x = "Valence",
y = "Arousal",
title = "Affective State Space for Six Randomly Simulated People") +
facet_wrap(~person) +
theme_minimal(base_size = 18) +
theme(plot.title = element_text(size = 18, hjust = .5))
To really appreciate what’s going on, we need to animate this over time. I’ll add some labels to the affective state space so that it’s easier to interpret what one might be feeling at that time. I’ll also add color to show which individuals are more unstable according to RMSSD.
library(gganimate)
p <- emotions_long %>%
ggplot(aes(x = valence, y = arousal, color = rmssd_total)) +
annotate("text", x = c(1.5, 4.5, 1.5, 4.5), y = c(1.5, 1.5, 4.5, 4.5), label = c("Gloomy", "Calm", "Anxious", "Happy"),
size = 10, alpha = .50) +
annotate("rect", xmin = 0, xmax = 3, ymin = 0, ymax = 3, alpha = 0.25, color = "black", fill = "white") +
annotate("rect", xmin = 3, xmax = 6, ymin = 0, ymax = 3, alpha = 0.25, color = "black", fill = "white") +
annotate("rect", xmin = 0, xmax = 3, ymin = 3, ymax = 6, alpha = 0.25, color = "black", fill = "white") +
annotate("rect", xmin = 3, xmax = 6, ymin = 3, ymax = 6, alpha = 0.25, color = "black", fill = "white") +
geom_point(size = 3.5) +
scale_color_gradient2(low = "black", mid = "grey", high = "red", midpoint = median(emotions_long$rmssd_total)) +
scale_x_continuous(limits = c(0, 6)) +
scale_y_continuous(limits = c(0, 6)) +
labs(x = "Valence",
y = "Arousal",
color = "Instability",
title = 'Time: {round(frame_time)}') +
transition_time(time) +
theme_minimal(base_size = 18)
ani_p <- animate(p, nframes = 320, end_pause = 20, fps = 16, width = 550, height = 500)
ani_p
Our simulation does a pretty good job at emulating the natural ebb and flow of emotions, but we know that emotions can be far more volatile. Let’s subject our simulation to a negative event. Perhaps all 100 simulatons co-authored a paper that just got rejected. In the function simulate_affect, there’s an optional argument negative_event_time that causes a negative event to occur at the specified time. For this, we need to consider one more emotion dynamics parameter: the recovery rate, which governs how quickly a person’s emotions return to their baseline attractor after the event.
So we’ll run the simulation with a negative event arising at t = 150. The negative event will cause a downward spike in valence and an upward spike in arousal.
emotions_event <- simulate_affect(n = 100, time = 300, negative_event_time = 150)
emotions_event_long <- emotions_event %>%
gather(key, value, -time) %>%
separate(key, into = c("dimension", "person"), sep = "_") %>%
spread(dimension, value) %>%
group_by(person) %>%
mutate(rmssd_v = rmssd(valence),
rmssd_a = rmssd(arousal),
rmssd_total = mean(rmssd_v + rmssd_a)) %>%
ungroup()
emotions_event_long %>%
ggplot(aes(x = time, y = valence, group = person, color = rmssd_v)) +
geom_line(size = .75, alpha = .75) +
scale_color_gradient2(low = "black", mid = "grey", high = "red", midpoint = median(emotions_event_long$rmssd_v)) +
labs(x = "Time",
y = "Valence",
color = "Instability",
title = "Simulated Valence Scores over Time for 100 People") +
theme_minimal(base_size = 16)
emotions_event_long %>%
ggplot(aes(x = time, y = arousal, group = person, color = rmssd_a)) +
geom_line(size = .75, alpha = .75) +
scale_color_gradient2(low = "black", mid = "grey", high = "red", midpoint = median(emotions_event_long$rmssd_a)) +
labs(x = "Time",
y = "Arousal",
color = "Instability",
title = "Simulated Arousal Scores over Time for 100 People") +
theme_minimal(base_size = 16)
It’s pretty clear that something bad happened. Of course, some of our simulatons are unflappable, but most experienced a drop in valence and a spike in arousal that we might identify as anxiety. Again, let’s visualize this evolving over time. Pay close attention to when the timer hits 150.
p2 <- emotions_event_long %>%
ggplot(aes(x = valence, y = arousal, color = rmssd_total)) +
annotate("text", x = c(1.5, 4.5, 1.5, 4.5), y = c(1.5, 1.5, 4.5, 4.5), label = c("Gloomy", "Calm", "Anxious", "Happy"),
size = 10, alpha = .50) +
annotate("rect", xmin = 0, xmax = 3, ymin = 0, ymax = 3, alpha = 0.25, color = "black", fill = "white") +
annotate("rect", xmin = 3, xmax = 6, ymin = 0, ymax = 3, alpha = 0.25, color = "black", fill = "white") +
annotate("rect", xmin = 0, xmax = 3, ymin = 3, ymax = 6, alpha = 0.25, color = "black", fill = "white") +
annotate("rect", xmin = 3, xmax = 6, ymin = 3, ymax = 6, alpha = 0.25, color = "black", fill = "white") +
geom_point(size = 3.5) +
scale_color_gradient2(low = "black", mid = "grey", high = "red", midpoint = median(emotions_event_long$rmssd_total)) +
scale_x_continuous(limits = c(0, 6)) +
scale_y_continuous(limits = c(0, 6)) +
labs(x = "Valence",
y = "Arousal",
color = "Instability",
title = 'Time: {round(frame_time)}') +
transition_time(time) +
theme_minimal(base_size = 18)
ani_p2 <- animate(p2, nframes = 320, end_pause = 20, fps = 16, width = 550, height = 500)
ani_p2
The overall picture is that some are more emotionally resilient than others. As of now, all the simulatons return to their baseline attractor, but we would realistically expect some to stay stressed or gloomy following bad news. In the coming months I’ll be looking into how to incorporate emotion regulation into the simulation. For example, maybe some of the simulatons use better coping strategies than others? I’m also interested in incorporating appraisal mechanisms that allow for different reactions depending on the type of emotional stimulus.
In our next MünsteR R-user group meetup on Tuesday, July 9th, 2019, we will have two exciting talks about Word2Vec Text Mining & Parallelization in R!
You can RSVP here: https://www.meetup.com/de-DE/Munster-R-Users-Group/events/262236134/
Thorben Hellweg will talk about Parallelization in R. More information tba!
Maren Reuter from viadee AG will give an introduction into the functionality and use of the Word2Vec algorithm in R.
Text data in its raw form cannot be used as input for machine learning algorithms, so an information extraction method is needed to turn plain text into an appropriate representation. By exploiting the semantic and syntactic structure of the text, the importance of a word can be defined and represented as a vector in a vector space; that is, the vector can be seen as a numerical “importance” value. Two approaches predominate for representing words as vectors: using word frequencies (n-grams), or using a prediction model to estimate the relatedness of words. The Word2Vec algorithm by Mikolov et al. belongs to the latter. This talk will show how the algorithm works and how it can be used in practice.
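For readers who want a taste before the talk, here is a minimal sketch using the word2vec package from CRAN, one of several R interfaces to the algorithm and not necessarily the one the talk will use; the toy corpus is obviously far too small for meaningful embeddings:
library(word2vec)
# Toy corpus: one document per element (real applications need much more text)
txt <- c("the cat sat on the mat",
         "the dog sat on the rug",
         "cats and dogs make good pets")
# Train a small skip-gram model; dim and iter are kept tiny for illustration
model <- word2vec(x = txt, type = "skip-gram", dim = 10, iter = 20, min_count = 1)
# Every word is now a numeric vector...
embeddings <- as.matrix(model)
# ...and relatedness of words can be queried via nearest neighbours
predict(model, newdata = "cat", type = "nearest", top_n = 3)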
Maren Reuter is an IT-Consultant at viadee AG and part of the company’s Artificial Intelligence research group. She got her Master’s degree in Information Systems at the University of Münster with a focus on Data Analytics. In her Master’s thesis she worked with text mining techniques to predict maintenance tasks in agile software projects; for this purpose, she used the Word2Vec algorithm to build a word vector representation model.
Earnings calls suggest that company executives are fed up with Mr Trump’s tariffs and trade wars
This week we return to Australian Rules Football, the R package fitzRoy and some statistics to ask – why can’t Geelong win after a bye?
(with apologies to long-time readers who used to come for the science)
Code and a report for this blog post are available at Github.
First, some background. In 2011 the AFL expanded from 16 to 17 teams with the addition of the Gold Coast Suns. In the same year, a bye round (a week where some teams don’t play) was reintroduced to the competition. For the purposes of this discussion, we are interested only in bye rounds since 2011, and during the regular home/away season.
You will often hear footy fans claim – sometimes with very little evidence – that “we don’t go well after the bye.” For one team, this is certainly true. That team is Geelong, who have not won a game in the round following a bye since Round 7 in 2011.
Is this unusual? If so, does the available game data suggest any reason?
We start as ever with the excellent fitzRoy package and use get_match_results() to – well, get the match results.
Next, we can use some tidyverse magic to obtain all games in the round immediately before, and after, a bye. This looks long and complicated, so here’s a version with annotations in the comments to explain what’s going on:
results_bye <- results %>%
  # choose the desired columns
  select(Season, Round, Date, Venue, Home.Team, Away.Team, Margin) %>%
  # create one column for teams, another to indicate whether home or away
  gather(Status, Team, -Season, -Round, -Margin, -Date, -Venue) %>%
  # filter for 2011 onwards and only home/away games
  filter(Season > 2010, grepl("^R", Round)) %>%
  # create a column with the number of each round
  separate(Round, into = c("prefix", "suffix"), sep = 1) %>%
  mutate(suffix = as.numeric(suffix)) %>%
  # for each team's games in a season find games
  # the week before and after a bye
  arrange(Season, Team, suffix) %>%
  group_by(Season, Team) %>%
  mutate(bye = case_when(
           suffix - lead(suffix) == -2 ~ "before",
           suffix - lag(suffix) == 2 ~ "after",
           TRUE ~ as.character(suffix)
         ),
         # margins are with respect to home team so negate them if away
         Margin = ifelse(Status == "Away.Team", -Margin, Margin)) %>%
  ungroup() %>%
  # filter for the pre- and post-bye games
  filter(bye %in% c("before", "after")) %>%
  # calculate result
  mutate(Result = case_when(
           Margin > 0 ~ "W",
           Margin < 0 ~ "L",
           TRUE ~ "D"
         )) %>%
  # recreate the Round column
  unite(Round, prefix, suffix, sep = "")
Let’s confirm that Geelong have not won after a bye in a long time:
results_bye %>% filter(Team == "Geelong", bye == "after")
Season | Round | Date | Venue | Margin | Status | Team | bye | Result |
---|---|---|---|---|---|---|---|---|
2011 | R7 | 2011-05-07 | Kardinia Park | 66 | Home.Team | Geelong | after | W |
2011 | R23 | 2011-08-27 | Kardinia Park | -13 | Home.Team | Geelong | after | L |
2012 | R13 | 2012-06-22 | S.C.G. | -6 | Away.Team | Geelong | after | L |
2013 | R13 | 2013-06-23 | Gabba | -5 | Away.Team | Geelong | after | L |
2014 | R9 | 2014-05-17 | Subiaco | -32 | Away.Team | Geelong | after | L |
2016 | R16 | 2016-07-08 | Kardinia Park | -38 | Home.Team | Geelong | after | L |
2017 | R13 | 2017-06-15 | Subiaco | -13 | Away.Team | Geelong | after | L |
2018 | R15 | 2018-06-29 | Docklands | -2 | Away.Team | Geelong | after | L |
2019 | R14 | 2019-06-22 | Adelaide Oval | -11 | Away.Team | Geelong | after | L |
How does that compare with other teams?
We see all combinations: teams that seem to win more often after a bye, teams that win less often, and teams for which a bye makes no difference. However, Geelong certainly has the worst post-bye win/loss record.
We can ask: is the win/loss count in pre-bye games significantly different to those post-bye? One approach to this is to construct 2×2 contingency tables and perform Fisher’s exact test.
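For a single team, this boils down to tabulating wins and losses before and after the bye and handing the 2×2 table to fisher.test(). A minimal sketch with made-up counts (not Geelong’s actual record):
# Hypothetical before/after win-loss counts for one team
tab <- matrix(c(6, 2,   # before the bye: 6 wins, 2 losses
                1, 8),  # after the bye:  1 win,  8 losses
              nrow = 2, byrow = TRUE,
              dimnames = list(bye = c("before", "after"), Result = c("W", "L")))
fisher.test(tab)
fisher.test() returns the odds-ratio estimate, p-value and confidence interval that appear as estimate, p.value, conf.low and conf.high in the table below.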
With some more tidyverse magic we can nest the data for each team, generate the tests and summarise the results. This approach is explained very nicely in “Running a model on separate groups” over at Simon Jackson’s blog.
Only Geelong has p < 0.05, suggesting that there is something interesting about the win/loss count after the bye. We’ll just show the first 5 teams here.
results_bye %>%
  count(Team, bye, Result) %>%
  nest(-Team) %>%
  mutate(data = map(data, . %>% spread(Result, n) %>% select(2:3)),
         fisher = map(data, fisher.test),
         summary = map(fisher, tidy)) %>%
  select(Team, summary) %>%
  unnest() %>%
  select(-method, -alternative) %>%
  arrange(p.value) %>%
  pander(split.table = Inf)
Team | estimate | p.value | conf.low | conf.high |
---|---|---|---|---|
Geelong | 21.4 | 0.01522 | 1.533 | 1396 |
Sydney | 5.43 | 0.1698 | 0.6027 | 79.83 |
North Melbourne | 0.1736 | 0.2941 | 0.002835 | 2.438 |
Richmond | 3.68 | 0.3469 | 0.4059 | 43.34 |
Collingwood | 3.719 | 0.3498 | 0.4048 | 53.81 |
We can extend the previous visualisation by further breaking down games into home and away:
Now we see that of Geelong’s 8 post-bye losses, 6 were away games. Port Adelaide have a similar record. Then again, Brisbane have not won an away game before the bye, but you don’t hear anyone talking about Brisbane “not going well before the bye”.
When we look at those 6 away post-bye losses, one was in Melbourne – which in terms of travel distance is not very far from Geelong. The other five were “genuine” away games in Sydney, Brisbane, Adelaide and Perth (2).
Season | Round | Date | Venue | Margin | Status | Team | bye | Result |
---|---|---|---|---|---|---|---|---|
2012 | R13 | 2012-06-22 | S.C.G. | -6 | Away.Team | Geelong | after | L |
2013 | R13 | 2013-06-23 | Gabba | -5 | Away.Team | Geelong | after | L |
2014 | R9 | 2014-05-17 | Subiaco | -32 | Away.Team | Geelong | after | L |
2017 | R13 | 2017-06-15 | Subiaco | -13 | Away.Team | Geelong | after | L |
2018 | R15 | 2018-06-29 | Docklands | -2 | Away.Team | Geelong | after | L |
2019 | R14 | 2019-06-22 | Adelaide Oval | -11 | Away.Team | Geelong | after | L |
In addition, three of the losses were against a side also coming off the bye, but playing at home.
Season | Round | Date | Venue | Margin | Status | Team | bye | Result |
---|---|---|---|---|---|---|---|---|
2012 | R13 | 2012-06-22 | S.C.G. | -6 | Away.Team | Geelong | after | L |
2014 | R9 | 2014-05-17 | Subiaco | -32 | Away.Team | Geelong | after | L |
2017 | R13 | 2017-06-15 | Subiaco | -13 | Away.Team | Geelong | after | L |
What about away games before the bye? One loss in Melbourne, three wins in Melbourne, one win in Adelaide, and one win in Sydney versus the GWS Giants, who at that time were a new and struggling team.
Season | Round | Date | Venue | Margin | Status | Team | bye | Result |
---|---|---|---|---|---|---|---|---|
2011 | R5 | 2011-04-26 | M.C.G. | 19 | Away.Team | Geelong | before | W |
2011 | R21 | 2011-08-14 | Football Park | 11 | Away.Team | Geelong | before | W |
2012 | R11 | 2012-06-08 | Docklands | 12 | Away.Team | Geelong | before | W |
2013 | R11 | 2013-06-08 | Sydney Showground | 59 | Away.Team | Geelong | before | W |
2016 | R14 | 2016-06-25 | Docklands | -3 | Away.Team | Geelong | before | L |
2019 | R12 | 2019-06-07 | M.C.G. | 67 | Away.Team | Geelong | before | W |
Our last question: for games after a bye, what was the expected result? By expected we mean “according to the bookmakers”. We can join the match results with historical betting data, assign the expected result (win or loss) to Geelong according to their odds, then compare expected versus actual results. This reveals that six of the eight post-bye losses were unexpected – not surprising as Geelong has been a strong team in the period from 2011 to now.
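The join itself isn’t shown here, but a hedged sketch of what it could look like, assuming a hypothetical data frame odds holding historical decimal odds with Date, Team and Win.Odds columns (the real column names depend on the betting data source):
# Sketch only: `odds` is a hypothetical frame of historical decimal odds
geelong_expected <- results_bye %>%
  filter(Team == "Geelong") %>%
  inner_join(odds, by = c("Date", "Team")) %>%
  # decimal odds below 2 imply a bookmaker win probability above 50%
  mutate(Expected = ifelse(Win.Odds < 2, "W", "L")) %>%
  count(bye, Result, Expected)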
bye | Result | Expected | n |
---|---|---|---|
after | L | L | 2 |
after | L | W | 6 |
after | W | W | 1 |
before | L | L | 1 |
before | L | W | 1 |
before | W | L | 1 |
before | W | W | 6 |
In summary
Historically, Geelong do seem more prone to losing after a bye round than other teams, and those losses have been unexpected in terms of betting odds.
However, a large proportion of their post-bye losses have been interstate away games, versus strong opponents. Away games before the bye have been either in Melbourne, or versus weaker opponents.
Scheduling may therefore have played a role in Geelong’s post-bye win/loss record.
Paper: A Case for Backward Compatibility for Human-AI Teams
Article: Speech2Face: Learning the Face Behind a Voice
Article: A.I. Ethics Boards Should Be Based on Human Rights
Article: Diversity in IT: The Business and Moral Reasons
Article: The Hitchhiker’s Guide to AI Ethics Part 3: What AI Does & Its Impact
Article: Fahrenheit 2050: the Future Shock of AI & Machine Learning
Article: AI’s (Auto)Complete Control Over Humanity
Article: The origins of bias and how AI may be the answer to ending its reign
Reader Antonio R. forwarded a tweet about the following "waterfall of pie charts" to me:
Maarten Lamberts loved these charts (source: here).
I am immediately attracted to the visual thinking behind this chart. The data are presented in a hierarchy with three levels. The levels are nested in the sense that the pieces in each pie chart add up to 100%. From the first level to the second, the category of freshwater is sub-divided into three parts. From the second level to the third, the "others" subgroup under freshwater is sub-divided into five further categories.
The designer faces a twofold challenge: presenting the proportions at each level, and integrating the three levels into one graphic. The second challenge is harder to master.
The solution here is quite ingenious. A waterfall/waterdrop metaphor is used to link each layer to the one below. It visually conveys the hierarchical structure.
***
There remains a little problem: a confusion between parts and wholes. The link between levels should be that one part of the upper level becomes the whole of the lower level. Because of the color scheme, it appears that the part above does not account for the entirety of the pie below. For example, water in lakes is plotted on both the second and third layers, while water in soil suddenly enters the diagram at the third level even though it should be part of the "drop" from the second layer.
***
I started playing around with various related forms. I like the concept of linking the layers and want to retain it. Here is one graphic inspired by the waterfall pies from above:
Machine Learning: Dimensionality Reduction via Linear Discriminant Analysis
Natural Language Interface to DataTable
Data Literacy: Using the Socratic Method
Designing Tools and Activities for Data Literacy Learners
A Data and Analytics Leader’s Guide to Data Literacy
Being right matters : model-compliant events in predictive processing
The Divergence Index: A new polarization measure for ordinal categorical variables
Disposable Technology: A Concept Whose Time Has Come
Deep Knowledge: Next Step After Deep Learning
How to Increase the Impact of Your Machine Learning Model
Five Command Line Tools for Data Science
Genetic Artificial Neural Networks
Inferring New Relationships using the Probabilistic Soft Logic
Probabilistic Soft Logic (PSL)
You Are Having a Relationship With a Chatbot!
Distributed Deep Learning Pipelines with PySpark and Keras
This workshop should be really interesting:
Silviu Paun and Dirk Hovy are co-organizing it. They’re very organized and know this area as well as anyone. I’m on the program committee, but won’t be able to attend.
I really like the problem of crowdsourcing. Especially for machine learning data curation. It’s a fantastic problem that admits of really nice Bayesian hierarchical models (no surprise to this blog’s audience!).
The rest of this note’s a bit more personal, but I’d very much like to see others adopt similar plans for the future of data curation and application.
The past
Crowdsourcing is near and dear to my heart as it’s the first serious Bayesian modeling problem I worked on. Breck Baldwin and I were working on crowdsourcing for applied natural language processing in the mid 2000s. I couldn’t quite figure out a Bayesian model for it by myself, so I asked Andrew if he could help. He invited me to the “playroom” (a salon-like meeting he used to run every week at Columbia), where he and Jennifer Hill helped me formulate a crowdsourcing model.
As Andrew likes to say, every good model was invented decades ago for psychometrics, and this one’s no different. Phil Dawid had formulated exactly the same model (without the hierarchical component) back in 1979, estimating parameters with EM (itself only published in 1977). The key idea is treating the crowdsourced data like any other noisy measurement. Once you do that, it’s just down to details.
Part of my original motivation for developing Stan was to have a robust way to fit these models. Hamiltonian Monte Carlo (HMC) only handles continuous parameters, so like in Dawid’s application of EM, I had to marginalize out the discrete parameters. This marginalization’s the key to getting these models to sample effectively. Sampling discrete parameters that can be marginalized is a mug’s game.
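Concretely, in a Dawid–Skene-style model with $K$ classes, prevalence $\pi$, and a confusion matrix $\theta_j$ for each annotator $j$, the discrete true class $z_i$ of item $i$ is summed out of the likelihood for its labels $y_{i1}, \ldots, y_{iJ}$:
$p(y_i \mid \pi, \theta) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} \theta_{j, k, y_{ij}}$
In Stan, this sum is computed on the log scale with log_sum_exp, which keeps the log density differentiable so HMC can do its job.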
The present
Coming full circle, I co-authored a paper with Silviu and Dirk recently, Comparing Bayesian models of annotation, that reformulated and evaluated a bunch of these models in Stan.
Editorial Aside: Every field should move to journals like TACL. Free to publish, fully open access, and roughly one month turnaround to first decision. You have to experience journals like this in action to believe it’s possible.
The future
I want to see these general techniques applied to creating probabilistic corpora, to online adaptive training data (aka active learning), to joint corpus inference and model training (a la Raykar et al.’s models), and to evaluation.
P.S. Cultural consensus theory
I’m not the only one who recreated Dawid and Skene’s model. It’s everywhere these days.
I recently discovered an entire literature dating back decades on cultural consensus theory, which uses very similar models (I’m pretty sure either Lauren Kennedy or Duco Veen pointed out the literature). The authors go more into the philosophical underpinnings of the notion of consensus driving these models (basically the underlying truth of which you are taking noisy measurements). One neat innovation in the cultural consensus theory literature is a mixture model of truth: you can assume multiple subcultures are coding the data with different standards. I’d thought of mixture models of coders (say experts, Mechanical turkers, and undergrads), but not of the truth.
In yet another small-world phenomenon, right after I discovered cultural consensus theory, I saw a cello concert organized through Groupmuse by a social scientist at NYU I’d originally met through a mutual friend of Andrew’s. He introduced the cellist, Iona Batchelder, and added as an aside that she was the daughter of well-known social scientists. Not just any social scientists, but the developers of cultural consensus theory!
Last week I had the honor of giving a 1 hour talk about Choroplethr at a private company.
When this company reached out to me about speaking there, I originally planned to give the same talk I gave at CDC two years ago.
But as I reviewed the CDC talk, I realized two things:
Because of this, I decided to rewrite the talk from scratch.
I wanted to share this new and improved resource with a wider audience, so I just recorded myself giving this new talk. You can view the talk below.
I hope this helps you get the most out of Choroplethr!
Interested in having me give this talk at your company? Contact me and let me know!
The post New Resource for Learning Choroplethr appeared first on AriLamstein.com.