My Data Science Blogs

December 16, 2018

Surprise-hacking: “the narrative of blindness and illusion sells, and therefore continues to be the central thesis of popular books written by psychologists and cognitive scientists”

Teppo Felin sends along this article with Mia Felin, Joachim Krueger, and Jan Koenderink on “surprise-hacking,” and writes:

We essentially see surprise-hacking as the upstream, theoretical cousin of p-hacking. Though, surprise-hacking can’t be resolved with replication, more data or preregistration. We use perception and priming research to make these points (linking to Kahneman and priming, Simons and Chabris’s famous gorilla study and its interpretation, etc).

We think surprise-hacking implicates theoretical issues that haven’t meaningfully been touched on – at least in the limited literatures that we are aware of (mostly in cog sci, econ, psych). Though, there are probably related literatures out there (which you are very likely to know) – so I’m curious if you are aware of papers in other domains that deal with this or related issues?

I think the point that Felin et al. are making is that results obtained under conditions of surprise might not generalize to normal conditions. The surprise in the experiment is typically thought of as a mechanism for isolating some phenomenon—part of the design of the experiment—but arguably it is one of the conditions of the experiment as well. Thus, the conclusion of a study conducted under surprise should not be, “People show behavior X,” but rather, “People show behavior X under a condition of surprise.”

Regarding Felin’s question to me: I am not aware of any discussion of this issue in the political science literature, but maybe there’s something out there, or perhaps something related? All I can think of right now is experiments on public opinion and voting, where there is some discussion of relevance of isolated experiments to real-world behavior when people are subject to many influences.

I’ll conclude with a line from Felin et al.’s paper:

The narrative of blindness and illusion sells, and therefore continues to be the central thesis of popular books written by psychologists and cognitive scientists.


I’m reminded of the two modes of reasoning in pop-microeconomics: (1) People are rational and respond to incentives. Behavior that looks irrational is actually completely rational once you think like an economist, or (2) People are irrational and they need economists, with their open minds, to show them how to be rational and efficient.

They get you coming and going, and the common thread is that they know best. The message is that we are all foolish fools and we need the experts’ expertise for life-hacks that will change our lives.

If we step back a bit further, we can associate this with a general approach to social science, or science in general, which is to focus on “puzzles” or anomalies to our existing theories. From a Popperian/Lakatosian perspective, it makes sense to gnaw on puzzles and to study the counterintuitive. The point, though, is that the blindness and illusion is as much a property of researchers—after all, the point is to investigate phenomena that don’t fit with our scientific models of the world—as of the people being studied. It’s not so much that people are predictably irrational, but that existing scientific theories are wrong in some predictable ways.

The post Surprise-hacking: “the narrative of blindness and illusion sells, and therefore continues to be the central thesis of popular books written by psychologists and cognitive scientists” appeared first on Statistical Modeling, Causal Inference, and Social Science.


R Packages worth a look

A Parser for ‘ArchieML’ (rchie)
Parses the ‘ArchieML’ format from the New York Times <http://archieml.org>. Also provides ut …

A ‘ggplot2’ Extension to Make Normal Violin Plots (ggnormalviolin)
Uses ‘ggplot2’ to create normally distributed violin plots with specified means and standard deviations. This function can be useful in showing hypothe …

Sparse Multi-Type Regularized Feature Modeling (smurf)
Implementation of the SMuRF algorithm of Devriendt et al. (2018) <arXiv:1810.03136> to fit generalized linear models (GLMs) with multiple types o …

Projection Pursuit Based on Gaussian Mixtures and Evolutionary Algorithms (ppgmmga)
Projection Pursuit (PP) algorithm for dimension reduction based on Gaussian Mixture Models (GMMs) for density estimation using Genetic Algorithms (GAs) …


Magister Dixit

“Experience is not only the best teacher, but also perhaps the only teacher.” Daniel Tunkelang


Document worth reading: “Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences”

This paper is an attempt to bridge the conceptual gaps between researchers working on the two widely used approaches based on positive definite kernels: Bayesian learning or inference using Gaussian processes on the one side, and frequentist kernel methods based on reproducing kernel Hilbert spaces on the other. It is widely known in machine learning that these two formalisms are closely related; for instance, the estimator of kernel ridge regression is identical to the posterior mean of Gaussian process regression. However, they have been studied and developed almost independently by two essentially separate communities, and this makes it difficult to seamlessly transfer results between them. Our aim is to overcome this potential difficulty. To this end, we review several old and new results and concepts from either side, and juxtapose algorithmic quantities from each framework to highlight close similarities. We also provide discussions on subtle philosophical and theoretical differences between the two approaches. Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences
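
As a concrete instance of the equivalence mentioned in the abstract (standard notation, not quoted from the paper): with kernel matrix \( K \), test-point kernel vector \( k_* \), Gaussian noise variance \( \sigma^2 \), ridge penalty \( \lambda \), and \( n \) training points,

\( \bar{f}_{GP}(x_*) = k_*^\top (K + \sigma^2 I)^{-1} y \qquad \hat{f}_{KRR}(x_*) = k_*^\top (K + n\lambda I)^{-1} y \)

so the Gaussian process posterior mean and the kernel ridge regression estimator coincide when \( \sigma^2 = n\lambda \).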


December 15, 2018

linl 0.0.3: Micro release

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Our linl package for writing LaTeX letters with (R)markdown had a fairly minor release today, following up on the previous release well over a year ago. This version contains just one change, which Mark van der Loo provided a few months ago with a clean PR. As another user was just bitten by the same issue when using an included letterhead – which was fixed but unreleased – we decided it was time for a release. So there it is.

linl makes it easy to write letters in markdown, with some extra bells and whistles thanks to some cleverness chiefly by Aaron.
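
For anyone who has not tried it, the workflow is just an R Markdown file rendered with the linl output format. A minimal sketch (the file name is hypothetical, and its YAML header would select the format via output: linl::linl):

# render a hypothetical R Markdown letter with the linl output format
rmarkdown::render("letter.Rmd", output_format = linl::linl())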

Here is a screenshot of the vignette showing the simple input for some moderately fancy output:

The NEWS entry follows:

Changes in linl version 0.0.3 (2018-12-15)

  • Correct LaTeX double loading of package color with different options (Mark van der Loo in #18 fixing #17).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the linl page. For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Data Scientist’s Dilemma – The Cold Start Problem

The ancient philosopher Confucius has been credited with saying “study your past to know your future.” This wisdom applies not only to life but also to machine learning. Specifically, the availability and application of labeled data (things past) for the labeling of previously unseen data (things future) is fundamental to supervised machine learning.

Without labels (diagnoses, classes, known outcomes) in past data, how do we make progress in labeling (explaining) future data? This would be a problem.

A related problem also arises in unsupervised machine learning. In these applications, there is no requirement or presumption regarding the existence of labeled training data — we are essentially parameterizing or characterizing the patterns in the data (e.g., the trends, correlations, segments, clusters, associations).

Many unsupervised learning models can converge more readily and be more valuable if we know in advance which parameterizations are best to choose. If we cannot know that (i.e., because it truly is unsupervised learning), then we would like to know at least that our final model is optimal (in some way) in explaining the data.

In both of these applications (supervised and unsupervised machine learning), if we don’t have these initial insights and validation metrics, then how does such model-building get started and get moving towards the optimal solution?

This challenge is known as the cold-start problem! The solution to the problem is easy (sort of): We make a guess — an initial guess! Usually, that would be a totally random guess.

That sounds so… so… random! How do we know whether it’s a good initial guess? How do we progress our model (parameterizations) from that random initial choice? How do we know that our progression is moving towards more accurate models? How? How? How?

This can be a real challenge. Of course nobody said the “cold start” problem would be easy. Anyone who has ever tried to start a very cold car on a frozen morning knows the pain of a cold start challenge. Nothing can be more frustrating on such a morning. But, nothing can be more exhilarating and uplifting on such a morning than that moment when the engine starts and the car begins moving forward with increasing performance.

The experiences for data scientists who face cold-start problems in machine learning can be very similar to those, especially the excitement when our models begin moving forward with increasing performance.

We will itemize several examples at the end. But before we do that, let’s address the objective function. That is the true key that unlocks performance in a cold-start challenge.  That’s the magic ingredient in most of the examples that we will list.

The objective function (also known as cost function, or benefit function) provides an objective measure of model performance. It might be as simple as the percentage of class labels that the model got right (in a classification model), or the sum of the squares of the deviations of the points from the model curve (in a regression model), or the compactness of the clusters relative to their separation (in a clustering analysis).
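
As a small illustration of such objective functions in R (generic helpers, not tied to any particular dataset or to the examples below):

# classification: fraction of labels the model got right
accuracy <- function(truth, predicted) mean(truth == predicted)

# regression: sum of squared deviations from the fitted values
rss <- function(y, y_hat) sum((y - y_hat)^2)

# clustering: kmeans() reports compactness as the total within-cluster sum of squares,
# e.g. kmeans(x, centers = 3)$tot.withinss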

The value of the objective function is not only in its final value (i.e., giving us a quantitative overall model performance rating), but its great (perhaps greatest) value is realized in guiding our progression from the initial random model (cold-start zero point) to that final successful (hopefully, optimal) model. In those intermediate steps it serves as an evaluation (or validation) metric.

By measuring the evaluation metric at step zero (cold-start), then measuring it again after making adjustments to the model parameters, we learn whether our adjustments led to a better performing model or worse performance. We then know whether to continue making model parameter adjustments in the same direction or in the opposite direction. This is called gradient descent.

Gradient descent methods basically find the slope (i.e., the gradient) of the performance error curve as we progress from one model to the next. As we learned in grade school algebra class, we need two points to find the slope of a curve. Therefore, it is only after we have run and evaluated two models that we will have two performance points — the slope of the curve at the latest point then informs our next choice of model parameter adjustments: either (a) keep adjusting in the same direction as the previous step (if the performance error decreased) to continue descending the error curve; or (b) adjust in the opposite direction (if the performance error increased) to turn around and start descending the error curve.
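
A minimal numerical sketch of this idea in R (a one-parameter model with a made-up quadratic error curve; the learning rate, starting range, and iteration count are arbitrary choices for illustration):

# toy error curve E(w) = (w - 3)^2, minimized at w = 3
error_fn <- function(w) (w - 3)^2

set.seed(42)
w    <- runif(1, -10, 10)  # cold start: a random initial guess for the parameter
step <- 0.1                # learning rate

for (i in 1:100) {
  # approximate the slope from two nearby evaluations of the error curve
  slope <- (error_fn(w + 1e-6) - error_fn(w)) / 1e-6
  w     <- w - step * slope  # move downhill along the error curve
}
w  # ends up close to the optimum at w = 3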

Note that hill-climbing is the opposite of gradient descent, but essentially the same thing. Instead of minimizing error (a cost function), hill-climbing focuses on maximizing accuracy (a benefit function). Again, we measure the slope of the performance curve from two models, then proceed in the direction of better-performing models. In both cases (hill-climbing and gradient descent), we hope to reach an optimal point (maximum accuracy or minimum error), and then declare that to be the best solution. And that is amazing and satisfying when we remember that we started (as a cold-start) with an initial random guess at the solution.

When our machine learning model has many parameters (which could be thousands for a deep neural network), the calculations are more complex (perhaps involving a multi-dimensional gradient calculation, known as a tensor). But the principle is the same: quantitatively discover at each step in the model-building progression which adjustments (size and direction) are needed in each one of the model parameters in order to progress towards the optimal value of the objective function (e.g., minimize errors, maximize accuracy, maximize goodness of fit, maximize precision, minimize false positives, etc.). In deep learning, as in typical neural network models, the method by which those adjustments to the model parameters are estimated (i.e., for each of the edge weights between the network nodes) is called backpropagation. That is still based on gradient descent.

One way to think about gradient descent, backpropagation, and perhaps all machine learning is this: “Machine Learning is the set of mathematical algorithms that learn from experience. Good judgment comes from experience. And experience comes from bad judgment.” In our case, the initial guess for our random cold-start model can be considered “bad judgment”, but then experience (i.e., the feedback from validation metrics such as gradient descent) brings “good judgment” (better models) into our model-building workflow.

Here are ten examples of cold-start problems in data science where the algorithms and techniques of machine learning produce the good judgment in model progression toward the optimal solution:

  • Clustering analysis (such as K-Means Clustering), where the initial cluster means and the number of clusters are not known in advance (and thus are chosen randomly initially), but the compactness of the clusters can be used to evaluate, iterate, and improve the set of clusters in a progression to the final optimum set of clusters (i.e., the most compact and best separated clusters); see the short R sketch after this list.
  • Neural networks, where the initial weights on the network edges are assigned randomly (a cold-start), but backpropagation is used to iterate the model to the optimal network (with highest classification performance).
  • TensorFlow deep learning, which uses the same backpropagation technique of simpler neural networks, but the calculation of the weight adjustments is made across a very high-dimensional parameter space of deep network layers and edge weights using tensors.
  • Regression, which uses the sum of the squares of the deviations of the points from the model curve in order to find the best-fit curve. In linear regression, there is a closed-form solution (derivable from the linear least-squares technique). The solution for non-linear regression is not typically a closed-form set of mathematical equations, but the minimization of the sum of the squares of deviations still applies — gradient descent can be used in an iterative workflow to find the optimal curve. Note that K-Means Clustering is actually an example of piecewise regression.
  • Nonconvex optimization, where the objective function has many hills and valleys, so that gradient descent and hill-climbing will typically converge only to a local optimum, not to the global optimum. Techniques like genetic algorithms, particle swarm optimization (when the gradient cannot be calculated), and other evolutionary computing methods are used to generate lots of random (cold-start) models and then iterate each of them until you find the global optimum (or until you run out of time and resources, and then pick the best one that you could find). [See my graphic attached below that illustrates a sample use case for genetic algorithms.]
  • kNN (k-Nearest Neighbors), which is a supervised learning technique in which the data set itself becomes the model. In other words, the assignment of a new data point to a particular group (which may or may not have a class label or a particular meaning yet) is based simply upon finding which category (group) of existing data points is in the majority when you take a vote of the nearest neighbors to the new data point. The number of nearest neighbors that are to be examined is some number k, which can be initially arbitrary (a cold-start), but then it is adjusted to improve model performance.
  • Naive Bayes classification, which applies Bayes theorem to a large data set with class labels on the data items, but for which some combinations of attributes and features are not represented in the training data (i.e., a cold-start challenge). By assuming that the different attributes are mutually independent features of the data items, then one can estimate the posterior likelihood for what the class label should be for a new data item with a feature vector (set of attributes) that is not found in the training data. This is sometimes called a Bayes Belief Network (BBN) and is another example of where the data set becomes the model, where the frequency of occurrence of the different attributes individually can inform the expected frequency of occurrence of different combinations of the attributes.
  • Markov modeling (Belief Networks for Sequences) is an extension of BBN to sequences, which can include web logs, purchase patterns, gene sequences, speech samples, videos, stock prices, or any other temporal or spatial or parametric sequence.
  • Association rule mining, which searches for co-occurring associations that occur higher than expected from a random sampling of a data set. Association rule mining is yet another example where the data set becomes the model, where no prior knowledge of the associations is known (i.e., a cold-start challenge). This technique is also called Market Basket Analysis, which has been used for simple cold-start customer purchase recommendations, but it also has been used in such exotic use cases as tropical storm (hurricane) intensification prediction.
  • Social network (link) analysis, where the patterns in the network (e.g., centrality, reach, degrees of separation, density, cliques, etc.) encode knowledge about the network (e.g., most authoritative or influential nodes in the network), through the application of algorithms like PageRank, without any prior knowledge about those patterns (i.e., a cold-start).
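
Expanding on the first item above, here is a short R sketch of the k-means cold start: the initial centers are random, and the total within-cluster sum of squares serves as the objective that guides and compares the runs (the data are simulated purely for illustration):

set.seed(1)
# simulated data with three well-separated groups
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))

# cold start: kmeans() picks random initial centers; nstart = 25 repeats the random
# initialization and keeps the run whose clusters are most compact
fit <- kmeans(x, centers = 3, nstart = 25)
fit$tot.withinss  # the objective value: total within-cluster sum of squares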

Finally, as a bonus, we mention a special case, Recommender Engines, where the cold-start problem is a subject of ongoing research. The research challenge is to find the optimal recommendation for a new customer or for a new product that has not been seen before. Check out these articles  related to this challenge:

  1. The Cold Start Problem for Recommender Systems
  2. Tackling the Cold Start Problem in Recommender Systems
  3. Approaching the Cold Start Problem in Recommender Systems

We started this article mentioning Confucius and his wisdom. Here is another form of wisdom: https://rapidminer.com/wisdom/ — the RapidMiner Wisdom conference. It is a wonderful conference, with many excellent tutorials, use cases, applications, and customer testimonials. I was honored to be the keynote speaker for their 2018 conference in New Orleans, where I spoke about “Clearing the Fog around Data Science and Machine Learning: The Usual Suspects in Some Unusual Places”. You can find my slide presentation here: KirkBorne-RMWisdom2018.pdf

 


Book Memo: “Data Visualization”

A Practical Introduction
This book provides students and researchers a hands-on introduction to the principles and practice of data visualization. It explains what makes some graphs succeed while others fail, how to make high-quality figures from data using powerful and reproducible methods, and how to think about data visualization in an honest and effective way.
Data Visualization builds the reader’s expertise in ggplot2, a versatile visualization library for the R programming language. Through a series of worked examples, this accessible primer then demonstrates how to create plots piece by piece, beginning with summaries of single variables and moving on to more complex graphics. Topics include plotting continuous and categorical variables; layering information on graphics; producing effective ‘small multiple’ plots; grouping, summarizing, and transforming data for plotting; creating maps; working with the output of statistical models; and refining plots to make them more comprehensible.
Effective graphics are essential to communicating ideas and a great way to better understand data. This book provides the practical skills students and practitioners need to visualize quantitative data and get the most out of their research findings.
• Provides hands-on instruction using R and ggplot2
• Shows how the ‘tidyverse’ of data analysis tools makes working with R easier and more consistent
• Includes a library of data sets, code, and functions


Request for comments on planned features for futile.logger 1.5

(This article was first published on R – Cartesian Faith, and kindly contributed to R-bloggers)

I will be pushing a new version of futile.logger (version 1.5) to CRAN in January. This version introduces a number of enhancements and fixes some bugs. It will also contain at least one breaking change. I am making the release process public, since the package is now used in a number of other packages. If you use futile.logger, this is your opportunity to influence the direction of the package and prepare for changes. Please use the github issue tracker for discussion.

There are two main themes for the enhancements: 1) integration with R's signal/condition system, and 2) supporting multiple appenders with different thresholds.

Hijacking the signal system

Currently, futile.logger is unaware of the signal system in R. The only tie-in is with ftry, which catches a warning or error and prints a message. Unfortunately, the behavior is different from try. This release will make ftry consistent with try. This is a breaking change, so if you use ftry in your code, it will no longer halt processing.

In addition to this fix, futile.logger will now have better integration with the signal system. Currently, if a function emits a warning or an error, these are printed to the screen. It would be convenient if futile.logger could capture these signals and associate them with the correct log levels, i.e. warning to WARN and error to ERROR.

My proposal is to create a hijack_signal_handlers function to override the existing signal handlers. This function can be called at the top of a script or within the .onLoad function of a package. Once called, any warnings or errors would be captured and handled by futile.logger. The implementation would look like this, giving granular control of whether to hijack just warning, errors, or both:

hijack_signal_handlers <- function(warning = TRUE, error = TRUE) {
  if (warning) {
    # Override warning handler
  }
  if (error) {
    # Override error handler
  }
}

One issue I see with this function is when used in a package’s .onLoad. Suppose a user requires package A and B. These don’t use futile.logger. Now the user requires package C, which calls hijack_signal_handlers in its .onLoad. When this occurs, warnings and errors emitted from packages A and B would also be captured by futile.logger. From my perspective, this is probably a good thing, but I can appreciate why others may not want this behavior. 
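
As a rough illustration of the kind of routing being proposed, here is a per-expression wrapper rather than a true global hijack; flog.warn() and flog.error() are existing futile.logger functions, while the wrapper itself is purely hypothetical:

# hypothetical helper: run an expression and forward its conditions to futile.logger
with_flog_handlers <- function(expr) {
  withCallingHandlers(
    tryCatch(expr,
             error = function(e) {
               futile.logger::flog.error(conditionMessage(e))  # error -> ERROR
               stop(e)                                         # re-raise the error
             }),
    warning = function(w) {
      futile.logger::flog.warn(conditionMessage(w))            # warning -> WARN
      invokeRestart("muffleWarning")                           # suppress the default message
    }
  )
}

# with_flog_handlers(log(-1))  # logs at WARN level instead of printing a warning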

Emitting signals

The other half of the signal handler puzzle is being able to emit signals from futile.logger. For this case, we want flog.warn to emit a warning signal and flog.error to emit an error signal. One signature looks like

flog.warn(message, emit=FALSE)

meaning that by default no signal is emitted. This would work for flog.warn, flog.error, and flog.fatal (could map to error as well).
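
A hedged sketch of what such an emit argument could do (illustrative only, not the package's implementation):

# hypothetical: log at WARN level and optionally raise a standard R warning too
flog_warn_emit <- function(msg, emit = FALSE) {
  futile.logger::flog.warn(msg)
  if (emit) warning(msg, call. = FALSE)
}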

I have mixed feelings about this use case. Part of me says if futile.logger hijacks the signal system, then just use stop and futile.logger will catch it. On the other hand that seems roundabout and less capable than writing flog.error(message, emit=TRUE) or whatever.

Really my main concern is what happens when these two systems work together? Will they play nice or will an error be emitted twice (likely)? If not, then there is more logic that has to be built in, which ultimately adds complexity and wastes cycles. Any input here is encouraged!



If you did not already know

dAIrector
dAIrector is an automated director which collaborates with human storytellers for live improvisational performances and writing assistance. dAIrector can be used to create short narrative arcs through contextual plot generation. In this work, we present the system architecture, a quantitative evaluation of design choices, and a case-study usage of the system which provides qualitative feedback from a professional improvisational performer. We present relevant metrics for the understudied domain of human-machine creative generation, specifically long-form narrative creation. We include, alongside publication, open-source code so that others may test, evaluate, and run the dAIrector. …

Adaptive Weights Clustering (AWC)
This paper presents a new approach to non-parametric cluster analysis called Adaptive Weights Clustering (AWC). The idea is to identify the clustering structure by checking at different points and for different scales on departure from local homogeneity. The proposed procedure describes the clustering structure in terms of weights \( w_{ij} \), each of which measures the degree of local inhomogeneity for two neighbor local clusters using statistical tests of ‘no gap’ between them. The procedure starts from a very local scale, then the parameter of locality grows by some factor at each step. The method is fully adaptive and does not require specifying the number of clusters or their structure. The clustering results are not sensitive to noise and outliers, and the procedure is able to recover different clusters with sharp edges or manifold structure. The method is scalable and computationally feasible. An intensive numerical study shows a state-of-the-art performance of the method in various artificial examples and applications to text data. Our theoretical study states optimal sensitivity of AWC to local inhomogeneity. …

Decoupled Learning
Incorporating encoding-decoding nets with adversarial nets has been widely adopted in image generation tasks. We observe that the state-of-the-art achievements were obtained by carefully balancing the reconstruction loss and adversarial loss, and such balance shifts with different network structures, datasets, and training strategies. Empirical studies have demonstrated that an inappropriate weight between the two losses may cause instability, and it is tricky to search for the optimal setting, especially when lacking prior knowledge on the data and network. This paper gives the first attempt to relax the need of manual balancing by proposing the concept of \textit{decoupled learning}, where a novel network structure is designed that explicitly disentangles the backpropagation paths of the two losses. Experimental results demonstrate the effectiveness, robustness, and generality of the proposed method. The other contribution of the paper is the design of a new evaluation metric to measure the image quality of generative models. We propose the so-called \textit{normalized relative discriminative score} (NRDS), which introduces the idea of relative comparison, rather than providing absolute estimates like existing metrics. …


Science and Technology links (December 15th 2018)

  1. Academic excellence is not a strong predictor of career excellence. There is weak correlation between grades and job performance. Grant reviews the evidence in detail in his New York Times piece. When recruiting research assistants, I look at grades as the last indicator. I find that imagination, ambition, initiative, curiosity, and drive are far better predictors of someone who will do useful work with me. Of course, these characteristics are themselves correlated with high grades, but there is something to be said about a student who decides that a given course is a waste of time and works on a side project instead. Breakthroughs don’t happen in regularly scheduled classes; they happen in side projects. We want people who complete the work they were assigned, but we also need people who can reflect critically on what is genuinely important. I don’t have any need for a smart automaton: I already have many computers. I have applied the same principle with my two sons: I do not overly stress the importance of good grades, encouraging them instead to pursue their own interests and to go beyond their classes.
  2. Our hearts do not regenerate. Thus a viable strategy might be to transplant brand new hearts from pigs. This is much harder than it appears, however. But progress is being made. Researchers are now able to keep baboons alive for months with transplanted pig hearts. To achieve this good result, the scientists had to use an immunosuppressant drug to prevent unwanted growth in the pig’s heart. With some luck, some of us could benefit from transplanted pig hearts in the near future.
  3. Cataract is the most common cause of blindness. It can be “cured” by removing your natural lenses and replacing them with artificial lenses called IOLs (intraocular lenses). This therapy was invented in the 1940s, but it took 40 years before it became widespread in wealthy countries. It is still out of reach in many countries. Yet the cost of an intraocular lens is less than $10 and the procedure is inexpensive (it costs less than $25 in total in some countries). Even today, in many rich countries, access to this therapy is restricted. Finally, in 2017, a government agency in the UK recommended that we stop rationing access to cataract surgery.
  4. Physically fit middle-age women are much less likely to develop dementia (e.g., Alzheimer’s).
  5. You might expect that research results published in more prestigious venues would also be more reliable. Brembs (2018) suggests it works the other way around:

    an accumulating body of evidence suggests the inverse: methodological quality and, consequently, reliability of published research works in several fields may be decreasing with increasing journal rank

    My own recommendation to colleagues and students has been that if peer-reviewed publications are warranted, then it is fine to target serious well-managed venues, irrespective of their “prestige”.

    It is hard enough to do solid research; if you also have to tune it so that it outcompetes other proposals in a competition for prestige, I fear that you may discourage good research practices. Scientists care too little about modesty; it is their downfall.

  6. Lomborg, a renowned economist, writes about climate change:

    Using the best individual and collectively peer-reviewed economic models, the total cost of Paris – through slower GDP growth from higher energy costs – will reach $1-2 trillion every year from 2030. (…) It’s so expensive because green energy isn’t ready to replace fossil fuels at scale. Nations are using expensive subsidies and other policies to force immature green technologies on consumers and businesses. We need to change course. The smart option, backed by economic science, is to adopt a technology-led policy. This means investing far more into green energy research and development. Rather than forcing the rollout of immature energy sources, we need to ensure that green energy can out-compete fossil fuels.

    I really like the term “technology-led policy”. If you want to change the world for the better, then making the good things cheap using technology and science is the golden path.

  7. About 60% of all scientists never lead a research project of their own, which indicates that they always play a supporting role. In fields like astronomy, ecology and robotics, half of all researchers leave the field every five years, a consequence of the fact that there are many more aspiring scientists than there are good jobs. Though this sounds bad, one must consider that the number of scientists doubles every 15 years. Thus even though the job prospects for scientists look poor in relative terms, we have never had so many gainfully employed scientists.
  8. The state of Louisiana is adopting digital driver’s licenses. Meanwhile, in Montreal, I still can’t take the subway without constantly recharging a stupid card.
  9. Lack of copper might lead to heart disease. Copper is found in shiitake mushrooms, oysters, dark chocolate, sesame seeds, cashew nuts, raw kale, beans and avocados.
  10. The diabetes drug Metformin is under study as an anti-aging drug. It is believed to be very safe, yet Konopka et al. suggest that it may lower the benefits of exercise.
  11. Over time, our bodies accumulate a small fraction of “senescent cells”. It is believed that these dysfunctional cells contribute to the diseases of old age. For the last few years, researchers have been looking for senolytics, drugs that can kill senescent cells. It turns out that two antibiotics approved for medical use are potent senolytics.
  12. The first autonomous vehicle (the ancestor of the self-driving car) was built in 1961.
  13. Billions of dollars have been spent on clinical trials to try to cure Alzheimer’s, all in vain. Golde et al. propose that the problem might have to do with poor timing: we need to apply the therapy at the right time. Wadmam suggests that Alzheimer’s might spread like an infection.
  14. China is introducing far reaching penalties for researchers who commit scientific fraud:

    Chinese leaders have been increasingly focused on scientific misconduct, following ongoing reports of researchers there using fraudulent data, falsifying CVs and faking peer reviews. In May, the government announced sweeping reforms to improve research integrity. One of those was the creation of a national database of misconduct cases. Inclusion on the list could disqualify researchers from future funding or research positions, and might affect their ability to get jobs outside academia. (Source: Nature)

    We need to recognize that the scientific enterprise is fundamentally on an honor-based system. It is trivial to cheat in science. You can work hard to collect data, or make it up as you go. Except for the most extreme cases, the penalty for cheating is small because there is almost always plausible deniability.


Six Sigma DMAIC Series in R – Part4

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)


    Hope you liked Part 1, Part 2, and Part 3 of this series. In this Part 4, we will go through the tools used during the Improve phase of the Six Sigma DMAIC cycle. The most representative tool used during the Improve phase is DOE (Design of Experiments). Proper use of DOE can lead to process improvement, but a badly designed experiment can lead to undesired results: inefficiency and higher costs.

    What is an experiment

    An experiment is a test or series of tests in which purposeful changes are made to input variables of a process or system so that we may observe and identify the reason for the change in output response.
    Consider the example of the simple process model, shown below, where we have controlled input factors X’s, output Y and uncontrolled factors Z’s.

    The objective of the experiment can be:

    1. Process Characterization: To know the relationship between X, Y & Z.
    2. Process Control: Capture changes in X, so that Y is always near the desired nominal value.
    3. Process Optimization: Capture changes in X, so that the variability of Y is minimized.
    4. Robust Design: Capture changes in X, so that effect of uncontrolled Z is minimized.

    Importance of experiment

    Experiments allow us to control the values of the Xs of the process and then measure the values of the Ys to discover what values of the independent variables will allow us to improve the performance of our process. In contrast, in the case of observational studies, we don’t have any influence on the variables we are measuring. We just collect the data and use the appropriate statistical technique.
    There are some risks when the analysis is based on data gathered directly during the normal operation of the process: inconsistent data, a limited range of variable values (the performance of the Xs outside that range is not known), and correlated variables.

    Characteristics of well planned experiments

    Some of the Characteristics of well-planned experiments are:

    1. The degree of Precision:
      The probability should be high that the experiment will be able to measure the differences with the degree of precision the experimenter desires. This implies an appropriate design and sufficient replication.
    2. Simplicity:
      The experiment should be as simple as possible, consistent with its objectives.
    3. The absence of Systematic Error:
      Units receiving one treatment should not differ in any systematic way from those receiving another treatment.
    4. The range of Validity of Conclusions:
      Experiments replicated in time and space would increase the range of validity of conclusions.
    5. Calculation of the degree of Uncertainty:
      The possibility of obtaining the observed results by chance alone should be quantifiable.

    Three basic principles of experiments

    Three basic principles of experiments are Randomization, Repetition & Blocking.

    Let’s understand this through an example: a food manufacturer is searching for the best recipe for its main product, a pizza dough. The managers decided to perform an experiment to determine the optimal levels of the three main ingredients in the pizza dough: flour, salt, and baking powder. The other ingredients are fixed as they do not affect the flavor of the final cooked pizza. The flavor of the product will be determined by a panel of experts who will give a score to each recipe. Therefore, we have three factors that we will call flour, salt, and baking powder (bakPow), with two levels each (− and +).

    pizzaDesign <- expand.grid(flour  = gl(2, 1, labels = c("-", "+")),
                               salt   = gl(2, 1, labels = c("-", "+")),
                               bakPow = gl(2, 1, labels = c("-", "+")),
                               score  = NA)
    

    Now, we have eight different experiments (recipes) including all the possible combinations of the three factors at two levels.
    When we have more than 2 factors, the combination of levels of different factors may affect the response. Therefore, to discover the main effects and the interactions, we should vary more than one level at a time, performing experiments in all the possible combinations.
    The reason why two-level factorial experiments are widely used is that, as the number of factor levels increases, the cost of the experiment increases. To study the variation under the same experimental conditions, replication is needed, that is, more than one trial per factor combination. The number of replications depends on several aspects (e.g., budget).

    Once an experiment has been designed, we will proceed with its randomization.

    pizzaDesign$ord <- sample(1:8, 8)
    pizzaDesign[order(pizzaDesign$ord),]
    

    Each time you repeat the command you get a different order due to randomization.

    2^k factorial Designs

    2^k factorial designs are those in which k factors are studied, all of them at 2 levels. The number of experiments we need to carry out to obtain a complete replication is precisely 2^k. If we want n replications of the experiment, then the total number of experiments is n × 2^k.
    ANOVA can be used to estimate the effect of each factor and interaction and assess which of these effects are significant.

    Example (contd.): The experiment is carried out by preparing the pizzas at the factory following the package instructions, namely: “bake the pizza for 9 min in an oven at 180°C.”

    After a blind trial was conducted, the experts gave scores to each of the eight (2^3) recipes in each replication of the experiment.

    ss.data.doe1 <- data.frame(repl = rep(1:2, each = 8), rbind(pizzaDesign[, -6], pizzaDesign[, -6])) 
    ss.data.doe1$score <- c(5.33, 6.99, 4.23, 6.61, 2.26, 5.75, 3.26, 6.24, 5.7, 7.71, 5.13, 6.76, 2.79, 4.57, 2.48, 6.18)
    
    

    The average for each recipe can be calculated as below:

    aggregate(score ~ flour + salt + bakPow, FUN = mean, data = ss.data.doe1)
    

    The best recipe seems to be the one with a high level of flour and a low level of salt and baking powder. Fit a linear model and perform an ANOVA to find the significant effects.

    doe.model1 <- lm(score ~ flour + salt + bakPow + flour * salt + flour * bakPow + salt * bakPow + flour * salt * bakPow, data = ss.data.doe1) 
    
    
    summary(doe.model1)
    
    

    The p-values show that the main effects of the ingredients flour and baking powder are significant, while the effect of salt is not. Neither the two-way nor the three-way interactions among the ingredients are significant. We can therefore simplify the model by excluding the non-significant effects. The new model with only the significant effects is:

    doe.model2 <- lm(score ~ flour + bakPow, data = ss.data.doe1) 
    summary(doe.model2)
    

    Therefore, the statistical model for our experiment is
    score = 4.8306 + 2.4538 × flour − 1.8662 × bakPow
    Thus, the recipe with a high level of flour and a low level of baking powder will be the best one, regardless of the level of salt (high or low). The estimated score for this recipe is

    `4.8306 + 2.4538 × 1 + (−1.8662) × 0 = 7.284.`

    The predict function can be used to get the estimates for all the experimental conditions.

     predict(doe.model2)
    

    Visualize Effect Plot and Interaction plot

    The ggplot2 package can be used to visualize the effect plot. The effect of flour is positive, while the effect of baking powder is negative.

    prinEf <- data.frame(Factor = rep(c("A_Flour", "C_Baking Powder"), each = 2),
                         Level = rep(c(-1, 1), 2),
                         Score = c(aggregate(score ~ flour, FUN = mean, data = ss.data.doe1)[, 2],
                                   aggregate(score ~ bakPow, FUN = mean, data = ss.data.doe1)[, 2]))
    p <- ggplot(prinEf, aes(x = Level, y = Score)) + geom_point() + geom_line() +
      geom_hline(yintercept = mean(ss.data.doe1$score), linetype = "dashed", color = "blue") +
      scale_x_continuous(breaks = c(-1, 1)) + facet_grid(. ~ Factor) + ggtitle("Plot of Factor Effects")
    print(p)
    

    The interaction plot is shown below. The lines do not cross, which means that there is no interaction between the factors plotted.

    intEf <- aggregate(score ~ flour + bakPow, FUN = mean, data = ss.data.doe1)
    q <- ggplot(intEf, aes(x = flour, y = score, color = bakPow)) + geom_point() + geom_line(aes(group = bakPow)) +
      geom_hline(yintercept = mean(ss.data.doe1$score), linetype = "dashed", color = "blue") +
      ggtitle("Interaction Plot")
    print(q)
    

    The normality of the residuals can be checked with the Shapiro-Wilk test. As the p-value is large, we fail to reject the null hypothesis that the residuals are normally distributed.

    shapiro.test(residuals(doe.model2))
    

    This was a brief introduction to DOE in R.
    In the next part, we will go through the Control phase of the Six Sigma DMAIC process. Please let me know your feedback in the comments section. Make sure to like & share it. Happy learning!




    “My advisor and I disagree on how we should carry out repeated cross-validation. We would love to have a third expert opinion…”

    Youyou Wu writes:

    I’m a postdoc studying scientific reproducibility. I have a machine learning question that I desperately need your help with. My advisor and I disagree on how we should carry out repeated cross-validation. We would love to have a third expert opinion…

    I’m trying to predict whether a study can be successfully replicated (DV), from the texts in the original published article. Our hypothesis is that language contains useful signals in distinguishing reproducible findings from irreproducible ones. The nuances might be blind to human eyes, but can be detected by machine algorithms.

    The protocol is illustrated in the following diagram to demonstrate the flow of cross-validation. We conducted a repeated three-fold cross-validation on the data.

    STEP 1) Train a doc2vec model on the training data (2/3 of the data) to convert raw texts into vectors representing language features (this algorithm is non-deterministic, the models and the outputs can be different even with the same input and parameter)
    STEP 2) Infer vectors using the doc2vec model for both training and test sets
    STEP 3) Train a logistic regression using the training set
    STEP 4) Apply the logistic regression to the test set, generate a predicted probability of success

    Because doc2vec is not deterministic, and we have a small training sample, we came up with two choices of strategies:

    (1) All studies were first divided into three subsamples A, B, and C. Steps 1 through 4 were done once with sample A as the test set, and a combined sample of B and C as the training set, generating one predicted probability for each study in sample A. To generate probabilities for the entire sample, Steps 1 through 4 were repeated two more times, setting sample B or C as the test set respectively. At this moment, we had one predicted probability for each study. Subsequently, the entire sample was shuffled to create a different random three-fold partition, followed by the same three-fold cross-validation. A new probability was generated for each study this time. The whole procedure was iterated 100 times, so each study had 100 different probabilities. We averaged the probabilities and compared the average probabilities with the ground truth to generate a single AUC score.

    (2) All studies were first divided into three subsamples A, B, and C. Steps 1 through 4 were first repeated 100 times with sample A as the test set, and a combined sample of B and C as the training set, generating 100 predicted probabilities for each study in sample A. As I said, these 100 probabilities are different because doc2vec isn’t deterministic. We took the average of these probabilities and treated that as our final estimate for the studies. To generate average probabilities for the entire sample, each group of 100 runs was repeated two more times, setting sample B or C as the test set respectively. An AUC was calculated upon completion, between the ground truth and the average probabilities. Subsequently, the entire sample was shuffled to create a different random three-fold partition, followed by the same 3×100 runs of modeling, generating a new AUC. The whole procedure was iterated on 100 different shuffles, and an AUC score was calculated each time. We ended up having a distribution of 100 AUC scores.

    I personally thought strategy two was better because it separates variation in accuracy due to sampling from the non-determinism of doc2vec. My advisor thought strategy one was better because it is less computationally intensive, produces better results, and doesn’t have obvious flaws.

    My first thought is to move away from the idea of declaring a study as being “successfully replicated.” Better to acknowledge the continuity of the results from any study.

    Getting to the details of your question on cross-validation: Jeez, this really is complicated. I keep rereading your email over and over again and getting confused each time. So I’ll throw this one out to the commenters. I hope someone can give a useful suggestion . . .

    OK, I do have one idea, and that’s to evaluate your two procedures (1) and (2) using fake-data simulation: Start with a known universe, simulate fake data from that universe, then apply procedures (1) and (2) and see if they give much different answers. Loop the entire procedure and see what happens, comparing your cross-validation results to the underlying truth which in this case is assumed known. Fake-data simulation is the brute-force approach to this problem, and perhaps it’s a useful baseline to help understand your problem.
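
    A bare-bones version of that fake-data check might look like the following R sketch, in which a noisy scoring function stands in for the non-deterministic doc2vec plus logistic-regression pipeline; all names, sample sizes, and noise levels are made up for illustration:

    # known universe: a latent text signal drives replication success
    set.seed(123)
    n      <- 150
    signal <- rnorm(n)
    truth  <- rbinom(n, 1, plogis(signal))

    # stand-in for one non-deterministic run of doc2vec + logistic regression
    noisy_prob <- function(idx) plogis(signal[idx] + rnorm(length(idx), sd = 1))

    # AUC via the rank-sum (Mann-Whitney) formulation
    auc_score <- function(truth, prob) {
      r  <- rank(prob)
      n1 <- sum(truth == 1); n0 <- sum(truth == 0)
      (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }

    # strategy (1): one probability per study per shuffle, average over 100 shuffles, one AUC
    one_shuffle <- function() {
      fold <- sample(rep(1:3, length.out = n))
      prob <- numeric(n)
      for (k in 1:3) prob[fold == k] <- noisy_prob(which(fold == k))
      prob
    }
    auc1 <- auc_score(truth, rowMeans(replicate(100, one_shuffle())))

    # strategy (2): average 100 runs per fold within each shuffle, one AUC per shuffle
    one_shuffle_avg <- function() {
      fold <- sample(rep(1:3, length.out = n))
      prob <- numeric(n)
      for (k in 1:3) {
        idx       <- which(fold == k)
        prob[idx] <- rowMeans(replicate(100, noisy_prob(idx)))
      }
      auc_score(truth, prob)
    }
    aucs2 <- replicate(100, one_shuffle_avg())

    c(strategy1 = auc1, strategy2_mean = mean(aucs2), strategy2_sd = sd(aucs2))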

    The post “My advisor and I disagree on how we should carry out repeated cross-validation. We would love to have a third expert opinion…” appeared first on Statistical Modeling, Causal Inference, and Social Science.


    New talk: High Reliability Infrastructure Migrations

    On Tuesday I gave a talk at KubeCon called High Reliability Infrastructure Migrations. The abstract was:

    For companies with high availability requirements (99.99% uptime or higher), running new software in production comes with a lot of risks. But it’s possible to make significant infrastructure changes while maintaining the availability your customers expect! I’ll give you a toolbox for derisking migrations and making infrastructure changes with confidence, with examples from our Kubernetes & Envoy experience at Stripe.

    video

    slides

    Here are the slides:

    since everyone always asks, I drew them in the Notability app on an iPad. I do this because it’s faster than trying to use regular slides software and I can make better slides.

    a few notes

    Here are a few links & notes about things I mentioned in the talk

    skycfg: write functions, not YAML

    I talked about how my team is working on non-YAML interfaces for configuring Kubernetes. The demo is at skycfg.fun, and it’s on GitHub here. It’s based on Starlark, a configuration language that’s a subset of Python.

    My coworker John has promised that he’ll write a blog post about it at some point, and I’m hoping that’s coming soon :)

    no haunted forests

    I mentioned a deploy system rewrite we did. John has a great blog post about when rewrites are a good idea and how he approached that rewrite called no haunted forests.

    ignore most kubernetes ecosystem software

    One small point that I made in the talk was that on my team we ignore almost all software in the Kubernetes ecosystem so that we can focus on a few core pieces (Kubernetes & Envoy, plus some small things like kiam). I wanted to mention this because I think often in Kubernetes land it can seem like everyone is using Cool New Things (helm! istio! knative! eep!). I’m sure those projects are great but I find it much simpler to stay focused on the basics and I wanted people to know that it’s okay to do that if that’s what works for your company.

    I think the reality is that actually a lot of folks are still trying to work out how to use this new software in a reliable and secure way.

    other talks

    I haven’t watched other Kubecon talks yet, but here are 2 links:

    I heard good things about this keynote from melanie cebula about kubernetes at airbnb, and I’m excited to see this talk about kubernetes security. The slides from that security talk look useful

    Also I’m very excited to see Kelsey Hightower’s keynote as always, but that recording isn’t up yet. If you have other Kubecon talks to recommend I’d love to know what they are.

    my first work talk I’m happy with

    I usually give talks about debugging tools, or side projects, or how I approach my job at a high level – not on the actual work that I do at my job. What I talked about in this talk is basically what I’ve been learning how to do at work for the last ~2 years. Figuring out how to make big infrastructure changes safely took me a long time (and I’m not done!), and so I hope this talk helps other folks do the same thing.


    How to deploy a predictive service to Kubernetes with R and the AzureContainers package

    It's easy to create a function in R, but what if you want to call that function from a different application, with the scale to support a large number of simultaneous requests? This article shows how you can deploy an R fitted model as a Plumber web service in Kubernetes, using Azure Container Registry (ACR) and Azure Kubernetes Service (AKS). We use the AzureContainers package to create the necessary resources and deploy the service.

    Fit the model

    We’ll fit a simple model for illustrative purposes, using the Boston housing dataset (which ships with R in the MASS package). To make the deployment process more interesting, the model we fit will be a random forest, using the randomForest package. This is not part of R, so we’ll have to install it from CRAN.

    data(Boston, package="MASS")
    install.packages("randomForest")
    library(randomForest)
    
    # train a model for median house price
    bos_rf <- randomForest(medv ~ ., data=Boston, ntree=100)
    
    # save the model
    saveRDS(bos_rf, "bos_rf.rds")

    Scoring script for plumber

    Now that we have the model, we also need a script to obtain predicted values from it given a set of inputs:

    # save as bos_rf_score.R
    
    bos_rf <- readRDS("bos_rf.rds")
    library(randomForest)
    
    #* @param df data frame of variables
    #* @post /score
    function(req, df)
    {
        df <- as.data.frame(df)
        predict(bos_rf, df)
    }

    This is fairly straightforward, but the comments may require some explanation. They are plumber annotations that tell it to call the function if the server receives an HTTP POST request with the path /score, and query parameter df. The value of the df parameter is then converted to a data frame, and passed to the randomForest predict method. For a fuller description of how Plumber works, see the Plumber website.
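
    Once the server is running, a client can call the /score endpoint with an ordinary JSON POST. Here is a hedged example using the httr package; the host and port are placeholders for wherever the service ends up (locally while testing, or the cluster's public IP later), and plumber maps the df field of the JSON body to the df argument of the function above.

    library(httr)

    # score the first few rows of the Boston data (dropping medv, the response)
    newdata <- MASS::Boston[1:5, -14]

    resp <- POST("http://localhost:8000/score",
                 body = list(df = newdata),
                 encode = "json")
    content(resp)  # predicted median house prices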

    Create a Dockerfile

    Let’s package up the model and the scoring script into a Docker image. A Dockerfile to do this is shown below. This uses the base image supplied by Plumber (trestletech/plumber), installs randomForest, and then adds the model and the above scoring script. Finally, it runs the code that will start the server and listen on port 8000.

    # example Dockerfile to expose a plumber service
    
    FROM trestletech/plumber
    
    # install the randomForest package
    RUN R -e 'install.packages(c("randomForest"))'
    
    # copy model and scoring script
    RUN mkdir /data
    COPY bos_rf.rds /data
    COPY bos_rf_score.R /data
    WORKDIR /data
    
    # plumb and run server
    EXPOSE 8000
    ENTRYPOINT ["R", "-e", \
        "pr <- plumber::plumb('/data/bos_rf_score.R'); \
        pr$run(host='0.0.0.0', port=8000)"]

    Build and upload the image

The code to store our image on Azure Container Registry is as follows. This calls AzureRMR to log in to Azure Resource Manager, creates an Azure Container Registry resource (a Docker registry hosted in Azure), and then pushes the image to the registry.

    If this is the first time you are using AzureRMR, you’ll have to create a service principal first. For more information on how to do this, see the AzureRMR readme.

    library(AzureContainers)
    
    az <- AzureRMR::az_rm$new(
        tenant="myaadtenant.onmicrosoft.com",
        app="app_id",
        password="password")
    
    # create a resource group for our deployments
    deployresgrp <- az$
        get_subscription("subscription_id")$
        create_resource_group("deployresgrp", location="australiaeast")
    
    # create container registry
    deployreg_svc <- deployresgrp$create_acr("deployreg")
    
    # build image 'bos_rf'
    call_docker("build -t bos_rf .")
    
    # upload the image to Azure
    deployreg <- deployreg_svc$get_docker_registry()
    deployreg$push("bos_rf")

    If you run this code, you should see a lot of output indicating that R is downloading, compiling and installing randomForest, and finally that the image is being pushed to Azure. (You will see this output even if your machine already has the randomForest package installed. This is because the package is being installed to the R session inside the container, which is distinct from the one running the code shown here.)

    All docker calls in AzureContainers, like the one to build the image, return the actual docker commandline as the cmdline attribute of the (invisible) returned value. In this case, the commandline is docker build -t bos_rf . Similarly, the push() method actually involves two Docker calls, one to retag the image, and the second to do the actual pushing; the returned value in this case will be a 2-component list with the command lines being docker tag bos_rf deployreg.azurecr.io/bos_rf and docker push deployreg.azurecr.io/bos_rf.
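If you want to see exactly what was executed, for example to reproduce it in a shell script, you can capture the return value and inspect that attribute. A minimal sketch of my own, using the same image name as above:

# capture the (invisible) return value of a docker call
buildval <- call_docker("build -t bos_rf .")

# the docker command line that was actually run
attr(buildval, "cmdline")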

    Deploy to a Kubernetes cluster

    The code to create an AKS resource (a managed Kubernetes cluster in Azure) is quite simple:

    # create a Kubernetes cluster with 2 nodes, running Linux
    deployclus_svc <- deployresgrp$create_aks("deployclus",
        agent_pools=aks_pools("pool1", 2))

    Creating a Kubernetes cluster can take several minutes. By default, the create_aks() method will wait until the cluster provisioning is complete before it returns.

    Having created the cluster, we can deploy our model and create a service. We’ll use a YAML configuration file to specify the details for the deployment and service API.

    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: bos-rf
    spec:
      replicas: 1
      template:
        metadata:
          labels:
            app: bos-rf
        spec:
          containers:
          - name: bos-rf
            image: deployreg.azurecr.io/bos_rf
            ports:
            - containerPort: 8000
            resources:
              requests:
                cpu: 250m
              limits:
                cpu: 500m
          imagePullSecrets:
          - name: deployreg.azurecr.io
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: bos-rf-svc
    spec:
      selector:
        app: bos-rf
      type: LoadBalancer
      ports:
      - protocol: TCP
        port: 8000

The following code will obtain the cluster endpoint from the AKS resource and then deploy the image and service to the cluster. The configuration details for the deployclus cluster are stored in a file located in the R temporary directory; all of the cluster’s methods will use this file. Unless told otherwise, AzureContainers does not touch your default Kubernetes configuration (~/.kube/config).

    # get the cluster endpoint
    deployclus <- deployclus_svc$get_cluster()
    
    # pass registry authentication details to the cluster
    deployclus$create_registry_secret(deployreg,
        email="me@example.com")
    
    # create and start the service
    deployclus$create("bos_rf.yaml")

    To check on the progress of the deployment, run the get() methods specifying the type and name of the resource to get information on. As with Docker, these correspond to calls to the kubectl commandline tool, and again, the actual commandline is stored as the cmdline attribute of the returned value.

    deployclus$get("deployment bos-rf")
    #> Kubernetes operation: get deployment bos-rf  --kubeconfig=".../kubeconfigxxxx"
    #> NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
    #> bos-rf    1         1         1            1           5m
    
    deployclus$get("service bos-rf-svc")
    #> Kubernetes operation: get service bos-rf-svc  --kubeconfig=".../kubeconfigxxxx"
    #> NAME         TYPE           CLUSTER-IP   EXTERNAL-IP     PORT(S)          AGE
    #> bos-rf-svc   LoadBalancer   10.0.8.189   52.187.249.58   8000:32276/TCP   5m 

Once the service is up and running, as indicated by the presence of an external IP in the service details, let’s test it with an HTTP request. The response should look like this.

    response <- httr::POST("http://52.187.249.58:8000/score",
        body=list(df=MASS::Boston[1:10,]), encode="json")
    httr::content(response, simplifyVector=TRUE)
    #> [1] 25.9269 22.0636 34.1876 33.7737 34.8081 27.6394 21.8007 22.3577 16.7812 18.9785

    Finally, once we are done, we can tear down the service and deployment:

    deployclus$delete("service", "bos-rf-svc")
    deployclus$delete("deployment", "bos-rf")

    And if required, we can also delete all the resources created here, by simply deleting the resource group (AzureContainers will prompt you for confirmation):

    deployresgrp$delete()

    See also

An alternative to Plumber is the model operationalisation framework found in Microsoft Machine Learning Server. While it is proprietary software, unlike the open-source Plumber, ML Server provides a number of features not available in the latter. These include model management, so that you can easily access multiple versions of a given model; user authentication, so that only authorised users can access your service; and batch (asynchronous) requests. For more information, see the MMLS documentation.


    Are the holidays really the hardest time of year? The stats are surprising

    Studies have found that mental health-related ER visits decrease around Christmas, despite an array of stresses

Night falls before you’ve left work, your very best winter coat doesn’t really keep you warm enough and the only thing you have to look forward to is sitting on your parents’ couch and realizing that you haven’t really changed since you were 14. The holidays are hard.

    But are they harder than the rest of the year? It’s a question I was asked by one of you and the data I found came as a surprise to me. In 2011, a study titled The Christmas Effect on Psychopathology reviewed the available research on this question (psychopathology is the study of mental health). The authors found that ER visits for mental health issues actually fell during the week of Christmas.


    Day 15 – little helper sci_palette

    (This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

We at STATWORX work a lot with R and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but I soon realised that it would be much easier to share and improve those functions if they were within a package. Up until the 24th of December I will present one function each day from helfRlein. So, on the 15th day of Christmas my true love gave to me…


    What can it do?

This little helper returns a set of colours which we often use at STATWORX. So, if – like me – you cannot remember every hex colour code you need, this might help. Of course these are our colours, but you could rewrite it with your own palette (a small sketch of that is shown further below). The main benefit, though, is the plotting method – so you can see the colour instead of only reading the hex code.

    How to use it?

To see which hex code corresponds to which colour and what each one is used for, just call the function:

    sci_palette()
    
    main_color accent_color_1 accent_color_2 accent_color_3    highlight        black 
     "#013848"      "#0085AF"      "#00A378"      "#09557F"    "#FF8000"    "#000000" 
          text         grey_2     light_gray        special 
     "#696969"      "#D9D9D9"      "#F8F8F8"      "#C62F4B" 
    attr(,"class")
    [1] "sci"
    

    As mentioned above, there is a plot() method which gives the following picture.

    plot(sci_palette())
    

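If you would like to build something similar around your own colours, here is a minimal sketch of my own (not part of helfRlein; the function names and hex codes are made up, so replace them with yours):

# a tiny custom palette in the spirit of sci_palette()
my_palette <- function() {
  pal <- c(main = "#013848", accent = "#0085AF", highlight = "#FF8000")
  class(pal) <- "pal"
  pal
}

# a simple plot method, so you see the colours instead of reading hex codes
plot.pal <- function(x, ...) {
  barplot(rep(1, length(x)), col = as.character(x), names.arg = names(x),
          axes = FALSE, border = NA, ...)
}

plot(my_palette())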

    Overview

    To see all the other functions you can either check out our GitHub or you can read about them here.

    Have a merry advent season!

About the author

Jakob Gepp

Numbers were always my passion and as a data scientist and statistician at STATWORX I can fulfil my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!


The post Day 15 – little helper sci_palette appeared first on STATWORX.

    To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX.


    Royal Statistical Society Christmas quiz: 25th anniversary edition

    Solving the RSS’s fiendishly tricky festive quiz will require general knowledge, logic and lateral thinking

    For the last quarter-century, the Royal Statistical Society has published a fiendishly difficult Christmas quiz to entertain puzzle fans over the festive break – and this year’s special 25th anniversary edition, devised by Dr Tim Paulden, is sure to get the cogs spinning after a glass or two of mulled wine. Cracking the 15 problems below will require a potent mix of general knowledge, logic, and lateral thinking – but, as usual, no specialist mathematical knowledge is needed.

    Two helpful tips for budding solvers:

    Continue reading...


    Magister Dixit

“There's only one purpose to data science, and that is to support decisions. And more specifically, to make better decisions. That should be something no one can argue with.” Eugene Dubossarsky (February 15, 2018)


    If you did not already know

Francy
    Data visualization and interaction with large data sets is known to be essential and critical in many businesses today, and the same applies to research and teaching, in this case, when exploring large and complex mathematical objects. GAP is a computer algebra system for computational discrete algebra with an emphasis on computational group theory. The existing XGAP package for GAP works exclusively on the X Window System. It lacks abstraction between its mathematical and graphical cores, making it difficult to extend, maintain, or port. In this paper, we present Francy, a graphical semantics package for GAP. Francy is responsible for creating a representational structure that can be rendered using many GUI frameworks independent from any particular programming language or operating system. Building on this, we use state of the art web technologies that take advantage of an improved REPL environment, which is currently under development for GAP. The integration of this project with Jupyter provides a rich graphical environment full of features enhancing the usability and accessibility of GAP. …

Probabilistic Latent Semantic Analysis (PLSA)
    We consider the problem of discovering the simplest latent variable that can make two observed discrete variables conditionally independent. This problem has appeared in the literature as probabilistic latent semantic analysis (pLSA), and has connections to non-negative matrix factorization. When the simplicity of the variable is measured through its cardinality, we show that a solution to this latent variable discovery problem can be used to distinguish direct causal relations from spurious correlations among almost all joint distributions on simple causal graphs with two observed variables. Conjecturing a similar identifiability result holds with Shannon entropy, we study a loss function that trades-off between entropy of the latent variable and the conditional mutual information of the observed variables. We then propose a latent variable discovery algorithm — LatentSearch — and show that its stationary points are the stationary points of our loss function. We experimentally show that LatentSearch can indeed be used to distinguish direct causal relations from spurious correlations.
    Entropic Latent Variable Discovery


Attention Gated Network
    We propose a novel attention gate (AG) model for medical image analysis that automatically learns to focus on target structures of varying shapes and sizes. Models trained with AGs implicitly learn to suppress irrelevant regions in an input image while highlighting salient features useful for a specific task. This enables us to eliminate the necessity of using explicit external tissue/organ localisation modules when using convolutional neural networks (CNNs). AGs can be easily integrated into standard CNN models such as VGG or U-Net architectures with minimal computational overhead while increasing the model sensitivity and prediction accuracy. The proposed AG models are evaluated on a variety of tasks, including medical image classification and segmentation. For classification, we demonstrate the use case of AGs in scan plane detection for fetal ultrasound screening. We show that the proposed attention mechanism can provide efficient object localisation while improving the overall prediction performance by reducing false positives. For segmentation, the proposed architecture is evaluated on two large 3D CT abdominal datasets with manual annotations for multiple organs. Experimental results show that AG models consistently improve the prediction performance of the base architectures across different datasets and training sizes while preserving computational efficiency. Moreover, AGs guide the model activations to be focused around salient regions, which provides better insights into how model predictions are made. The source code for the proposed AG models is publicly available. …


    Distilled News

    Is The Enterprise Knowledge Graph Finally Going To Make All Data Usable?

    When we ask Siri, Alexa or Google Home a question, we often get alarmingly relevant answers. Why? And more importantly, why don’t we get the same quality of answers and smooth experience in our businesses where the stakes are so much higher? The answer is that these services are all powered by extensive knowledge graphs that allow the questions to be mapped to an organized set of information that can often provide the answer we want. Is it impossible for anyone but the big tech companies to organize information and deliver a pleasing experience? In my view, the answer is no. The technology to collect and integrate data so we can know more about our businesses is being delivered in different ways by a number of products. Only a few use constructs similar to a knowledge graph.


    RStudio Pandoc – HTML To Markdown

    (This article was first published on R on YIHAN WU, and kindly contributed to R-bloggers)

The knitr and rmarkdown packages are used in conjunction with pandoc to convert R code and figures to a variety of formats, including PDF and Word. Here, I’m exploring how to convert HTML back to markdown format. This post came about when I was searching for how to convert XML to markdown, which I still haven’t found an easy way to do. Pandoc is not the only way to convert HTML to markdown (see turndown and html2text).

Pandoc is packaged with RStudio, and on Windows the executables are located in Program Files/RStudio/bin/pandoc. The rmarkdown package contains wrapper functions for using pandoc within RStudio.

    Here, I am trying to convert this example HTML page back to markdown using the function pandoc_convert. First, pandoc_convert requires an actual file which means it does not accept a quoted string of HTML code in its input argument.
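One small workaround (my own aside, not from the original post): if all you have is a character string of HTML, you can write it to a temporary file first and convert that.

# pandoc_convert() needs a file on disk, so write the HTML string out first
html_string <- "<h1>A heading</h1><p>Some <b>bold</b> text.</p>"
tmp <- tempfile(fileext = ".html")
writeLines(html_string, tmp)
rmarkdown::pandoc_convert(tmp, to = "markdown_strict")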

The example html:

<html>
<head>
<title>Enter a title, displayed at the top of the window.</title>
</head>
<body>

<h1>Enter the main heading, usually the same as the title.</h1>

<p>Be <b>bold</b> in stating your key points. Put them in a list:</p>
<ul>
<li>The first item in your list</li>
<li>The second item; <i>italicize</i> key words</li>
</ul>

<p>Improve your image by including an image.</p>

<p><img src="http://www.mygifs.com/CoverImage.gif" alt="A Great HTML Resource"></p>

<p>Add a link to your favorite <a href="https://www.dummies.com/">Web site</a>.
Break up your page with a horizontal rule or two.</p>

<hr>

<p>Finally, link to <a href="page2.html">another page</a> in your own Web site.</p>

<p>&copy; Wiley Publishing, 2011</p>

</body>
</html>

    I saved the HTML example here as example.html.

    html_page <- readLines("../../static/files/example.html")

    We can print the object in R.

    cat(html_page)
## <html> <head> <title>Enter a title, displayed at the top of the window.</title>
## </head> <body> <h1>Enter the main heading, usually the same as the title.</h1>
## <p>Be <b>bold</b> in stating your key points. Put them in a list:</p> <ul>
## <li>The first item in your list</li> <li>The second item; <i>italicize</i> key words</li>
## </ul> <p>Improve your image by including an image.</p>
## <p><img src="http://www.mygifs.com/CoverImage.gif" alt="A Great HTML Resource"></p>
## <p>Add a link to your favorite <a href="https://www.dummies.com/">Web site</a>.
## Break up your page with a horizontal rule or two.</p> <hr>
## <p>Finally, link to <a href="page2.html">another page</a> in your own Web site.</p>
## <p>&copy; Wiley Publishing, 2011</p> </body> </html>

Pandoc can convert between many different formats, and for markdown it has multiple variants, including the GitHub-flavored variant (for GitHub) and PHP Markdown Extra (the variant used by WordPress sites).

The safest variant to pick is markdown_strict, which is the original markdown variant.

Pandoc requires the file path, which in my case points to a different directory than my working directory.

    library(rmarkdown)
    file_path <- "../../static/files/example.html"
    pandoc_convert(file_path, to = "markdown_strict")
    Enter the main heading, usually the same as the title.
    ======================================================
    
    Be **bold** in stating your key points. Put them in a list:
    
    -   The first item in your list
    -   The second item; *italicize* key words
    
    Improve your image by including an image.
    
    ![A Great HTML Resource](http://www.mygifs.com/CoverImage.gif)
    
    Add a link to your favorite [Web site](https://www.dummies.com/). Break
    up your page with a horizontal rule or two.
    
    ------------------------------------------------------------------------
    
    Finally, link to [another page](page2.html) in your own Web site.
    
    © Wiley Publishing, 2011
    

    Notice that heading 1 is formatted with ==== rather than the # that RMarkdown seems to favor. We can require pandoc to use the # during the conversion by adding an argument.

    pandoc_convert(file_path, to = "markdown_strict", options = c("--atx-headers"))
    # Enter the main heading, usually the same as the title.
    
    Be **bold** in stating your key points. Put them in a list:
    
    -   The first item in your list
    -   The second item; *italicize* key words
    
    Improve your image by including an image.
    
    ![A Great HTML Resource](http://www.mygifs.com/CoverImage.gif)
    
    Add a link to your favorite [Web site](https://www.dummies.com/). Break
    up your page with a horizontal rule or two.
    
    ------------------------------------------------------------------------
    
    Finally, link to [another page](page2.html) in your own Web site.
    
    © Wiley Publishing, 2011

    Right now, the output is being piped to the console. A file can be created instead with:

    pandoc_convert(file_path, to = "markdown_strict", output = "example.md")

    Pandoc has a multitude of styling extensions for markdown variants, all listed on the manual page.

Pandoc ignores everything enclosed in HTML comments (<!-- -->). When converting from markdown to HTML, these comments are usually placed as-is in the HTML document, but the opposite does not seem to be true.
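A quick way to check this for yourself (my own example, not from the original post):

# HTML comments are dropped when converting HTML to markdown
tmp <- tempfile(fileext = ".html")
writeLines("<p>visible text</p><!-- this comment is not carried over -->", tmp)
rmarkdown::pandoc_convert(tmp, to = "markdown_strict")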

    Lastly, this was tested using pandoc version 1.19.2.1. Pandoc 2.5 was released last month.
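If you want to check which pandoc version your own RStudio installation bundles, rmarkdown provides a helper:

# report the pandoc version that rmarkdown will use
rmarkdown::pandoc_version()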

    To leave a comment for the author, please follow the link and comment on their blog: R on YIHAN WU.


    Advent of Code: Most Popular Languages

    (This article was first published on Posts on Maëlle's R blog, and kindly contributed to R-bloggers)

You might have heard of the Advent of Code, a 25-day challenge involving a programming puzzle a day, to be solved with the language of your choice. I’ve noted the popularity of this activity in my Twitter timeline, but also in my GitHub timeline, where I’ve seen the creation of a few advent-of-code (or similarly named) repositories.

    If I were to participate one year, I’d probably use R. Jenny Bryan’s
    tweet above inspired me to try and gauge the popularity of languages
    used in the Advent of Code. To do that, in this post, I shall use the
    search endpoint of GitHub V3 API to identify Advent of Code 2018 repos.

    Searching repositories on GitHub

    Study design 😉

GitHub’s V3 API offers a search endpoint that, however, gives you fewer results than doing the same search via the web interface, even when using pagination right (or at least, when believing I use pagination right!). I’m nonetheless willing to use that sub-sample as the basis for my study of language popularity. It’s actually a sub-sub-sample, since I’m only looking at Advent of Code projects published on GitHub.

    In order to circumvent the sub-sub-sampling a bit, I’ll do the search in
    two steps:

• Searching for Advent of Code 2018 in general among repos, and extracting the language of the repos.

• Searching for Advent of Code 2018 for each of these languages separately, and extracting the total count of hits.

    Note that I am not filtering the repos by activity, so some of them
    could very well have been created for a few days only. If they are empty
    though, they do not get assigned a language.

Regarding the language of repos, GitHub assigns a language to each repository. This information can be wrong, which is e.g. mentioned in rOpenSci’s development guide. Furthermore, my using this piece of information means I’m disregarding the fact that some people actually use a mix of technologies to solve the puzzles.

    Actual queries

    I first defined a function to search the API whilst respecting the rate
    limiting. I even erred on the side of caution and queried very slowly.

    .search <- function(page){
      gh::gh("GET /search/repositories",
             q = "adventofcode 2018",
             page = page,
             fork = FALSE)
    }
    
    search <- ratelimitr::limit_rate(.search,
                                    ratelimitr::rate(10, 60))
    

    I then wrote two other functions to help me rectangle the API output for
    each repository.

    empty_null <- function(x){
      if(is.null(x)){
        ""
      }else{
        x
      }
    }
    
    rectangle <- function(item){
      tibble::tibble(full_name = item$full_name,
                     language = empty_null(item$language))
    }
    

    I created a function putting these two pieces together.

    get_page <- function(page){
      results <- try(search(page), silent = TRUE)
    
      # an early return
      if(inherits(results, "try-error")){
        return(NULL)
      }
    
      purrr::map_df(results$items,
                    rectangle)
    }
    

    And I then ran the following pipeline.

    total_count <- search(1)$total_count
    pages <- 1:(ceiling(total_count/100))
    
    results <- purrr::map_df(pages, get_page)
    results <- unique(results)
    
    languages <- unique(results$language)
    languages <- languages[languages != ""]
    

This got me 814 repos, with 46 non-empty languages. Repo names are quite
    varied: rdmueller/aoc-2018, petertseng/adventofcode-rb-2018,
    NiXXeD/adventofcode, Arxcis/adventofcode2018,
    Stupremee/adventofcode-2018, phaazon/advent-of-code-2k18.

    With that information obtained, I was able to run a query by language.

    .get_one_language_count <- function(language){
      gh::gh("GET /search/repositories",
             q = glue::glue("adventofcode 2018&language:{language}"),
             fork = FALSE)$total_count -> count
      tibble::tibble(language = language,
                     count = count)
    }
    
    get_one_language_count <- ratelimitr::limit_rate(.get_one_language_count,
                                     ratelimitr::rate(10, 60))
    
    counts <- purrr::map_df(languages,
                            get_one_language_count)
    

    In total, the counts table contains information about 2080
    repositories, a bit less than half the number of Advent of code 2018
    repositories I’d find via the web interface.

    Advent of Code’s languages popularity

    I’ll concentrate on the 15 most popular languages in the sample, which
    automatically excludes R with… 8 repositories only.

    library("ggplot2")
    library("ggalt")
    library("hrbrthemes")
    library("magrittr")
    
    counts %>%
      dplyr::arrange(- count) %>%
      head(n = 15) %>%
      dplyr::mutate(language = reorder(language, count))   %>%
      ggplot() +
      geom_lollipop(aes(language, count),
                    size = 2, col = "salmon") +
      hrbrthemes::theme_ipsum(base_size = 16,
                              axis_title_size = 16) +
      coord_flip() +
      ggtitle("Advent of Code Languages",
              subtitle = "Among a sample of GitHub repositories, with language information from GitHub linguist")
    

    popularity of languages

The results are not surprising after reading e.g. the insights from Stack Overflow’s 2018 survey, although the Python domination is crazy! I find it interesting to reflect on the fact that Jenny Bryan says the challenge is best for C or C++, which are not the most popular languages in these samples… but still more popular than R, ok.

    Conclusion

    In this post I used GitHub V3 API to get a glimpse at the popularity of languages used to solve the Advent of Code. Further work could include looking at the completion of the challenge by language, potentially using the GitHub activity of each repo as an (imperfect) proxy.
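As a very rough sketch of that last idea (my own, not from the post; the repository is just one of the examples listed above), one could count recent commits with the same gh package:

# commits made to one sampled repo since the start of December 2018,
# as a crude proxy for challenge activity
commits <- gh::gh("GET /repos/NiXXeD/adventofcode/commits",
                  since = "2018-12-01T00:00:00Z",
                  .limit = Inf)
length(commits)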

I do not take part in the challenge myself; my principal Advent-specific activity is instead the lazy and delightful watching of the Swedish TV channel SVT’s Adventskalender for kids. Incidentally, this year’s storyline includes a Christmas competition, which however features competitive eating of saffron buns and gingerbread-house building rather than programming puzzles… Do you participate in Advent of Code this year? If so, with which language and why?

    To leave a comment for the author, please follow the link and comment on their blog: Posts on Maëlle's R blog.


    GDP predictions are reliable only in the short term

    They perform far better when forecasting growth years than downturns


    Manipulate dates easily with {lubridate}

    (This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

    This blog post is an excerpt of my ebook Modern R with the tidyverse that you can read for
    free here. This is taken from Chapter 5, which presents
    the {tidyverse} packages and how to use them to compute descriptive statistics and manipulate data.
    In the text below, I scrape a table from Wikipedia, which shows when African countries gained
    independence from other countries. Then, using {lubridate} functions I show you how you can
answer questions such as Which countries gained independence before 1960?

    Set-up: scraping some data from Wikipedia

    {lubridate} is yet another tidyverse package, that makes dealing with dates or duration data
    (and intervals) as painless as possible. I do not use every function contained in the package
    daily, and as such will only focus on some of the functions. However, if you have to deal with
    dates often, you might want to explore the package thoroughly.

    Let’s get some data from a Wikipedia table:

    library(tidyverse)
    library(rvest)
    page <- read_html("https://en.wikipedia.org/wiki/Decolonisation_of_Africa")
    
    independence <- page %>%
        html_node(".wikitable") %>%
        html_table(fill = TRUE)
    
    independence <- independence %>%
        select(-Rank) %>%
        map_df(~str_remove_all(., "\\[.*\\]")) %>%
        rename(country = `Country[a]`,
               colonial_name = `Colonial name`,
               colonial_power = `Colonial power[b]`,
               independence_date = `Independence date`,
               first_head_of_state = `First head of state[d]`,
               independence_won_through = `Independence won through`)

    This dataset was scraped from the following Wikipedia table.
    It shows when African countries gained independence from which colonial powers. In Chapter 11, I
    will show you how to scrape Wikipedia pages using R. For now, let’s take a look at the contents
    of the dataset:

    independence
    ## # A tibble: 54 x 6
    ##    country colonial_name colonial_power independence_da… first_head_of_s…
##    <chr>   <chr>         <chr>          <chr>            <chr>           
    ##  1 Liberia Liberia       United States  26 July 1847     Joseph Jenkins …
    ##  2 South … Cape Colony … United Kingdom 31 May 1910      Louis Botha     
    ##  3 Egypt   Sultanate of… United Kingdom 28 February 1922 Fuad I          
    ##  4 Eritrea Italian Erit… Italy          10 February 1947 Haile Selassie  
    ##  5 Libya   British Mili… United Kingdo… 24 December 1951 Idris           
    ##  6 Sudan   Anglo-Egypti… United Kingdo… 1 January 1956   Ismail al-Azhari
    ##  7 Tunisia French Prote… France         20 March 1956    Muhammad VIII a…
    ##  8 Morocco French Prote… France Spain   2 March 19567 A… Mohammed V      
    ##  9 Ghana   Gold Coast    United Kingdom 6 March 1957     Kwame Nkrumah   
    ## 10 Guinea  French West … France         2 October 1958   Ahmed Sékou Tou…
    ## # ... with 44 more rows, and 1 more variable:
## #   independence_won_through <chr>

As you can see, the date of independence is in a format that might make it difficult to answer questions such as Which African countries gained independence before 1960? for two reasons. First of all, the date uses the name of the month instead of the number of the month (well, this is not such a big deal, but still), and second of all, the type of the independence date column is character and not “date”. So our first task is to correctly define the column as being of type date, while making sure that R understands that January is supposed to be “01”, and so on.

    Using {lubridate}

There are several helpful functions included in {lubridate} to convert columns to dates. For instance, if the column you want to convert is of the form “2012-11-21”, then you would use the function ymd(), for “year-month-day”. If, however, the column is “2012-21-11”, then you would use ydm(). There are a few of these helper functions, and they can handle a lot of different formats for dates. In our case, having the name of the month instead of the number might seem quite problematic, but it turns out that this is a case that {lubridate} handles painlessly:

    library(lubridate)
    ## 
    ## Attaching package: 'lubridate'
    ## The following object is masked from 'package:base':
    ## 
    ##     date
    independence <- independence %>%
      mutate(independence_date = dmy(independence_date))
    ## Warning: 5 failed to parse.

    Some dates failed to parse, for instance for Morocco. This is because these countries have several
    independence dates; this means that the string to convert looks like:

    "2 March 1956
    7 April 1956
    10 April 1958
    4 January 1969"

    which obviously cannot be converted by {lubridate} without further manipulation. I ignore these cases for
    simplicity’s sake.

    Let’s take a look at the data now:

    independence
    ## # A tibble: 54 x 6
    ##    country colonial_name colonial_power independence_da… first_head_of_s…
##    <chr>   <chr>         <chr>          <date>           <chr>           
    ##  1 Liberia Liberia       United States  1847-07-26       Joseph Jenkins …
    ##  2 South … Cape Colony … United Kingdom 1910-05-31       Louis Botha     
    ##  3 Egypt   Sultanate of… United Kingdom 1922-02-28       Fuad I          
    ##  4 Eritrea Italian Erit… Italy          1947-02-10       Haile Selassie  
    ##  5 Libya   British Mili… United Kingdo… 1951-12-24       Idris           
    ##  6 Sudan   Anglo-Egypti… United Kingdo… 1956-01-01       Ismail al-Azhari
    ##  7 Tunisia French Prote… France         1956-03-20       Muhammad VIII a…
    ##  8 Morocco French Prote… France Spain   NA               Mohammed V      
    ##  9 Ghana   Gold Coast    United Kingdom 1957-03-06       Kwame Nkrumah   
    ## 10 Guinea  French West … France         1958-10-02       Ahmed Sékou Tou…
    ## # ... with 44 more rows, and 1 more variable:
## #   independence_won_through <chr>

    As you can see, we now have a date column in the right format. We can now answer questions such as
    Which countries gained independence before 1960? quite easily, by using the functions year(),
    month() and day(). Let’s see which countries gained independence before 1960:

    independence %>%
      filter(year(independence_date) <= 1960) %>%
      pull(country)
    ##  [1] "Liberia"                          "South Africa"                    
    ##  [3] "Egypt"                            "Eritrea"                         
    ##  [5] "Libya"                            "Sudan"                           
    ##  [7] "Tunisia"                          "Ghana"                           
    ##  [9] "Guinea"                           "Cameroon"                        
    ## [11] "Togo"                             "Mali"                            
    ## [13] "Madagascar"                       "Democratic Republic of the Congo"
    ## [15] "Benin"                            "Niger"                           
    ## [17] "Burkina Faso"                     "Ivory Coast"                     
    ## [19] "Chad"                             "Central African Republic"        
    ## [21] "Republic of the Congo"            "Gabon"                           
    ## [23] "Mauritania"

You guessed it, year() extracts the year of the date column and converts it to a numeric so that we can work on it. This is the same for month() or day(). Let’s try to see if countries gained their independence on Christmas Eve:

    independence %>%
      filter(month(independence_date) == 12,
             day(independence_date) == 24) %>%
      pull(country)
    ## [1] "Libya"

Seems like Libya was the only one! You can also operate on dates. For instance, let’s compute the difference between two dates, using the interval() function:

    independence %>%
      mutate(today = lubridate::today()) %>%
      mutate(independent_since = interval(independence_date, today)) %>%
      select(country, independent_since)
    ## # A tibble: 54 x 2
    ##    country      independent_since             
##    <chr>        <Interval>                    
    ##  1 Liberia      1847-07-26 UTC--2018-12-15 UTC
    ##  2 South Africa 1910-05-31 UTC--2018-12-15 UTC
    ##  3 Egypt        1922-02-28 UTC--2018-12-15 UTC
    ##  4 Eritrea      1947-02-10 UTC--2018-12-15 UTC
    ##  5 Libya        1951-12-24 UTC--2018-12-15 UTC
    ##  6 Sudan        1956-01-01 UTC--2018-12-15 UTC
    ##  7 Tunisia      1956-03-20 UTC--2018-12-15 UTC
    ##  8 Morocco      NA--NA                        
    ##  9 Ghana        1957-03-06 UTC--2018-12-15 UTC
    ## 10 Guinea       1958-10-02 UTC--2018-12-15 UTC
    ## # ... with 44 more rows

    The independent_since column now contains an interval object that we can convert to years:

    independence %>%
      mutate(today = lubridate::today()) %>%
      mutate(independent_since = interval(independence_date, today)) %>%
      select(country, independent_since) %>%
      mutate(years_independent = as.numeric(independent_since, "years"))
    ## # A tibble: 54 x 3
    ##    country      independent_since              years_independent
##    <chr>        <Interval>                                 <dbl>
    ##  1 Liberia      1847-07-26 UTC--2018-12-15 UTC             171. 
    ##  2 South Africa 1910-05-31 UTC--2018-12-15 UTC             109. 
    ##  3 Egypt        1922-02-28 UTC--2018-12-15 UTC              96.8
    ##  4 Eritrea      1947-02-10 UTC--2018-12-15 UTC              71.8
    ##  5 Libya        1951-12-24 UTC--2018-12-15 UTC              67.0
    ##  6 Sudan        1956-01-01 UTC--2018-12-15 UTC              63.0
    ##  7 Tunisia      1956-03-20 UTC--2018-12-15 UTC              62.7
    ##  8 Morocco      NA--NA                                      NA  
    ##  9 Ghana        1957-03-06 UTC--2018-12-15 UTC              61.8
    ## 10 Guinea       1958-10-02 UTC--2018-12-15 UTC              60.2
    ## # ... with 44 more rows

    We can now see for how long the last country to gain independence has been independent.
    Because the data is not tidy (in some cases, an African country was colonized by two powers,
    see Libya), I will only focus on 4 European colonial powers: Belgium, France, Portugal and the United Kingdom:

    independence %>%
      filter(colonial_power %in% c("Belgium", "France", "Portugal", "United Kingdom")) %>%
      mutate(today = lubridate::today()) %>%
      mutate(independent_since = interval(independence_date, today)) %>%
      mutate(years_independent = as.numeric(independent_since, "years")) %>%
      group_by(colonial_power) %>%
      summarise(last_colony_independent_for = min(years_independent, na.rm = TRUE))
    ## # A tibble: 4 x 2
    ##   colonial_power last_colony_independent_for
##   <chr>                                <dbl>
    ## 1 Belgium                               56.5
    ## 2 France                                41.5
    ## 3 Portugal                              43.1
    ## 4 United Kingdom                        42.5

    {lubridate} contains many more functions. If you often work with dates, duration or interval data, {lubridate}
    is a package that you have to master.

    Hope you enjoyed! If you found this blog post useful, you might want to follow
    me on twitter for blog post updates and
    buy me an espresso or paypal.me.


    To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.


    December 14, 2018

    Spark + AI Summit: learn best practices in ML and DL, latest frameworks, and more – special KDnuggets offer

Check the agenda for the Spark + AI Summit in San Francisco on April 23-25, 2019, comprising 12 technical tracks on data and AI across verticals, and get the biggest discount: $700 off until Dec 31.


    LoyaltyOne: Manager, CPG [Westborough, MA]

LoyaltyOne is seeking a Manager, CPG in Westborough, MA, to be responsible for the overall management of client engagements on the CPG team, leveraging your analytical background and related CPG experience to motivate others to action, both internally and with external clients.


    Document worth reading: “Small Sample Learning in Big Data Era”

    As a promising area in artificial intelligence, a new learning paradigm, called Small Sample Learning (SSL), has been attracting prominent research attention in the recent years. In this paper, we aim to present a survey to comprehensively introduce the current techniques proposed on this topic. Specifically, current SSL techniques can be mainly divided into two categories. The first category of SSL approaches can be called ‘concept learning’, which emphasizes learning new concepts from only few related observations. The purpose is mainly to simulate human learning behaviors like recognition, generation, imagination, synthesis and analysis. The second category is called ‘experience learning’, which usually co-exists with the large sample learning manner of conventional machine learning. This category mainly focuses on learning with insufficient samples, and can also be called small data learning in some literatures. More extensive surveys on both categories of SSL techniques are introduced and some neuroscience evidences are provided to clarify the rationality of the entire SSL regime, and the relationship with human learning process. Some discussions on the main challenges and possible future research directions along this line are also presented. Small Sample Learning in Big Data Era


    LoyaltyOne: Consultant Category Manager / Analyst, Client Services [Westborough, MA]

    Precima is seeking a Consultant Category Manager / Analyst, Client Services in Westborough, MA, to work collaboratively with our client and consulting analytics team on behalf of Precima clients to develop insight-driven recommendations.


    Learning R: A gentle introduction to higher-order functions

    (This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)

    Have you ever thought about why the definition of a function in R is different from many other programming languages? The part that causes the biggest difficulties (especially for beginners of R) is that you state the name of the function at the beginning and use the assignment operator – as if functions were like any other data type, like vectors, matrices or data frames…

    Congratulations! You just encountered one of the big ideas of functional programming: functions are indeed like any other data type, they are not special – or in programming lingo, functions are first-class members. Now, you might ask: So what? Well, there are many ramifications, for example that you could use functions on other functions by using one function as an argument for another function. Sounds complicated?

    In mathematics most of you will be familiar with taking the derivative of a function. When you think about it you could say that you put one function into the derivative function (or operator) and get out another function!
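To make this concrete, here is a tiny illustration of my own (not from the original post) of a function that takes a function and returns a new function, namely a numerical derivative:

# a higher-order function: takes a function f, returns an approximation of f'
derivative <- function(f, h = 1e-6) {
  function(x) (f(x + h) - f(x)) / h
}

dsin <- derivative(sin)
dsin(0)  # the derivative of sin at 0 is cos(0) = 1
## [1] 1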

    In R there are many applications as well, let us go through a simple example step by step.

    Let’s say I want to apply the mean function on the first four columns of the iris dataset. I could do the following:

    mean(iris[ , 1])
    ## [1] 5.843333
    mean(iris[ , 2])
    ## [1] 3.057333
    mean(iris[ , 3])
    ## [1] 3.758
    mean(iris[ , 4])
    ## [1] 1.199333
    

    Quite tedious and not very elegant. Of course, we can use a for loop for that:

    for (x in iris[1:4]) {
      print(mean(x))
    }
    ## [1] 5.843333
    ## [1] 3.057333
    ## [1] 3.758
    ## [1] 1.199333
    

    This works fine but there is an even more intuitive approach. Just look at the original task: “apply the mean function on the first four columns of the iris dataset” – so let us do just that:

    apply(iris[1:4], 2, mean)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    ##     5.843333     3.057333     3.758000     1.199333
    

    Wow, this is very concise and works perfectly (the 2 just stands for “go through the data column wise”, 1 would be for “row wise”). apply is called a “higher-order function” and we could use it with all kinds of other functions:

    apply(iris[1:4], 2, sd)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    ##    0.8280661    0.4358663    1.7652982    0.7622377
    apply(iris[1:4], 2, min)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    ##          4.3          2.0          1.0          0.1
    apply(iris[1:4], 2, max)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    ##          7.9          4.4          6.9          2.5
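Going back to the second argument of apply mentioned above (2 for columns, 1 for rows), here is the row-wise case as well (a small addition of my own):

# row-wise: sum the four measurements for each flower, shown for the first few rows
head(apply(iris[1:4], 1, sum))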
    

    You can also use user-defined functions:

    midrange <- function(x) (min(x) + max(x)) / 2
    apply(iris[1:4], 2, midrange)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    ##         6.10         3.20         3.95         1.30
    

    We can even use new functions that are defined “on the fly” (or in functional programming lingo “anonymous functions”):

    apply(iris[1:4], 2, function(x) (min(x) + max(x)) / 2)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    ##         6.10         3.20         3.95         1.30
    

    Let us now switch to another inbuilt data set, the mtcars dataset with 11 different variables of 32 cars (if you want to find out more, please consult the documentation):

    head(mtcars)
    ##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
    ## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    ## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    ## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    ## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    ## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    ## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
    

    To see the power of higher-order functions let us create a (numeric) matrix with minimum, first quartile, median, mean, third quartile and maximum for all 11 columns of the mtcars dataset with just one command!

    apply(mtcars, 2, summary)
    ##              mpg    cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
    ## Min.    10.40000 4.0000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
    ## 1st Qu. 15.42500 4.0000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
    ## Median  19.20000 6.0000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
    ## Mean    20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
    ## 3rd Qu. 22.80000 8.0000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
    ## Max.    33.90000 8.0000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000
    

Wow, that was easy and the result is quite impressive, is it not?

    Or if you want to perform a linear regression for all ten variables separately against mpg and want to get a table with all coefficients – there you go:

    sapply(mtcars, function(x) round(coef(lm(mpg ~ x, data = mtcars)), 3))
    ##             mpg    cyl   disp     hp   drat     wt   qsec     vs     am  gear   carb
    ## (Intercept)   0 37.885 29.600 30.099 -7.525 37.285 -5.114 16.617 17.147 5.623 25.872
    ## x             1 -2.876 -0.041 -0.068  7.678 -5.344  1.412  7.940  7.245 3.923 -2.056
    

    Here we used another higher-order function, sapply, together with an anonymous function. sapply goes through all the columns of a data frame (i.e. elements of a list) and tries to simplify the result (here your get back a nice matrix).

    Often, you might not even have realised when you were using higher-order functions! I can tell you that it is quite a hassle in many programming languages to program a simple function plotter, i.e. a function which plots another function. In R it has already been done for you: you just use the higher-order function curve and give it the function you want to plot as an argument:

    curve(sin(x) + cos(1/2 * x), -10, 10)
    

    I want to give you one last example of another very helpful higher-order function (which not too many people know or use): by. It comes in very handy when you want to apply a function on different attributes split by a factor. So let’s say you want to get a summary of all the attributes of iris split by (!) species – here it comes:

    by(iris[1:4], iris$Species, summary)
    ## iris$Species: setosa
    ##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
    ##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
    ##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
    ##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
    ##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
    ##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
    ##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
    ## --------------------------------------------------------------
    ## iris$Species: versicolor
    ##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width   
    ##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000  
    ##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200  
    ##  Median :5.900   Median :2.800   Median :4.35   Median :1.300  
    ##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326  
    ##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500  
    ##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800  
    ## --------------------------------------------------------------
    ## iris$Species: virginica
    ##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
    ##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
    ##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
    ##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
    ##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
    ##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
    ##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500
    

This was just a very shy look at this huge topic. There are very powerful higher-order functions in R, like lapply, aggregate, replicate (very handy for numerical simulations) and many more. A good overview can be found in the answers to this question: stackoverflow (my answer there is on the rather elusive switch function: switch).
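To give just one more taste, replicate() is a convenient wrapper for repeating an expression, e.g. in simple Monte Carlo simulations (my own example, not from the post):

# simulate the sampling distribution of the mean of 10 exponential draws
set.seed(1)
sims <- replicate(1000, mean(rexp(10)))
sd(sims)  # should be close to the theoretical standard error 1/sqrt(10), about 0.316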

    For some reason people tend to confuse higher-order functions with recursive functions but that is the topic of another post, so stay tuned…

    To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.


    Because it's Friday: CGI you never knew was CGI

    Computer-generated imagery in movies has gotten so good these days, much of the time you don't even realize it's there. You probably never noticed how Michael Cera's physique had been altered, or how Lost in Translation used motion capture technology from the future.

    That's all from the blog team for this week. Have a great weekend, and see you next week!


    Book Memo: “Data Science for Healthcare”

    Methodologies and Applications
    This book seeks to promote the exploitation of data science in healthcare systems. The focus is on advancing the automated analytical methods used to extract new knowledge from data for healthcare applications. To do so, the book draws on several interrelated disciplines, including machine learning, big data analytics, statistics, pattern recognition, computer vision, and Semantic Web technologies, and focuses on their direct application to healthcare. Building on three tutorial-like chapters on data science in healthcare, the following eleven chapters highlight success stories on the application of data science in healthcare, where data science and artificial intelligence technologies have proven to be very promising. This book is primarily intended for data scientists involved in the healthcare or medical sector. By reading this book, they will gain essential insights into the modern data science technologies needed to advance innovation for both healthcare businesses and patients. A basic grasp of data science is recommended in order to fully benefit from this book.


    Introducing repo2docker

    The Binder Project’s repo2docker tool gives data scientists the benefits of containerization technology without needing to learn Docker itself. To make your repository compatible with repo2docker, you only need to add text files that are already present in many repositories. This means that you get the benefits of containerization, a powerful and complex ecosystem, without having to change your workflow.

    repo2docker is a lightweight command-line tool written in Python that takes a path or URL to a git repository and creates a suitable docker image for it. To achieve this it follows the steps that a human would take to do so. The steps are:

    1. Inspect the repository for common “configuration” files (like requirements.txt),
    2. From these well-known files infer the Docker commands to run; and
    3. Build a Docker image.

It has a few more tricks up its sleeve, such as automatically installing RStudio for you when it detects that you are using R. Once the image has been built, a Docker container is created and executed, giving you access to the environment in which the repository author wanted the code to be executed. To achieve this, you need access to two things: repo2docker and a Docker daemon (you do not necessarily have to have Docker installed on your local computer).

    The JupyterHub team just released v0.7 of repo2docker, so we decided to spend a bit of time explaining what it’s all about.

    An example repo2docker workflow. In this case, repo2docker is invoked locally. repo2docker is passed a URL to a git repository (https://github.com/norvig/pytudes). It then clones the repository, discovers configuration files in the repo (in this case, `requirements.txt`), builds a Docker image with this environment installed, and opens a local Jupyter server to explore and run the contents of the repo.

    The guiding principles behind repo2docker

    repo2docker is meant to be as lightweight and common-sense as possible. The driving principles behind repo2docker are as follows:

    1. Leverage pre-existing workflows in data science as much as possible. This means using standard configuration files (like requirements.txt) instead of requiring people to learn new configuration patterns.
    2. The shareable unit is a repository or directory containing human-readable files. Not a single file (like a notebook) nor a binary blob (like a built docker image). This means that humans can inspect and extend other repositories meant for repo2docker, and that they can manually do what repo2docker does automatically. No black box.
3. Be workflow agnostic. repo2docker supports many languages and user interfaces; it can run arbitrary shell scripts that are baked into the image, or it can trigger a script to be run each time a person runs the Docker image.
    4. Be extensible and composable. repo2docker should allow for multiple languages, tools, or workflows to be defined in a single GitHub repository. It should also be relatively easy to extend to support new use-cases.
    5. Enable deterministic outputs. We want repo2docker to make it possible for authors to generate the exact same environment from their repository every time, provided that they follow best-practices in computational methods (like providing specific version numbers for packages). repo2docker can build a specific commit, tag, or branch of a repository, which allows for an image to be deterministically built.

    How can repo2docker be used?

    Over the last 18 months, we have been using repo2docker in production to automatically generate images that run repositories for mybinder.org.
    It is used to build around 1000 unique repositories every week. The core functionality has proven itself and is considered production ready.

    Over the last year, we’ve seen a few major use-cases come out of repo2docker:

    First, it can be used as a part of production systems like BinderHub. BinderHub automatically uses repo2docker to build images that run a user’s environment, and lets them share links that let others interact with the image.

    Second, repo2docker can be used to build an image for use with a JupyterHub. For example, teachers have used repo2docker to convert their GitHub repository with course materials into a runnable Docker image that students access via a shared jupyterhub in the cloud.

    Finally, repo2docker has been used by individuals who wish to build reproducible images from their local work. repo2docker can optionally run a Jupyter server from within the built image, which makes it possible to verify the results of analyses in an environment that was built solely from the configuration files present in the repository.

    What next?

    We think that repo2docker serves as a useful tool for the community and that it is an important part of the large reproducible scientific software stack. It gives data scientists the benefits of containerization technology without needing to learn a new tool like Docker. It achieves this by being a lightweight command-line tool written in Python that automates the creation of the environment in which the authors of a piece of software wanted it to be executed.

We’d love to see the repo2docker community grow, and for more languages, interfaces, use-cases, and workflows to be supported with repo2docker’s build pack system. Let us know what you think!

repo2docker is primarily maintained by the JupyterHub and Binder teams. If you’d like to get involved with the community or want to learn more about the tool, reach out! Check out these links for more information:

Note: some folks might be wondering why we developed repo2docker instead of contributing to a pre-existing containerization tool such as the excellent source2image project. We take the decision to create new open-source tech very seriously, and wrote a blog post about our decision to do so in this case: http://words.yuvi.in/post/why-not-s2i/


    Introducing repo2docker was originally published in Jupyter Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.


    R Packages worth a look

    Filtering Algorithms for the State Space Model on the Stiefel Manifold (SMFilter)
    Provides the filtering algorithms for the state space model on the Stiefel manifold.

    Real Time Monitoring of Asset Markets: Bubbles and Crisis (psymonitor)
    Apply the popular real-time monitoring strategy proposed by Phillips, Shi and Yu (2015a,b;PSY) <doi:10.1111/iere.12132>, <doi:10.1111/iere.121 …

    Cramer-von Mises Goodness-of-Fit Tests (cvmgof)
    It is devoted to Cramer-von Mises goodness-of-fit tests. It implements three statistical methods based on Cramer-von Mises statistics to estimate and t …

    Multiple Random Dot Product Graphs (multiRDPG)
    Fits the Multiple Random Dot Product Graph Model and performs a test for whether two networks come from the same distribution. Both methods are propose …


    Top Stories of 2018: 9 Must-have skills you need to become a Data Scientist, updated; Python eats away at R: Top Software for Analytics, Data Science, Machine Learning

    Also 5 Data Science Projects That Will Get You Hired in 2018; Top 20 Python AI and Machine Learning Open Source Projects; Neural network AI is simple. So... Stop pretending you are a genius.


    A REST API for Principal Component Analysis

    As part of our PCA release, we have released a series of blog posts, including a use case and a demonstration of the BigML Dashboard. In this installment, we shift our focus to implement Principal Component analysis with the BigML REST API. PCA is a powerful data transformation technique and unsupervised Machine Learning method that […]


    NLP Breakthrough Imagenet Moment has arrived

    A comprehensive review of the current state of Natural Language Processing, covering the process from shallow to deep pre-training, what's in an ImageNet, the case for language modelling, and more.


    In case you missed it: November 2018 roundup

    In case you missed them, here are some articles from November of particular interest to R users.

    David Gerard assesses the plausibility of a key plot point in 'Jurassic Park' with simulations in R.

    In-database R is available in Azure SQL Database for private preview. 

    Introducing AzureR, a new suite of R packages for managing Azure resources in R.

    The AzureRMR package provides an interface for Resource Manager.

    Roundup of AI, Machine Learning and Data Science news from November 2018.

    You can now use the AI capabilities of Microsoft Cognitive Services within a container you host.

    A look back at some of the R applications presented at the EARL conference in Seattle.

    Slides and notebooks from my ODSC workshop, AI for Good.

    T-Mobile uses AI models implemented with R to streamline customer service.

    A guide to R packages for importing and working with US Census data.

    Azure Machine Learning Studio, the online drag-and-drop data analysis tool, upgrades its R support.

    And some general interest stories (not necessarily related to R):

    As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here


    Implementing ResNet with MXNET Gluon and Comet.ml for Image Classification

    Whether MXNet is an entirely new framework for you or you have used the MXNet backend while training your Keras models, this tutorial illustrates how to build an image recognition model with an MXNet resnet_v1 model.


    The tradeoff between privacy and convenience

The Gillian Brockell letter (see previous post) can be read as a consumer complaint letter, but also as a love letter to the tech industry.

    Reader Antonio R. who forwarded the column to me via Twitter raised this interesting question: 

    Her conclusion seems to be: more relevant ads is better than no ads at all. What future is waiting for cheated fe/males? A warning "Be careful to your partner" or a reassuring "All is well" to choose in advance among app settings?

    Gillian is someone who totally buys into the tech industry's "big data" pitch - that the more you share, the more you gain. She's writing tags that cue algorithms to send her relevant ads. Presumably, when she was pregnant, she was satisfied with the ads that at the time were selling her relevant products.

She's mad that the algorithm is not all-knowing, personalized and omnipotent. She expects Facebook, Instagram, Amazon, etc. to track her every move and optimize her experience just for her. She's angry when they make mistakes.

And, if one reads between the lines, her proposed solution is for the tech industry to be even creepier: gather even more personal data and be even more personalized. She wants ads, just not the ones she doesn't like.

    This solution is not radical at all. In fact, it is exactly what tech firms have been doing for 10 years. The "theory" is: data make ads more relevant, and if ads are not relevant enough, it is because they do not have enough personal data. In this sense, Gillian's column is a love letter to the tech industry.

    ***

    The overlooked solution is to have less relevant ads or no ads at all.

    In the Charles Duhigg story about Target's pregnancy prediction model (see Numbersense), one of the curious nuggets we learned is that the data scientists deliberately mixed random products in between the pregnancy goods being marketed to the women predicted to be pregnant. The official explanation was to make the brochures appear less creepy.

In the book, I suggested a different explanation for that decision. In a predictive model like that, there are likely to be many times more false positives (i.e. women wrongly predicted to be pregnant and thus sent irrelevant materials) than true positives (i.e. women correctly predicted to be pregnant). I also speculated that many true positives would act like Gillian did - appreciating the pregnancy product ads as relevant rather than creepy. However, I believe that the false positives would complain that the pregnancy product ads are irrelevant, maybe even somewhat offensive.

Mixing in other products lessens the harm of the wrong predictions - but it also softens the impact of the correct predictions. What hangs in the balance is consumer interests versus advertisers' business goals.
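A quick back-of-the-envelope calculation makes the point (the numbers below are invented for illustration and have nothing to do with Target's actual model):

customers   <- 1e6
prevalence  <- 0.02   # share of customers actually pregnant
sensitivity <- 0.80   # P(flagged | pregnant)
fpr         <- 0.05   # P(flagged | not pregnant)

true_pos  <- customers * prevalence * sensitivity   # 16,000 correctly targeted
false_pos <- customers * (1 - prevalence) * fpr     # 49,000 wrongly targeted
false_pos / true_pos                                # roughly 3 false positives per true positive

So even a model with respectable sensitivity and a low false-positive rate sends the pregnancy brochure to about three not-pregnant women for every pregnant one - exactly the audience the random filler products are meant to protect.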


    Why You Shouldn’t be a Data Science Generalist

But it’s hard to avoid becoming a generalist if you don’t know which common problem classes you could specialize in in the first place. That’s why I put together a list of the five problem classes that are often lumped together under the “data science” heading.


    A couple of thoughts regarding the hot hand fallacy fallacy

    For many years we all believed the hot hand was a fallacy. It turns out we were all wrong. Fine. Such reversals happen.

    Anyway, now that we know the score, we can reflect on some of the cognitive biases that led us to stick with the “hot hand fallacy” story for so long.

    Jason Collins writes:

    Apart from the fact that this statistical bias slipped past everyone’s attention for close to thirty years, I [Collins] find this result extraordinarily interesting for another reason. We have a body of research that suggests that even slight cues in the environment can change our actions. Words associated with old people can slow us down. Images of money can make us selfish. And so on. Yet why haven’t these same researchers been asking why a basketball player would not be influenced by their earlier shots – surely a more salient part of the environment than the word “Florida”? The desire to show one bias allowed them to overlook another.

    Also I was thinking a bit more about the hot hand, in particular a flaw in the underlying logic of Gilovich etc (and also me, before Miller and Sanjurjo convinced me about the hot hand): The null model is that each player j has a probability p_j of making a given shot, and that p_j is constant for the player (considering only shots of some particular difficulty level). But where does p_j come from? Obviously players improve with practice, with game experience, with coaching, etc. So p_j isn’t really a constant. But if “p” varies among players, and “p” varies over the time scale of years or months for individual players, why shouldn’t “p” vary over shorter time scales too? In what sense is “constant probability” a sensible null model at all?

    I can see that “constant probability for any given player during a one-year period” is a better model than “p varies wildly from 0.2 to 0.8 for any player during the game.” But that’s a different story. The more I think about the “there is no hot hand” model, the more I don’t like it as any sort of default.
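Here is a minimal R sketch of that point (my own simulation, not from the post; with a sequence this long the finite-sample bias that Miller and Sanjurjo highlight is negligible). Shots that are independent given a slowly drifting p_t look "streaky" when you condition on a run of hits, while a constant-p shooter does not:

set.seed(1)
n <- 1e5

hit_rate_after_streak <- function(shots, k = 3) {
  runs <- stats::filter(shots, rep(1, k), sides = 1)   # rolling sum of the last k shots
  idx  <- which(runs == k)                             # positions ending a run of k hits
  idx  <- idx[idx < length(shots)]
  mean(shots[idx + 1])                                 # hit rate on the next shot
}

constant <- rbinom(n, 1, 0.5)
p_drift  <- 0.5 + 0.1 * sin(2 * pi * (1:n) / 500)      # p_t wanders between 0.4 and 0.6
drifting <- rbinom(n, 1, p_drift)

c(constant = hit_rate_after_streak(constant),
  drifting = hit_rate_after_streak(drifting))
# the constant shooter stays near 0.5 after a streak; the drifting one sits clearly above it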

    In any case, it’s good to revisit our thinking about these theories in light of new arguments and new evidence.

    The post A couple of thoughts regarding the hot hand fallacy fallacy appeared first on Statistical Modeling, Causal Inference, and Social Science.


    CBH Group: Sr Data Engineer [Perth, Australia]

    CBH Group is seeking a Sr Data Engineer in Perth, Australia, to take responsibility for the definition, documentation and completion of technical data analysis and delivery.


    Ages in Congress, from the 1st to the 115th

As I watched Google’s CEO Sundar Pichai field questions from the House Judiciary Committee, it was hard not to feel like there was a big gap between how the internet works and how members of Congress think it works. Many suggested the gap was related to age, so I couldn’t help but wonder how the age distribution has changed over the years.

    You can see the median age shifting older, but I’m not totally sure what to make of it. After all, the population as a whole is getting older too. On the other hand, the internet changed a lot of things in our lives, and the hope is that those forming the policies understand the ins and outs.


    Four short links: 14 December 2018

    Satellite LoRaWAN, Bret Victor, State of AI, and Immutable Documentation

    1. Fleet -- launched satellites as backhaul for LoRaWAN base station traffic.
    2. Computing is Everywhere -- podcast episode with Bret Victor. Lots of interesting history and context to what he's up to at Dynamicland. (via Paul Ford)
    3. AI Index 2018 Report (Stanford) -- think of it as the Mary Meeker report for AI.
    4. Etsy's Experiment with Immutable Documentation -- In trying to overcome the problem of staleness, the crucial observation is that how-docs typically change faster than why-docs do. Therefore the more how-docs are mixed in with why-docs in a doc page, the more likely the page is to go stale. We’ve leveraged this observation by creating an entirely separate system to hold our how-docs.


    Day 14 – little helper print_fs

    (This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

We at STATWORX work a lot with R, and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first there was no plan to make a package, but I soon realised that it would be much easier to share and improve those functions if they were within a package. Up until the 24th of December I will present one function each day from helfRlein. So, on the 14th day of Christmas my true love gave to me…


    What can it do?

This little helper returns the folder structure of a given path. With this, one can, for example, add a nice overview to the documentation of a project or to a git repository. For the sake of automation, this function could run and update parts of a log or news file after a major change.

    How to use it?

    If we take a look at the same example we used for the get_network function on day 5, we get the following:

    print_fs("~/flowchart/", depth = 4)
    
    1  flowchart                            
    2   ¦--create_network.R                 
    3   ¦--getnetwork.R                     
    4   ¦--plots                            
    5   ¦   ¦--example-network-helfRlein.png
    6   ¦   °--improved-network.png         
    7   ¦--R_network_functions              
    8   ¦   ¦--dataprep                     
    9   ¦   ¦   °--foo_01.R                 
    10  ¦   ¦--method                       
    11  ¦   ¦   °--foo_02.R                 
    12  ¦   ¦--script_01.R                  
    13  ¦   °--script_02.R                  
    14  °--README.md 
    

    With depth we can adjust how deep we want to traverse through our folders.
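For anyone curious how such an overview can be produced, here is a rough sketch using the data.tree package (my own illustration of the idea, not the actual helfRlein implementation):

library(data.tree)

my_print_fs <- function(path, depth = 4) {
  files <- list.files(path, recursive = TRUE, include.dirs = TRUE)
  files <- files[lengths(strsplit(files, "/")) <= depth]        # honour the depth limit
  paths <- file.path(basename(normalizePath(path)), files)      # prepend the root folder
  print(as.Node(data.frame(pathString = paths, stringsAsFactors = FALSE)))
}

my_print_fs("~/flowchart/", depth = 4)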

    Overview

    To see all the other functions you can either check out our GitHub or you can read about them here.

    Have a merry advent season!

About the author

Jakob Gepp

Numbers were always my passion, and as a data scientist and statistician at STATWORX I can fulfill my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!

ABOUT US

STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI.

The post Day 14 – little helper print_fs first appeared on STATWORX.

    To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX.



    running plot [and simulated annealing]

    (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Last weekend, I found out a way to get plots to update within a loop in R, after noticing that calling plot() inside the loop never refreshed the display in real time. The suggestion of including a Sys.sleep(0.25) call worked perfectly on a simulated annealing example for determining the most dispersed points in a unit disc.
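A toy version of the idea looks like this (my own sketch, not Xi'an's code): move points around the unit disc, redraw the plot at every iteration, and pause briefly so the graphics device has time to refresh.

set.seed(42)
x <- matrix(runif(40, -1, 1), ncol = 2)
x <- x[rowSums(x^2) <= 1, ]                        # keep the starting points inside the disc
min_dist <- function(pts, p) min(sqrt(rowSums(sweep(pts, 2, p)^2)))

for (i in 1:200) {
  j    <- sample(nrow(x), 1)
  prop <- x[j, ] + rnorm(2, sd = 0.1)              # jitter one point
  if (sum(prop^2) <= 1 &&
      min_dist(x[-j, , drop = FALSE], prop) > min_dist(x[-j, , drop = FALSE], x[j, ])) {
    x[j, ] <- prop                                 # greedy accept; real annealing would also
  }                                                # accept some worse moves
  plot(x, xlim = c(-1, 1), ylim = c(-1, 1), asp = 1, pch = 19,
       xlab = "", ylab = "", main = paste("iteration", i))
  Sys.sleep(0.25)                                  # without the pause the plot may never redraw
}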

    To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.



    If you did not already know

Algebraic Machine Learning
Machine learning algorithms use error function minimization to fit a large set of parameters in a preexisting model. However, error minimization eventually leads to a memorization of the training dataset, losing the ability to generalize to other datasets. To achieve generalization something else is needed, for example a regularization method or stopping the training when error in a validation dataset is minimal. Here we propose a different approach to learning and generalization that is parameter-free, fully discrete and that does not use function minimization. We use the training data to find an algebraic representation with minimal size and maximal freedom, explicitly expressed as a product of irreducible components. This algebraic representation is shown to directly generalize, giving high accuracy in test data, more so the smaller the representation. We prove that the number of generalizing representations can be very large and the algebra only needs to find one. We also derive and test a relationship between compression and error rate. We give results for a simple problem solved step by step, hand-written character recognition, and the Queens Completion problem as an example of unsupervised learning. As an alternative to statistical learning, 'algebraic learning' may offer advantages in combining bottom-up and top-down information, formal concept derivation from data and large-scale parallelization. …

Recommendation Engine of Multilayers (REM)
    Recommender systems have been widely adopted by electronic commerce and entertainment industries for individualized prediction and recommendation, which benefit consumers and improve business intelligence. In this article, we propose an innovative method, namely the recommendation engine of multilayers (REM), for tensor recommender systems. The proposed method utilizes the structure of a tensor response to integrate information from multiple modes, and creates an additional layer of nested latent factors to accommodate between-subjects dependency. One major advantage is that the proposed method is able to address the ‘cold-start’ issue in the absence of information from new customers, new products or new contexts. Specifically, it provides more effective recommendations through sub-group information. To achieve scalable computation, we develop a new algorithm for the proposed method, which incorporates a maximum block improvement strategy into the cyclic blockwise-coordinate-descent algorithm. In theory, we investigate both algorithmic properties for global and local convergence, along with the asymptotic consistency of estimated parameters. Finally, the proposed method is applied in simulations and IRI marketing data with 116 million observations of product sales. Numerical studies demonstrate that the proposed method outperforms existing competitors in the literature. …

Weighted Object k-Means
    Weighted object version of k-means algorithm, robust against outlier data. …


What’s new on arXiv

    Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs

    In practice, it is common to find oneself with far too little text data to train a deep neural network. This ‘Big Data Wall’ represents a challenge for minority language communities on the Internet, organizations, laboratories and companies that compete the GAFAM (Google, Amazon, Facebook, Apple, Microsoft). While most of the research effort in text data augmentation aims on the long-term goal of finding end-to-end learning solutions, which is equivalent to ‘using neural networks to feed neural networks’, this engineering work focuses on the use of practical, robust, scalable and easy-to-implement data augmentation pre-processing techniques similar to those that are successful in computer vision. Several text augmentation techniques have been experimented. Some existing ones have been tested for comparison purposes such as noise injection or the use of regular expressions. Others are modified or improved techniques like lexical replacement. Finally more innovative ones, such as the generation of paraphrases using back-translation or by the transformation of syntactic trees, are based on robust, scalable, and easy-to-use NLP Cloud APIs. All the text augmentation techniques studied, with an amplification factor of only 5, increased the accuracy of the results in a range of 4.3% to 21.6%, with significant statistical fluctuations, on a standardized task of text polarity prediction. Some standard deep neural network architectures were tested: the multilayer perceptron (MLP), the long short-term memory recurrent network (LSTM) and the bidirectional LSTM (biLSTM). Classical XGBoost algorithm has been tested with up to 2.5% improvements.


    GC-LSTM: Graph Convolution Embedded LSTM for Dynamic Link Prediction

    Dynamic link prediction is a research hot in complex networks area, especially for its wide applications in biology, social network, economy and industry. Compared with static link prediction, dynamic one is much more difficult since network structure evolves over time. Currently most researches focus on static link prediction which cannot achieve expected performance in dynamic network. Aiming at low AUC, high Error Rate, add/remove link prediction difficulty, we propose GC-LSTM, a Graph Convolution Network (GC) embedded Long Short Term Memory network (LTSM), for end-to-end dynamic link prediction. To the best of our knowledge, it is the first time that GCN embedded LSTM is put forward for link prediction of dynamic networks. GCN in this new deep model is capable of node structure learning of network snapshot for each time slide, while LSTM is responsible for temporal feature learning for network snapshot. Besides, current dynamic link prediction method can only handle removed links, GC-LSTM can predict both added or removed link at the same time. Extensive experiments are carried out to testify its performance in aspects of prediction accuracy, Error Rate, add/remove link prediction and key link prediction. The results prove that GC-LSTM outperforms current state-of-art method.


    Rethink and Redesign Meta learning

    Recently, Meta-learning has been shown as a promising way to improve the ability of learning from few data for Computer Vision. However, previous Meta-learning approaches exposed below problems: 1) they ignored the importance of attention mechanism for Meta-learner, leading the Meta-learner to be interfered by unimportant information; 2) they ignored the importance of past knowledge which can help the Meta-learner accurately understand the input data and further express them into high representations, and they train the Meta-learner to solve few shot learning task directly on the few original input data instead of on the high representations; 3) they suffer from a problem which we named as task-over-fitting (TOF) problem, which is probably caused by that they are requested to solve few shot learning task based on the original high dimensional input data, and redundant input information leads themselves to be easier to suffer from TOF. In this paper, we rethink the Meta-learning algorithm and propose that the attention mechanism and the past knowledge are crucial for the Meta-learner, and the Meta-learner should well use its past knowledge and express the input data into high representations to solve few shot learning tasks. Moreover, the Meta-learning approach should be free from the TOF problem. Based on these arguments, we redesign the Meta-learning algorithm to solve these three aforementioned problems, and proposed three methods. Extensive experiments demonstrate the effectiveness of our designation and methods with state-of-the-art performances on several few shot learning benchmarks. The source code of our proposed methods will be released soon.


    Anomaly Generation using Generative Adversarial Networks in Host Based Intrusion Detection

    Generative adversarial networks have been able to generate striking results in various domains. This generation capability can be general while the networks gain deep understanding regarding the data distribution. In many domains, this data distribution consists of anomalies and normal data, with the anomalies commonly occurring relatively less, creating datasets that are imbalanced. The capabilities that generative adversarial networks offer can be leveraged to examine these anomalies and help alleviate the challenge that imbalanced datasets propose via creating synthetic anomalies. This anomaly generation can be specifically beneficial in domains that have costly data creation processes as well as inherently imbalanced datasets. One of the domains that fits this description is the host-based intrusion detection domain. In this work, ADFA-LD dataset is chosen as the dataset of interest containing system calls of small foot-print next generation attacks. The data is first converted into images, and then a Cycle-GAN is used to create images of anomalous data from images of normal data. The generated data is combined with the original dataset and is used to train a model to detect anomalies. By doing so, it is shown that the classification results are improved, with the AUC rising from 0.55 to 0.71, and the anomaly detection rate rising from 17.07% to 80.49%. The results are also compared to SMOTE, showing the potential presented by generative adversarial networks in anomaly generation.


    Considering Race a Problem of Transfer Learning

    As biometric applications are fielded to serve large population groups, issues of performance differences between individual sub-groups are becoming increasingly important. In this paper we examine cases where we believe race is one such factor. We look in particular at two forms of problem; facial classification and image synthesis. We take the novel approach of considering race as a boundary for transfer learning in both the task (facial classification) and the domain (synthesis over distinct datasets). We demonstrate a series of techniques to improve transfer learning of facial classification; outperforming similar models trained in the target’s own domain. We conduct a study to evaluate the performance drop of Generative Adversarial Networks trained to conduct image synthesis, in this process, we produce a new annotation for the Celeb-A dataset by race. These networks are trained solely on one race and tested on another – demonstrating the subsets of the CelebA to be distinct domains for this task.


    Strong-Weak Distribution Alignment for Adaptive Object Detection

    We propose an approach for unsupervised adaptation of object detectors from label-rich to label-poor domains which can significantly reduce annotation costs associated with detection. Recently, approaches that align distributions of source and target images using an adversarial loss have been proven effective for adapting object classifiers. However, for object detection, fully matching the entire distributions of source and target images to each other at the global image level may fail, as domains could have distinct scene layouts and different combinations of objects. On the other hand, strong matching of local features such as texture and color makes sense, as it does not change category level semantics. This motivates us to propose a novel approach for detector adaptation based on strong local alignment and weak global alignment. Our key contribution is the weak alignment model, which focuses the adversarial alignment loss on images that are globally similar and puts less emphasis on aligning images that are globally dissimilar. Additionally, we design the strong domain alignment model to only look at local receptive fields of the feature map. We empirically verify the effectiveness of our approach on several detection datasets comprising both large and small domain shifts.


    Can I trust you more? Model-Agnostic Hierarchical Explanations

Interactions such as double negation in sentences and scene interactions in images are common forms of complex dependencies captured by state-of-the-art machine learning models. We propose Mahé, a novel approach to provide Model-agnostic hierarchical explanations of how powerful machine learning models, such as deep neural networks, capture these interactions as either dependent on or free of the context of data instances. Specifically, Mahé provides context-dependent explanations by a novel local interpretation algorithm that effectively captures any-order interactions, and obtains context-free explanations through generalizing context-dependent interactions to explain global behaviors. Experimental results show that Mahé obtains improved local interaction interpretations over state-of-the-art methods and successfully explains interactions that are context-free.


    Kernel Treelets

    A new method for hierarchical clustering is presented. It combines treelets, a particular multiscale decomposition of data, with a projection on a reproducing kernel Hilbert space. The proposed approach, called kernel treelets (KT), effectively substitutes the correlation coefficient matrix used in treelets with a symmetric, positive semi-definite matrix efficiently constructed from a kernel function. Unlike most clustering methods, which require data sets to be numeric, KT can be applied to more general data and yield a multi-resolution sequence of basis on the data directly in feature space. The effectiveness and potential of KT in clustering analysis is illustrated with some examples.


    Linking Artificial Intelligence Principles

    Artificial Intelligence principles define social and ethical considerations to develop future AI. They come from research institutes, government organizations and industries. All versions of AI principles are with different considerations covering different perspectives and making different emphasis. None of them can be considered as complete and can cover the rest AI principle proposals. Here we introduce LAIP, an effort and platform for linking and analyzing different Artificial Intelligence Principles. We want to explicitly establish the common topics and links among AI Principles proposed by different organizations and investigate on their uniqueness. Based on these efforts, for the long-term future of AI, instead of directly adopting any of the AI principles, we argue for the necessity of incorporating various AI Principles into a comprehensive framework and focusing on how they can interact and complete each other.


    Spatial-Temporal Subset-based Digital Image Correlation: A General Framework

    A comprehensive and systematic framework for easily extending and implementing the spatial-temporal subset-based digital image correlation (DIC) algorithm is presented. The framework decouples the three main factors (shape function, correlation criterion, and optimization algorithm) in DIC, and represents different algorithms in a uniform form. One can freely choose and combine the three factors to meet his own need, or freely add more parameters to extract analytic results. Subpixel translation and a simulated image series with different velocity characters are analyzed using different algorithms based on the proposed framework. And an application of mitigating air disturbance due to heat haze using spatial-temporal DIC (ST-DIC) is demonstrated, proving the applicability of the framework.


    Distributed Anomaly Detection using Autoencoder Neural Networks in WSN for IoT

    Wireless sensor networks (WSN) are fundamental to the Internet of Things (IoT) by bridging the gap between the physical and the cyber worlds. Anomaly detection is a critical task in this context as it is responsible for identifying various events of interests such as equipment faults and undiscovered phenomena. However, this task is challenging because of the elusive nature of anomalies and the volatility of the ambient environments. In a resource-scarce setting like WSN, this challenge is further elevated and weakens the suitability of many existing solutions. In this paper, for the first time, we introduce autoencoder neural networks into WSN to solve the anomaly detection problem. We design a two-part algorithm that resides on sensors and the IoT cloud respectively, such that (i) anomalies can be detected at sensors in a fully distributed manner without the need for communicating with any other sensors or the cloud, and (ii) the relatively more computation-intensive learning task can be handled by the cloud with a much lower (and configurable) frequency. In addition to the minimal communication overhead, the computational load on sensors is also very low (of polynomial complexity) and readily affordable by most COTS sensors. Using a real WSN indoor testbed and sensor data collected over 4 consecutive months, we demonstrate via experiments that our proposed autoencoder-based anomaly detection mechanism achieves high detection accuracy and low false alarm rate. It is also able to adapt to unforeseeable and new changes in a non-stationary environment, thanks to the unsupervised learning feature of our chosen autoencoder neural networks.


    STEP : A Distributed Multi-threading Framework Towards Efficient Data Analytics

    Various general-purpose distributed systems have been proposed to cope with high-diversity applications in the pipeline of Big Data analytics. Most of them provide simple yet effective primitives to simplify distributed programming. While the rigid primitives offer great ease of use to savvy programmers, they probably compromise efficiency in performance and flexibility in data representation and programming specifications, which are critical properties in real systems. In this paper, we discuss the limitations of coarse-grained primitives and aim to provide an alternative for users to have flexible control over distributed programs and operate globally shared data more efficiently. We develop STEP, a novel distributed framework based on in-memory key-value store. The key idea of STEP is to adapt multi-threading in a single machine to a distributed environment. STEP enables users to take fine-grained control over distributed threads and apply task-specific optimizations in a flexible manner. The underlying key-value store serves as distributed shared memory to keep globally shared data. To ensure ease-of-use, STEP offers plentiful effective interfaces in terms of distributed shared data manipulation, cluster management, distributed thread management and synchronization. We conduct extensive experimental studies to evaluate the performance of STEP using real data sets. The results show that STEP outperforms the state-of-the-art general-purpose distributed systems as well as a specialized ML platform in many real applications.


    Real-Time Anomaly Detection With HMOF Feature

    Anomaly detection is a challenging problem in intelligent video surveillance. Most existing methods are computation consuming, which cannot satisfy the real-time requirement. In this paper, we propose a real-time anomaly detection framework with low computational complexity and high efficiency. A new feature, named Histogram of Magnitude Optical Flow (HMOF), is proposed to capture the motion of video patches. Compared with existing feature descriptors, HMOF is more sensitive to motion magnitude and more efficient to distinguish anomaly information. The HMOF features are computed for foreground patches, and are reconstructed by the auto-encoder for better clustering. Then, we use Gaussian Mixture Model (GMM) Classifiers to distinguish anomalies from normal activities in videos. Experimental results show that our framework outperforms state-of-the-art methods, and can reliably detect anomalies in real-time.


    Causal inference, social networks, and chain graphs

    Traditionally, statistical and causal inference on human subjects relies on the assumption that individuals are independently affected by treatments or exposures. However, recently there has been increasing interest in settings, such as social networks, where treatments may spill over from the treated individual to his or her social contacts and outcomes may be contagious. Existing models proposed for causal inference using observational data from networks have two major shortcomings. First, they often require a level of granularity in the data that is not often practically infeasible to collect, and second, the models are generally high-dimensional and often too big to fit to the available data. In this paper we propose and justify a parsimonious parameterization for social network data with interference and contagion. Our parameterization corresponds to a particular family of graphical models known as chain graphs. We demonstrate that, in some settings, chain graph models approximate the observed marginal distribution, which is missing most of the time points from the full data. We illustrate the use of chain graphs for causal inference about collective decision making in social networks using data from U.S. Supreme Court decisions between 1994 and 2004.


    Fission: A Probably Fast, Scalable, and Secure Permissionless Blockchain

    We present Fission, a new permissionless blockchain that achieves scalability in both terms of system throughput and transaction confirmation time, while at the same time, retaining blockchain’s core values of equality and decentralization. Fission overcomes the system throughput bottleneck by employing a novel Eager-Lazy pipeling model that achieves very high system throughputs via block pipelining, an adaptive partitioning mechanism that auto-scales to transaction volumes, and a provably secure energy-efficient consensus protocol to ensure security and robustness. Fission applies a hybrid network which consists of a relay network, and a peer-to-peer network. The goal of the relay network is to minimize the transaction confirmation time by minimizing the information propagation latency. To optimize the performance on the relay network in the presence of churn, dynamic network topologies, and network heterogeneity, we propose an ultra-fast game-theoretic relay selection algorithm that achieves near-optimal performance in a fully distributed manner. Fission’s peer-to-peer network complements the relay network and provides a very high data availability via enabling users to contribute their storage and bandwidth for information dissemination (with incentive). We propose a distributed online data retrieval strategy that optimally offloads the relay network without degrading the system performance. By re-innovating all the core elements of the blockchain technology – computation, networking, and storage – in a holistic manner, Fission aims to achieve the best balance among scalability, security and decentralization.


    Effective Feature Learning with Unsupervised Learning for Improving the Predictive Models in Massive Open Online Courses

    The effectiveness of learning in massive open online courses (MOOCs) can be significantly enhanced by introducing personalized intervention schemes which rely on building predictive models of student learning behaviors such as some engagement or performance indicators. A major challenge that has to be addressed when building such models is to design handcrafted features that are effective for the prediction task at hand. In this paper, we make the first attempt to solve the feature learning problem by taking the unsupervised learning approach to learn a compact representation of the raw features with a large degree of redundancy. Specifically, in order to capture the underlying learning patterns in the content domain and the temporal nature of the clickstream data, we train a modified auto-encoder (AE) combined with the long short-term memory (LSTM) network to obtain a fixed-length embedding for each input sequence. When compared with the original features, the new features that correspond to the embedding obtained by the modified LSTM-AE are not only more parsimonious but also more discriminative for our prediction task. Using simple supervised learning models, the learned features can improve the prediction accuracy by up to 17% compared with the supervised neural networks and reduce overfitting to the dominant low-performing group of students, specifically in the task of predicting students’ performance. Our approach is generic in the sense that it is not restricted to a specific supervised learning model nor a specific prediction task for MOOC learning analytics.


    Recent Advances in Autoencoder-Based Representation Learning

    Learning useful representations with little or no supervision is a key challenge in artificial intelligence. We provide an in-depth review of recent advances in representation learning with a focus on autoencoder-based models. To organize these results we make use of meta-priors believed useful for downstream tasks, such as disentanglement and hierarchical organization of features. In particular, we uncover three main mechanisms to enforce such properties, namely (i) regularizing the (approximate or aggregate) posterior distribution, (ii) factorizing the encoding and decoding distribution, or (iii) introducing a structured prior distribution. While there are some promising results, implicit or explicit supervision remains a key enabler and all current methods use strong inductive biases and modeling assumptions. Finally, we provide an analysis of autoencoder-based representation learning through the lens of rate-distortion theory and identify a clear tradeoff between the amount of prior knowledge available about the downstream tasks, and how useful the representation is for this task.


    Book Memo: “Applied Compositional Data Analysis”

    With Worked Examples in R
    This book presents the statistical analysis of compositional data using the log-ratio approach. It includes a wide range of classical and robust statistical methods adapted for compositional data analysis, such as supervised and unsupervised methods like PCA, correlation analysis, classification and regression. In addition, it considers special data structures like high-dimensional compositions and compositional tables. The methodology introduced is also frequently compared to methods which ignore the specific nature of compositional data. It focuses on practical aspects of compositional data analysis rather than on detailed theoretical derivations, thus issues like graphical visualization and preprocessing (treatment of missing values, zeros, outliers and similar artifacts) form an important part of the book. Since it is primarily intended for researchers and students from applied fields like geochemistry, chemometrics, biology and natural sciences, economics, and social sciences, all the proposed methods are accompanied by worked-out examples in R using the package robCompositions.


    My book ‘Deep Learning from first principles:Second Edition’ now on Amazon

    (This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

The second edition of my book ‘Deep Learning from first principles: Second Edition - In vectorized Python, R and Octave’ is now available on Amazon, in both paperback ($14.99) and kindle ($9.99/Rs449/-) versions. Since this book is almost 70% code, all functions and code snippets have been formatted to use the fixed-width font ‘Lucida Console’. In addition, line numbers have been added to all code snippets. This makes the code more organized and much more readable. I have also fixed typos in the book.

    The book includes the following chapters

    Table of Contents
    Preface 4
    Introduction 6
    1. Logistic Regression as a Neural Network 8
    2. Implementing a simple Neural Network 23
    3. Building a L- Layer Deep Learning Network 48
    4. Deep Learning network with the Softmax 85
    5. MNIST classification with Softmax 103
    6. Initialization, regularization in Deep Learning 121
    7. Gradient Descent Optimization techniques 167
    8. Gradient Check in Deep Learning 197
    1. Appendix A 214
    2. Appendix 1 – Logistic Regression as a Neural Network 220
    3. Appendix 2 - Implementing a simple Neural Network 227
    4. Appendix 3 - Building a L- Layer Deep Learning Network 240
    5. Appendix 4 - Deep Learning network with the Softmax 259
    6. Appendix 5 - MNIST classification with Softmax 269
    7. Appendix 6 - Initialization, regularization in Deep Learning 302
    8. Appendix 7 - Gradient Descent Optimization techniques 344
    9. Appendix 8 – Gradient Check 405
    References 475

    Also see
    1. My book ‘Practical Machine Learning in R and Python: Second edition’ on Amazon
    2. The 3rd paperback & kindle editions of my books on Cricket, now on Amazon
    3. De-blurring revisited with Wiener filter using OpenCV
    4. TWS-4: Gossip protocol: Epidemics and rumors to the rescue
    5. A Cloud medley with IBM Bluemix, Cloudant DB and Node.js
    6. Practical Machine Learning with R and Python – Part 6
    7. GooglyPlus: yorkr analyzes IPL players, teams, matches with plots and tables
    8. Fun simulation of a Chain in Android

    To see posts click Index of Posts

    To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


    Distilled News

    Most AI Explainability Is Snake Oil. Ours Isn’t And Here’s Why.

    Advanced machine learning (ML) is a subset of AI that uses more data and sophisticated math to make better predictions and decisions. Banks and lenders could make a lot more money using ML-powered credit scoring instead of legacy methods in use today. But adoption of ML has been held back by the technology’s ‘black-box’ nature: you can see the model’s results but not how it came to those results. You can’t run a credit model safely or accurately if you can’t explain its decisions, especially for a regulated use case such as credit underwriting.


    How to deploy a predictive service to Kubernetes with R and the AzureContainers package

    It’s easy to create a function in R, but what if you want to call that function from a different application, with the scale to support a large number of simultaneous requests? This article shows how you can deploy an R fitted model as a Plumber web service in Kubernetes, using Azure Container Registry (ACR) and Azure Kubernetes Service (AKS). We use the AzureContainers package to create the necessary resources and deploy the service.
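To give a flavour of the kind of scoring service being containerized, here is a minimal Plumber sketch (the file api.R and the model.rds it reads are hypothetical placeholders; see the article itself for the actual AzureContainers, ACR and AKS steps):

# api.R -------------------------------------------------------------------
model <- readRDS("model.rds")          # fitted model shipped alongside the API code

#* Score new observations sent as a JSON body
#* @post /predict
function(req) {
  newdata <- jsonlite::fromJSON(req$postBody)
  as.list(predict(model, newdata = as.data.frame(newdata)))
}

# run locally before building the container --------------------------------
# plumber::plumb("api.R")$run(host = "0.0.0.0", port = 8000)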


    Visualizing Hurricane Data with Shiny

Around the time that I was selecting a topic for this project, my parents and my hometown found themselves in the path of a Category 1 hurricane. Thankfully, everyone was OK, and there was only minor damage to their property. But this event made me think about how long it had been since the last time my hometown had been in the path of a Category 1 hurricane. I also wanted to study trends in hurricane intensity over time to see if they correspond to the popular impression that storms have grown stronger over the past few years.


    Network Centrality in R: New ways of measuring Centrality

    This is the third post of a series on the concept of ‘network centrality’ with applications in R and the package netrankr. The last part introduced the concept of neighborhood-inclusion and its implications for centrality. In this post, we extend the concept to a broader class of dominance relations by deconstructing indices into a series of building blocks and introduce new ways of evaluating centrality.
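As a generic illustration of why this matters (plain igraph here, not the netrankr API the post is about): different centrality indices can rank the same nodes quite differently, which is what motivates deconstructing them into building blocks.

library(igraph)

set.seed(7)
g <- sample_smallworld(1, 100, 5, 0.05)   # a small-world network to play with
scores <- data.frame(degree      = degree(g),
                     betweenness = betweenness(g),
                     closeness   = closeness(g))
cor(scores, method = "kendall")           # rank agreement between the three indices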


    Named Entity Recognition (NER) With Keras and Tensorflow – Meeting Industry’s Requirement by Applying State-of-the-art Deep Learning Methods

A few years ago, when I was working as a software engineering intern at a startup, I saw a new feature in a job posting web-app. The app was able to recognize and parse important information from the resumes, like email addresses, phone numbers, degree titles, etc. I started discussing possible approaches with our team and we decided to build a rule-based parser in Python to just parse different sections of a resume. After spending some time developing the parser, we realized that the answer might not be a rule-based tool. We started googling how it’s done and we came across the term Natural Language Processing (NLP) and, more specifically, Named Entity Recognition (NER) associated with Machine Learning.


    The Importance of Being Recurrent for Modeling Hierarchical Structure

    Recurrent Neural Networks (RNNs), such as Long Short-Term Memory networks (LSTMs), currently have performance limitations, while newer methods such as Fully Attentional Networks (FANs) show potential for replacing LSTMs without those same limitations. So the authors set out to compare the two approaches using standardized methods and found that LSTMs universally surpass FANs in prediction accuracy when applied to the hierarchy structure of language.


    Supervised Machine Learning: Classification

In Supervised Learning, algorithms learn from labeled data. After understanding the data, the algorithm determines which label should be given to new data by associating the patterns it has learned with the unlabeled new data.


    Customer Analysis with Network Science

    Over the past decade or two, Americans have continued to prefer payment methods that are traceable, providing retailers and vendors with a rich source of data on their customers. This data is used by data scientists to help businesses make more informed decisions with respect to inventory, marketing, and supply chain, to name a few. There are several tools and techniques for performing customer segmentation, and network analysis can be a powerful one.


    AWS Architecture For Your Machine Learning Solutions

    One of the regular challenges I face while designing enterprise-grade solutions for our client companies is the lack of online references showing real-world architectural use cases. You will find tons of tutorials on how to get started with individual technologies, and these are great when your focus is limited to that particular framework or service. But in order to evaluate the broad spectrum of what is available out there and to predetermine the implications of bundling several of these together, you either have to hunt down someone who has been down that road before, or venture into independent experimentation yourself. That’s why I decided to start a series sharing some of my own insights gathered while designing and developing technical solutions for multiple Fortune 200 companies and emerging startups. And hopefully, today’s use case will help you plan the AWS architecture for your Machine Learning solutions.


    AI: the silver bullet to stop Technical Debt from sucking you dry

    It’s Friday evening in the Bahamas. You’re relaxing under a striped red umbrella with a succulent glass of wine and your favorite book - it’s a great read and you love the way the ocean breeze moves the pages like leaves on a tree. As the sun descends your eyes follow, your consciousness drifting with the waves, closer to the horizon, closer to a soft, lulling sleep, closer to a perfect evening in a perfect world.


    5 Machine Learning Resolutions for 2019

    More organizations are using machine learning for competitive reasons, but their results are mixed. It turns out there are better — and worse — ways of approaching it. If you want to improve the outcome of your efforts in 2019, consider these points.
    • Start with an appropriate scope
    • Approach machine learning holistically
    • Make the connection between data and machine learning
    • Don’t expect too much ‘out of the box’
    • Don’t forget infrastructural requirements


    XGBoost is not black magic

    Nowadays it is quite easy to get decent results in data science tasks: it’s sufficient to have a general understanding of the process, a basic knowledge of Python and ten minutes of your time to instantiate XGBoost and fit the model. OK, if it’s your first time then you will probably spend a couple of minutes collecting the required packages via pip, but that’s it. The only problem with this approach is that it works pretty well: a couple of years ago I placed in the top 5 of a university competition just by feeding the dataset to an XGBoost with some basic feature engineering, outperforming groups presenting very complex architectures and data pipelines. One of the coolest characteristics of XGBoost is how it deals with missing values: deciding for each sample which is the best way to impute them. This feature has been super useful for a lot of projects and datasets I have run into during the last months; to be more deserving of the Data Scientist title written under my name, I decided to dig a little deeper, taking a couple of hours to read the original paper and trying to understand what XGBoost is actually about and how it is able to deal with missing values in the seemingly magical way it does.
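    As a small aside (sketched in R rather than the Python mentioned above, and not taken from the original post), the missing-value handling can be seen directly: NAs are passed straight to the booster, with no imputation step:

    # XGBoost accepts NAs as-is and learns a default split direction for them.
    library(xgboost)

    X <- as.matrix(mtcars[, c("wt", "hp", "disp")])
    X[sample(length(X), 20)] <- NA      # knock out some values
    y <- mtcars$am

    bst <- xgboost(data = X, label = y, missing = NA,
                   nrounds = 20, objective = "binary:logistic", verbose = 0)
    head(predict(bst, X))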


    Which Model and How Much Data?

    Building deep learning applications in the real world is a never-ending process of selecting and refining the right elements of a specific solution. Among those elements, the selection of the correct model and the right structure of the training dataset are, arguably, the two most important decisions that data scientists need to make when architecting deep learning solutions. How do we decide which deep learning model to use for a specific problem? How do we know whether we are using the correct training dataset or whether we should gather more data? Those questions are the common denominator across all stages of the lifecycle of a deep learning application. Even though there is no magic answer to those questions, there are several ideas that could guide your decision-making process. Let’s start with the selection of the correct deep learning model.


    Pdftools 2.0: powerful pdf text extraction tools

    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)


    A new version of pdftools has been released to CRAN. Go get it while it’s hot:

    install.packages("pdftools")
    

    This version has two major improvements: low-level text extraction and better handling of text encodings.

    About PDF textboxes

    A pdf document may seem to contain paragraphs or tables in a viewer, but this is not actually true. PDF is a printing format: a page consists of a series of unrelated lines, bitmaps, and textboxes with a given size, position and content. Hence a table in a pdf file is really just a large unordered set of lines and words that are nicely visually positioned. This makes sense for printing, but makes extracting text or data from a pdf file extremely difficult.

    Because the pdf format has little semantic structure, the pdf_text() function in pdftools has to render the PDF to a text canvas, in order to create the sentences or paragraphs. It does so pretty well, but some users have asked for something more low level.

    Unfortunately this was not trivial because it required some work in the underlying poppler library. One year later, and this functionality is now finally available in the upcoming poppler version 0.73. The pdftools CRAN binary packages for Windows and MacOS already contain a suitable libpoppler, however Linux users probably have to wait for the latest version of poppler to become available in their system package manager (or compile from source).

    Low-level text extraction

    We use an example pdf file from the rOpenSci tabulizer package. This file contains a few standard datasets which have been printed as a pdf table. First let’s try the pdf_text() function, which returns a character vector of length equal to the number of pages in the file.

    library(pdftools)
    pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"
    txt <- pdf_text(pdf_file)
    cat(txt[1])
    
                        mpg  cyl    disp  hp drat    wt  qsec vs am gear carb
    Mazda RX4           21.0   6   160.0 110 3.90 2.620 16.46  0  1    4    4
    Mazda RX4 Wag       21.0   6   160.0 110 3.90 2.875 17.02  0  1    4    4
    Datsun 710          22.8   4   108.0  93 3.85 2.320 18.61  1  1    4    1
    Hornet 4 Drive      21.4   6   258.0 110 3.08 3.215 19.44  1  0    3    1
    Hornet Sportabout   18.7   8   360.0 175 3.15 3.440 17.02  0  0    3    2
    Valiant             18.1   6   225.0 105 2.76 3.460 20.22  1  0    3    1
    Duster 360          14.3   8   360.0 245 3.21 3.570 15.84  0  0    3    4
    Merc 240D           24.4   4   146.7  62 3.69 3.190 20.00  1  0    4    2
    ...
    

    Hence pdf_text() converts all text on a page to a large string, which works pretty well. However, if you want to parse this text into a data frame (using e.g. read.table) you run into a problem: the first column contains spaces within its values, so we can’t use whitespace as the column delimiter (as is the default in read.table).

    Hence to write a proper pdf table extractor, we have to infer the columns from the physical positions of the textboxes, rather than rely on delimiting characters. The new pdf_data() function provides exactly this. It returns a data frame with all textboxes on a page, including their width, height, and (x,y) position:

    # All textboxes on page 1
    pdf_data(pdf_file)[[1]]
    
    # A tibble: 430 x 6
       width height     x     y space text  
       <int>  <int> <int> <int> <lgl> <chr> 
     1    29      8   154   139 TRUE  Mazda 
     2    19      8   187   139 FALSE RX4   
     3    29      8   154   151 TRUE  Mazda 
     4    19      8   187   151 TRUE  RX4   
     5    19      8   210   151 FALSE Wag   
     6    31      8   154   163 TRUE  Datsun
     7    14      8   189   163 FALSE 710   
     8    30      8   154   176 TRUE  Hornet
     9     4      8   188   176 TRUE  4     
    10    23      8   196   176 FALSE Drive 
    

    Converting this pdf data into the original data frame is left as an exercise for the reader 🙂
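    For readers who want a head start on that exercise, here is one possible (deliberately partial) approach, not taken from the post: group the textboxes into lines by their y position, order the words by x, and paste them back together; splitting the lines into columns by their x ranges is then the remaining work.

    # Re-assemble lines from textbox positions (reusing pdf_file from above).
    library(dplyr)

    boxes <- pdf_data(pdf_file)[[1]]

    lines <- boxes %>%
      arrange(y, x) %>%
      group_by(y) %>%
      summarise(text = paste(text, collapse = " "))

    head(lines$text)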

    Encoding enhancements

    Apart from the new pdf_data() function, this release also fixes a few smaller problems related to text encoding, both in the pdftools package and in the underlying poppler library. The main issue was a bug related to mixing UTF-16BE and UTF-16LE, which is not something you ever want to have to worry about.

    For most well-behaved pdf files there was no problem, but some files using rare encodings could yield an “Embedded NUL in string” error for metadata, or garbled author or title fields. If you encountered any of these problems in the past, please update your pdftools and try again!

    Other rOpenSci PDF packages

    Besides pdftools we have two other packages that may be helpful to extract data from PDF files:

    • The tesseract package provides R bindings to the Google Tesseract OCR C++ library. This allows for detecting text from scanned images.
    • The tabulizer package provides R bindings to the Tabula Java library, which can also be used to extract tables from PDF documents. Note that this requires a Java installation.

    Using rOpenSci packages? Tell us about your use case and how you make use of our software!


    Easy CI/CD of GPU applications on Google Cloud including bare-metal using Gitlab and Kubernetes

    (This article was first published on Angel Sevilla Camins' Blog, and kindly contributed to R-bloggers)

    Summary

    Are you a data scientist who only wants to focus on modelling and coding, and not on setting up a GPU cluster? Then this blog post might be interesting for you. We developed an automated pipeline using GitLab and Kubernetes that is able to run code in two GPU environments, GCP and bare metal; there is no need to worry about drivers or about Kubernetes cluster creation and deletion. The only thing you need to do is push your code and it runs on a GPU!
    Source code for both the custom Docker images and the Kubernetes objects definitions can be found here and here respectively.
    See here the complete blog post.


    December 13, 2018

    Yet another visualization of the Bayesian Beta-Binomial model

    The Beta-Binomial model is the “hello world” of Bayesian statistics. That is, it’s the first model you get to run, often before you even know what you are doing. There are many reasons for this:

    • It only has one parameter, the underlying proportion of success, so it’s easy to visualize and reason about.
    • It’s easy to come up with a scenario where it can be used, for example: “What is the proportion of patients that will be cured by this drug?”
    • The model can be computed analytically (no need for any messy MCMC).
    • It’s relatively easy to come up with an informative prior for the underlying proportion.
    • Most importantly: It’s fun to see some results before diving into the theory! 😁

    That’s why I also introduced the Beta-Binomial model as the first model in my DataCamp course Fundamentals of Bayesian Data Analysis in R and quite a lot of people have asked me for the code I used to visualize the Beta-Binomial. Scroll to the bottom of this post if that’s what you want, otherwise, here is how I visualized the Beta-Binomial in my course given two successes and four failures:

    The function that produces these plots is called prop_model (prop as in proportion) and takes a vector of TRUEs and FALSEs representing successes and failures. The visualization is created using the excellent ggridges package (previously called ggjoy). Here’s how you would use prop_model to produce the last plot in the animation above:

    data <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
    prop_model(data)
    

    The result is, I think, a quite nice visualization of how the model’s knowledge about the parameter changes as data arrives. At n=0 the model doesn’t know anything and — as the default prior states that it’s equally likely the proportion of success is anything from 0.0 to 1.0 — the result is a big, blue, and uniform square. As more data arrives the probability distribution becomes more concentrated, with the final posterior distribution at n=6.
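    Because the Beta prior is conjugate to the Binomial likelihood, that final n=6 posterior can also be written down analytically: a Beta(1, 1) prior updated with 2 successes and 4 failures gives a Beta(1 + 2, 1 + 4) = Beta(3, 5) distribution. A quick check (not part of the original post):

    # The analytic posterior after 2 successes and 4 failures under a Beta(1,1) prior
    curve(dbeta(x, 1 + 2, 1 + 4), from = 0, to = 1,
          xlab = "Underlying proportion of success", ylab = "Density")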

    Some added features of prop_model are that it plots larger datasets somewhat gracefully and that it returns a random sample from the posterior that can be further explored. For example:

    big_data <- sample(c(TRUE, FALSE), prob = c(0.75, 0.25),
                       size = 100, replace = TRUE)
    posterior <- prop_model(big_data)
    

    quantile(posterior, c(0.025, 0.5, 0.975))
    
    ##  2.5%   50% 97.5% 
    ##  0.68  0.77  0.84 
    

    So here we calculated that the underlying proportion of success is most likely 0.77 with a 95% CI of [0.68, 0.84] (which nicely includes the correct value of 0.75 which we used to simulate big_data).

    To be clear, prop_model is not intended as anything serious, it’s just meant as a nice way of exploring the Beta-Binomial model when learning Bayesian statistics, maybe as part of a workshop exercise.

    The prop_model function

    # This function takes a number of successes and failures coded as a TRUE/FALSE
    # or 0/1 vector. This should be given as the data argument.
    # The result is a visualization of how a Beta-Binomial
    # model gradually learns the underlying proportion of successes 
    # using this data. The function also returns a sample from the
    # posterior distribution that can be further manipulated and inspected.
    # The default prior is a Beta(1,1) distribution, but this can be set using the
    # prior_prop argument.
    
    # Make sure the packages tidyverse and ggridges are installed, otherwise run:
    # install.packages(c("tidyverse", "ggridges"))
    
    # Example usage:
    # data <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)
    # prop_model(data)
    prop_model <- function(data = c(), prior_prop = c(1, 1), n_draws = 10000) {
      library(tidyverse)
      data <- as.logical(data)
      # data_indices decides what densities to plot between the prior and the posterior
      # For 20 datapoints and less we're plotting all of them.
      data_indices <- round(seq(0, length(data), length.out = min(length(data) + 1, 20)))
    
      # dens_curves will be a data frame with the x & y coordinates for the 
      # densities to plot, where x = proportion_success and y = probability
      proportion_success <- c(0, seq(0, 1, length.out = 100), 1)
      dens_curves <- map_dfr(data_indices, function(i) {
        value <- ifelse(i == 0, "Prior", ifelse(data[i], "Success", "Failure"))
        label <- paste0("n=", i)
        probability <- dbeta(proportion_success,
                             prior_prop[1] + sum(data[seq_len(i)]),
                             prior_prop[2] + sum(!data[seq_len(i)]))
        probability <- probability / max(probability)
        data_frame(value, label, proportion_success, probability)
      })
      # Turning label and value into factors with the right ordering for the plot
      dens_curves$label <- fct_rev(factor(dens_curves$label, levels =  paste0("n=", data_indices )))
      dens_curves$value <- factor(dens_curves$value, levels = c("Prior", "Success", "Failure"))
    
      p <- ggplot(dens_curves, aes(x = proportion_success, y = label,
                                   height = probability, fill = value)) +
        ggridges::geom_density_ridges(stat="identity", color = "white", alpha = 0.8,
                                      panel_scaling = TRUE, size = 1) +
        scale_y_discrete("", expand = c(0.01, 0)) +
        scale_x_continuous("Underlying proportion of success") +
        scale_fill_manual(values = hcl(120 * 2:0 + 15, 100, 65), name = "", drop = FALSE,
                          labels =  c("Prior   ", "Success   ", "Failure   ")) +
        ggtitle(paste0(
          "Binomial model - Data: ", sum(data),  " successes, " , sum(!data), " failures")) +
        theme_light() +
        theme(legend.position = "top")
      print(p)
    
      # Returning a sample from the posterior distribution that can be further 
      # manipulated and inspected
      posterior_sample <- rbeta(n_draws, prior_prop[1] + sum(data), prior_prop[2] + sum(!data))
      invisible(posterior_sample)
    }
    


    Document worth reading: “AI Reasoning Systems: PAC and Applied Methods”

    Learning and logic are distinct and remarkable approaches to prediction. Machine learning has experienced a surge in popularity because it is robust to noise and achieves high performance; however, ML experiences many issues with knowledge transfer and extrapolation. In contrast, logic is easily interpreted, and logical rules are easy to chain and transfer between systems; however, inductive logic is brittle to noise. We then explore the premise of combining learning with inductive logic into AI Reasoning Systems. Specifically, we summarize findings from PAC learning (conceptual graphs, robust logics, knowledge infusion) and deep learning (DSRL, $\partial$ILP, DeepLogic) by reproducing proofs of tractability, presenting algorithms in pseudocode, highlighting results, and synthesizing between fields. We conclude with suggestions for integrated models by combining the modules listed above and with a list of unsolved (likely intractable) problems. AI Reasoning Systems: PAC and Applied Methods


    Reusable Pipelines in R

    Pipelines in R are popular, the most popular one being magrittr as used by dplyr.

    This note will discuss the advanced re-usable piping systems: rquery/rqdatatable operator trees and wrapr function object pipelines. In each case we have a set of objects designed to extract extra power from the wrapr dot-arrow pipe %.>%.

    Piping

    Piping is not much more than having a system that lets one treat “x %.>% f(.)” as a near synonym for “f(x)”. For the wrapr dot arrow pipe the semantics are intentionally closer to (x %.>% f(.)) ~ {. <- x; f(.)}.

    The pipe notation may be longer, but it avoids nesting and reversed right to left reading for many-stage operations (such as “x %.>% f1(.) %.>% f2(.) %.>% f3(.)” versus “f3(f2(f1(x)))”).
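    A tiny check of this equivalence (not in the original note):

    library("wrapr")

    4 %.>% sqrt(.)               # pipe notation
    ## [1] 2
    sqrt(4)                      # direct notation
    ## [1] 2
    4 %.>% sqrt(.) %.>% exp(.)   # multi-stage, read left to right: exp(sqrt(4))
    ## [1] 7.389056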

    In addition to allowing users to write operations in this notation, most piping systems allow users to save pipelines for later re-use (though some others have issues serializing or saving such pipelines due to entanglement with the defining environment).

    wrapr and rquery/rqdatatable supply a number of piping tools that are re-usable, serializable, and very powerful (via R S3 and S4 dispatch features). One of the most compelling features is “function objects”, which means objects can be treated like functions (applied to other objects by pipelines). We will discuss some of these features in the context of rquery/rqdatatable and wrapr.

    rquery/rqdatatable

    For quite a while the rquery and rqdatatable packages have supplied a sequence of operators abstraction called an “operator tree” or “operator pipeline”.

    These pipelines are (deliberately) fairly strict:

    • They must start with a table description or definition.
    • Each step must be a table to table transform meeting certain column pre-conditions.
    • Each step must advertise what columns it makes available or produces, for later condition checking.

    For a guiding example suppose we want to row-subset some data, get per-group means, and then sort the data by those means.

    # our example data
    d <- data.frame(
      group = c("a", "a", "b", "b"),
      value = c(  1,  2,   2,  -10),
      stringsAsFactors = FALSE
    )
    
    # load our package
    library("rqdatatable")
    ## Loading required package: rquery
    # build an operator tree
    threshold <- 0.0
    ops <-
      # define the data format
      local_td(d) %.>%   
      # restrict to rows with value >= threshold
      select_rows_nse(.,
                      value >= threshold) %.>%
      # compute per-group aggegations
      project_nse(.,
                  groupby = "group",
                  mean_value = mean(value)) %.>%
      # sort rows by mean_value decreasing
      orderby(.,
              cols = "mean_value",
              reverse = "mean_value")
    
    # show the tree/pipeline
    cat(format(ops))
    ## table(d; 
    ##   group,
    ##   value) %.>%
    ##  select_rows(.,
    ##    value >= 0) %.>%
    ##  project(., mean_value := mean(value),
    ##   g= group) %.>%
    ##  orderby(., desc(mean_value))

    Of course the purpose of such a pipeline is to be able to apply it to data. This is done simply with the wrapr dot arrow pipe:

    d %.>% ops
    ##    group mean_value
    ## 1:     b        2.0
    ## 2:     a        1.5

    rquery pipelines are designed to specify and execute data wrangling tasks. An important feature of rquery pipelines is: they are designed for serialization. This means we can save them and also send them to multiple nodes for parallel processing.

    # save the optree
    saveRDS(ops, "rquery_optree.RDS")
    
    # simulate a fresh R session
    rm(list=setdiff(ls(), "d"))
    
    library("rqdatatable")
    
    # read the optree back in
    ops <- readRDS('rquery_optree.RDS')
    
    # look at it
    cat(format(ops))
    ## table(d; 
    ##   group,
    ##   value) %.>%
    ##  select_rows(.,
    ##    value >= 0) %.>%
    ##  project(., mean_value := mean(value),
    ##   g= group) %.>%
    ##  orderby(., desc(mean_value))
    # use it again
    d %.>% ops
    ##    group mean_value
    ## 1:     b        2.0
    ## 2:     a        1.5
    # clean up
    rm(list=setdiff(ls(), "d"))

    We can also run rqdatatable operations in “immediate mode”, without pre-defining the pipeline or tables:

    threshold <- 0.0
    
    d %.>%
      select_rows_nse(.,
                      value >= threshold) %.>%
      project_nse(.,
                  groupby = "group",
                  mean_value = mean(value)) %.>%
      orderby(.,
              cols = "mean_value",
              reverse = "mean_value")
    ##    group mean_value
    ## 1:     b        2.0
    ## 2:     a        1.5

    wrapr function objects

    A natural question is: given that we already have rquery pipelines, why do we need wrapr function object pipelines? The reason is: rquery/rqdatatable pipelines are strict and deliberately restricted to operations that can be hosted both in R (via data.table) and on databases (examples: PostgreSQL and Spark). One might also want a more general pipeline with fewer constraints, optimized for working in R directly.

    The wrapr “function object” pipelines allow treatment of arbitrary objects as items we can pipe into. Their primary purpose is to partially apply functions to convert arbitrary objects and functions into single-argument (or unary) functions. This converted form is perfect for pipelining. This, in a sense, lets us treat these objects as functions. The wrapr function object pipeline also has less constraint checking than rquery pipelines, so is more suitable for “black box” steps that do not publish their column use and production details (in fact wrapr function object pipelines work on arbitrary objects, not just data.frames or tables).

    Let’s adapt our above example into a simple wrapr dot arrow pipeline.

    library("wrapr")
    
    threshold <- 0
    
    d %.>%
      .[.$value >= threshold, , drop = FALSE] %.>%
      tapply(.$value, .$group, 'mean') %.>%
      sort(., decreasing = TRUE)
    ##   b   a 
    ## 2.0 1.5

    All we have done is replace the rquery steps with typical base-R commands. As we see the wrapr dot arrow can route data through a sequence of such commands to repeat our example.

    Now let’s adapt our above example into a re-usable wrapr function object pipeline.

    library("wrapr")
    
    threshold <- 0
    
    pipeline <-
      srcfn(
        ".[.$value >= threshold, , drop = FALSE]" ) %.>%
      srcfn(
        "tapply(.$value, .$group, 'mean')" ) %.>%
      pkgfn(
        "sort",
        arg_name = "x",
        args = list(decreasing = TRUE))
    
    cat(format(pipeline))
    ## UnaryFnList(
    ##    SrcFunction{ .[.$value >= threshold, , drop = FALSE] }(.=., ),
    ##    SrcFunction{ tapply(.$value, .$group, 'mean') }(.=., ),
    ##    base::sort(x=., decreasing))

    We used two wrapr abstractions to capture the steps for re-use (something built in to rquery, and now also supplied by wrapr). The abstractions are:

    • srcfn() which wraps arbitrary quoted code as a function object.
    • pkgfn() which wraps a package qualified function name as a function object (“base” being the default package).

    This sort of pipeline can be applied to data using pipe notation:

    d %.>% pipeline
    ##   b   a 
    ## 2.0 1.5

    The above pipeline has one key inconvenience and one key weakness:

    • For the srcfn() steps we had to place the source code in quotes, which defeats any sort of syntax highlighting and auto-completing in our R integrated development environment (IDE).
    • The above pipeline has a reference to the value of threshold in our current environment, this means the pipeline is not sufficiently self-contained to serialize and share.

    We can quickly address both of these issues with the wrapr::qe() (“quote expression”) function. It uses base::substitute() to quote its arguments, and the IDE doesn’t know the contents are quoted and thus can help us with syntax highlighting and auto-completion. Also we are using base::bquote() .()-style escaping to bind in the value of threshold.

    pipeline <-
      srcfn(
        qe( .[.$value >= .(threshold), , drop = FALSE] )) %.>%
      srcfn(
        qe( tapply(.$value, .$group, 'mean') ))  %.>%
      pkgfn(
        "sort",
        arg_name = "x",
        args = list(decreasing = TRUE))
    
    cat(format(pipeline))
    ## UnaryFnList(
    ##    SrcFunction{ .[.$value >= 0, , drop = FALSE] }(.=., ),
    ##    SrcFunction{ tapply(.$value, .$group, "mean") }(.=., ),
    ##    base::sort(x=., decreasing))
    d %.>% pipeline
    ##   b   a 
    ## 2.0 1.5

    Notice this pipeline works as before, but no longer refers to the external value threshold. This pipeline can be saved and shared.

    Another recommended way to bind in values is with the args-argument, which is a named list of values that are expected to be available when a srcfn() is evaluated, or additional named arguments that will be applied to a pkgfn().

    In this notation the pipeline is written as follows.

    pipeline <-
      srcfn(
        qe( .[.$value >= threshold, , drop = FALSE] ),
        args = list('threshold' = threshold)) %.>%
      srcfn(
        qe( tapply(.$value, .$group, 'mean') ))  %.>%
      pkgfn(
        "sort",
        arg_name = "x",
        args = list(decreasing = TRUE))
    
    cat(format(pipeline))
    ## UnaryFnList(
    ##    SrcFunction{ .[.$value >= threshold, , drop = FALSE] }(.=., threshold),
    ##    SrcFunction{ tapply(.$value, .$group, "mean") }(.=., ),
    ##    base::sort(x=., decreasing))
    d %.>% pipeline
    ##   b   a 
    ## 2.0 1.5

    We can save this pipeline.

    saveRDS(pipeline, "wrapr_pipeline.RDS")

    And simulate using it in a fresh environment (i.e. simulate sharing it).

    # simulate a fresh environment
    rm(list = setdiff(ls(), "d"))
    
    library("wrapr")
    
    pipeline <- readRDS('wrapr_pipeline.RDS')
    
    cat(format(pipeline))
    ## UnaryFnList(
    ##    SrcFunction{ .[.$value >= threshold, , drop = FALSE] }(.=., threshold),
    ##    SrcFunction{ tapply(.$value, .$group, "mean") }(.=., ),
    ##    base::sort(x=., decreasing))
    d %.>% pipeline
    ##   b   a 
    ## 2.0 1.5

    Conclusion

    And that is some of the power of wrapr piping, rquery/rqdatatable, and wrapr function objects. Essentially wrapr function objects are a reference application of the S3/S4 piping abilities discussed in the wrapr pipe formal article.

    The technique is very convenient when each of the steps is substantial (such as non-trivial data preparation and model application steps).

    The above techniques can make reproducing and sharing methods much easier.

    We have some more examples of the technique here and here.

    # clean up after example
    unlink("rquery_optree.RDS")
    unlink("wrapr_pipeline.RDS")


    R Packages worth a look

    Parallel GLM (parglm)
    Provides a parallel estimation method for generalized linear models without compiling with a multithreaded LAPACK or BLAS.

    Utilizes the Black-Scholes Option Pricing Model to Perform Strategic Option Analysis and Plot Option Strategies (optionstrat)
    Utilizes the Black-Scholes-Merton option pricing model to calculate key option analytics and graphical analysis of various option strategies. Provides …

    Individual Tree Growth Modeling (ITGM)
    Individual tree models are an instrument to support decisions with regard to forest management. This package provides functions that let you work with …

    Native R Kernel for the ‘Jupyter Notebook’ (IRkernel)
    The R kernel for the ‘Jupyter’ environment executes R code which the front-end (‘Jupyter Notebook’ or other front-ends) submits to the kernel via the n …
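    (A typical setup, not part of the blurb above: after installing the package, the kernel still has to be registered so that Jupyter can find R.)

    install.packages("IRkernel")
    IRkernel::installspec(user = TRUE)   # register the R kernel for the current user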


    Facebook contributes to MLPerf, open-sources Mask R-CNN2Go


    Magister Dixit

    “Information is not knowledge and knowledge is not wisdom.” James Gleick


    Thanks for reading!