My Data Science Blogs

May 21, 2018

Nigel Holmes’ new illustrated book on Crazy Competitions

Nigel Holmes, the graphic designer known for his playful illustrated graphics, has a new book: Crazy Competitions. It’s exactly what it sounds like.

Whether it’s flinging frozen rats or parading in holly evergreens, racing snails or carrying wives, human beings have long displayed their creativity in wild, odd, and sometimes just wonderful rituals and competitions. To show what lengths we’ll go to uphold our eccentric customs, British American graphic designer Nigel Holmes channels his belief in the power of hilarity to bring together a bewilderingly funny tour around the globe in search of incredible events, all dryly explained with brilliant infographics in WOW! 100 Crazy Contests and Celebrations from around the World.


Document worth reading: “Introduction to Tensor Decompositions and their Applications in Machine Learning”

Tensors are multidimensional arrays of numerical values and therefore generalize matrices to multiple dimensions. While tensors first emerged in the psychometrics community in the $20^{\text{th}}$ century, they have since then spread to numerous other disciplines, including machine learning. Tensors and their decompositions are especially beneficial in unsupervised learning settings, but are gaining popularity in other sub-disciplines like temporal and multi-relational data analysis, too. The scope of this paper is to give a broad overview of tensors, their decompositions, and how they are used in machine learning. As part of this, we are going to introduce basic tensor concepts, discuss why tensors can be considered more rigid than matrices with respect to the uniqueness of their decomposition, explain the most important factorization algorithms and their properties, provide concrete examples of tensor decomposition applications in machine learning, conduct a case study on tensor-based estimation of mixture models, talk about the current state of research, and provide references to available software libraries.
Introduction to Tensor Decompositions and their Applications in Machine Learning
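
To make the "multidimensional array" view concrete, here is a minimal base R sketch. It builds a rank-one three-way tensor as an outer product of three vectors, then adds a second rank-one term. All vectors and values are made up for illustration and are not taken from the paper.

```r
# Three illustrative factor vectors (hypothetical values)
a <- c(1, 2)
b <- c(1, 0, -1)
c_vec <- c(2, 3, 4, 5)

# Rank-one tensor: entry [i, j, k] equals a[i] * b[j] * c_vec[k]
rank1 <- a %o% b %o% c_vec
dim(rank1)  # a 2 x 3 x 4 array

# Sum of two rank-one terms, the general shape that decompositions recover
tensor <- rank1 + c(0, 1) %o% c(1, 1, 1) %o% c(1, 0, 0, 1)

# Mode-1 unfolding: flatten the tensor into a 2 x 12 matrix
unfold_rank1  <- matrix(rank1, nrow = dim(rank1)[1])
unfold_tensor <- matrix(tensor, nrow = dim(tensor)[1])

# Matrix rank of each unfolding
qr(unfold_rank1)$rank
qr(unfold_tensor)$rank
```

The unfolding step is only one convenient way to inspect the structure with ordinary matrix tools; it does not by itself perform a tensor decomposition.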


Magister Dixit

“Passion Matters: Some people go to work. Others get up each morning and work with a desire to make a difference. We don’t need to save the world to make a difference.” Kaan Turnali ( Feb 21, 2015 )


If you did not already know

Dionysius
We address the following problem: How do we incorporate user item interaction signals as part of the relevance model in a large-scale personalized recommendation system such that, (1) the ability to interpret the model and explain recommendations is retained, and (2) the existing infrastructure designed for the (user profile) content-based model can be leveraged? We propose Dionysius, a hierarchical graphical model based framework and system for incorporating user interactions into recommender systems, with minimal change to the underlying infrastructure. We learn a hidden fields vector for each user by considering the hierarchy of interaction signals, and replace the user profile-based vector with this learned vector, thereby not expanding the feature space at all. Thus, our framework allows the use of existing recommendation infrastructure that supports content based features. We implemented and deployed this system as part of the recommendation platform at LinkedIn for more than one year. We validated the efficacy of our approach through extensive offline experiments with different model choices, as well as online A/B testing experiments. Our deployment of this system as part of the job recommendation engine resulted in significant improvement in the quality of retrieved results, thereby generating improved user experience and positive impact for millions of users. …

Predicted Relevance Model (PRM)
Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions in terms of relevance of the results for general (e.g., future, unknown) users. In practice however, most evaluation scenarios only allow us to conclusively determine the relevance towards the particular assessor that provided the judgments. A factor that cannot be ignored when extending conclusions made from assessors towards users, is the possible disagreement on relevance, assuming that a single gold truth label does not exist. This paper presents and analyzes the Predicted Relevance Model (PRM), which allows predicting a particular result’s relevance for a random user, based on an observed assessment and knowledge on the average disagreement between assessors. With the PRM, existing evaluation metrics designed to measure binary assessor relevance, can be transformed into more robust and effectively graded measures that evaluate relevance towards a random user. It also leads to a principled way of quantifying multiple graded or categorical relevance levels for use as gains in established graded relevance measures, such as normalized discounted cumulative gain (nDCG), which nowadays often use heuristic and data-independent gain values. Given a set of test topics with graded relevance judgments, the PRM allows evaluating systems on different scenarios, such as their capability of retrieving top results, or how well they are able to filter out non-relevant ones. Its use in actual evaluation scenarios is illustrated on several information retrieval test collections. …

Symbolic Data Analysis (SDA)
Symbolic data analysis (SDA) is an extension of standard data analysis where symbolic data tables are used as input and symbolic objects are outputted as a result. The data units are called symbolic since they are more complex than standard ones, as they not only contain values or categories, but also include internal variation and structure. SDA is based on four spaces: the space of individuals, the space of concepts, the space of descriptions, and the space of symbolic objects. The space of descriptions models individuals, while the space of symbolic objects models concepts.
An Introduction to Symbolic Data Analysis and the Sodas Software


R Packages worth a look

Region-Level Connectivity Network Construction via Kernel Canonical Correlation Analysis (brainKCCA)
It is designed to calculate connection between (among) brain regions and plot connection lines. Also, the summary function is included to summarize group-level connectivity network. Kang, Jian (2016) <doi:10.1016/j.neuroimage.2016.06.042>.

Composition of Probabilistic Preferences (CPP) (CPP)
CPP is a multiple criteria decision method to evaluate alternatives on complex decision making problems, by a probabilistic approach. The CPP was created and expanded by Sant’Anna, Annibal P. (2015) <doi:10.1007/978-3-319-11277-0>.

Datasets from ‘KEEL’ for it Use in ‘RKEEL’ (RKEELdata)
‘KEEL’ is a popular Java software for a large number of different knowledge data discovery tasks. Furthermore, ‘RKEEL’ is a package with a R code layer between R and ‘KEEL’, for using ‘KEEL’ in R code. This package includes the datasets from ‘KEEL’ in .dat format for its use in ‘RKEEL’ package. For more information about ‘KEEL’, see <http://…/>.


May 20, 2018

Data Links #154



Just two links for this section this week, but for an issue I feel is not getting all the attention it should.

  • US cell carriers are selling access to real-time phone location data. There is a very interesting discussion on Hacker News. I'll quote from the first comment in the discussion. They're sorted depending on the votes of the rest of the readers, but it's unlikely that will change right now:

    I work in location / mapping / geo. Some of us have been waiting for this to blow (which it hasn't yet). The public has zero idea how much personal location data is available.

    It's not just your cell carrier. Your cell phone chip manufacturer, GPS chip manufacturer, phone manufacturer and then pretty much anyone on the installed OS (android crapware) is getting a copy of your location data. Usually not in software but by contract, one gives gps data to all the others as part of the bill of materials.

    This is then usually (but not always) "anonymized" by cutting it in to ~5 second chunks. It's easy to put it back together again. We can figure out everything about your day from when you wake up to where you go to when you sleep.

    This data is sold to whoever wants it. Hedge funds or services who analyze it for hedge funds is the big one. It's normal to track hundreds of millions of people a day and trade stocks based on where they go. This isn't fantasy, it's what happens every day.

    Almost every web/smartphone mapping company is doing it, so is almost everyone that tracks you for some service - "turn the lights on when I get home". The web mapping companies and those that provide SDKs for "free". It's a monetization model for apps which don't need location. That's why Apple is trying hard to restrict it without scaring off consumers.

  • And the second part. Tracking Firm LocationSmart Leaked Location Data for Customers of All Major U.S. Mobile Carriers Without Consent in Real Time Via Its Web Site.

    LocationSmart, a U.S. based company that acts as an aggregator of real-time data about the precise location of mobile phone devices, has been leaking this information to anyone via a buggy component of its Web site — without the need for any password or other form of authentication or authorization — KrebsOnSecurity has learned. The company took the vulnerable service offline early this afternoon after being contacted by KrebsOnSecurity, which verified that it could be used to reveal the location of any AT&T, Sprint, T-Mobile or Verizon phone in the United States to an accuracy of within a few hundred yards.


  • Sending Inaudible Commands to Voice Assistants.

    A group of students from University of California, Berkeley, and Georgetown University showed in 2016 that they could hide commands in white noise played over loudspeakers and through YouTube videos to get smart devices to turn on airplane mode or open a website.

    This month, some of those Berkeley researchers published a research paper that went further, saying they could embed commands directly into recordings of music or spoken text. So while a human listener hears someone talking or an orchestra playing, Amazon's Echo speaker might hear an instruction to add something to your shopping list.

Data Links is a periodic blog post published on Sundays (specific time may vary) which contains interesting links about data science, machine learning and related topics. You can subscribe to it using the general blog RSS feed or this one, which only contains these articles, if you are not interested in other things I might publish.

Have you read an article you liked and would you like to suggest it for the next issue? Just contact me!


ML models: What they can’t learn?

(This article was first published on English –, and kindly contributed to R-bloggers)

What I love about conferences are the people who come up after your talk and say: It would be cool to add XYZ to your package/method/theorem.

After the eRum (great conference by the way) I was lucky to hear from Tal Galili: It would be cool to use DALEX for teaching, to show how different ML models are learning relations.

Cool idea. So let’s see what can and what cannot be learned by the most popular ML models. Here we will compare random forest against linear models against SVMs.
Find the full example here. We simulate variables from a uniform U[0,1] distribution and calculate y from the following equation

In all figures below we compare PDP model responses against the true relation between variable x and the target variable y (pink color). All these plots are created with DALEX package.

For x1 we can check how different models deal with a quadratic relation. The linear model fails without prior feature engineering, random forest is guessing the shape, but the best fit is found by SVMs.

With sine-like oscillations the story is different. SVMs are not that flexible, while random forest is much closer.

It turns out that monotonic relations are not easy for these models. The random forest is close, but even here we cannot guarantee monotonicity.

The linear model is the best one when it comes to a truly linear relation. But the other models are not that far off.

The abs(x) relation is not an easy case for any of the models.

Find the R code here.

Of course, the behavior of all these models depends on the number of observations, the noise-to-signal ratio, correlation among variables, and interactions.
Yet it may be educational to use PDP curves to see how different models learn relations: what they can grasp easily and what they cannot.
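
For readers who want to see what a PDP curve computes, here is a minimal sketch in base R. It does not use DALEX, and the simulated relation is a hypothetical stand-in rather than the equation from the full example. Partial dependence is computed by hand: fix one variable at a grid value for every row, then average the model's predictions.

```r
set.seed(1)
n <- 500
dat <- data.frame(x1 = runif(n), x2 = runif(n))

# Hypothetical target: quadratic in x1, linear in x2, plus noise
dat$y <- (2 * dat$x1 - 1)^2 + dat$x2 + rnorm(n, sd = 0.1)

# A plain linear model and one with a hand-engineered quadratic term
fit_linear <- lm(y ~ x1 + x2, data = dat)
fit_quad   <- lm(y ~ poly(x1, 2) + x2, data = dat)

# Partial dependence: set `variable` to each grid value for all rows,
# then average the predictions over the data
partial_dependence <- function(model, data, variable, grid) {
  sapply(grid, function(v) {
    data[[variable]] <- v
    mean(predict(model, newdata = data))
  })
}

grid <- seq(0, 1, by = 0.25)
pdp_linear <- partial_dependence(fit_linear, dat, "x1", grid)
pdp_quad   <- partial_dependence(fit_quad, dat, "x1", grid)

# True relation averaged over x2, whose expectation is 0.5
true_curve <- (2 * grid - 1)^2 + 0.5

round(cbind(grid, true_curve, pdp_linear, pdp_quad), 2)
```

In this toy setup the PDP of the plain linear model is a straight line by construction, so it cannot follow the quadratic shape, while the model with the engineered term tracks it closely.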

To leave a comment for the author, please follow the link and comment on their blog: English – offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Book Memo: “Building a Recommendation System with R”

A recommendation system performs extensive data analysis in order to generate suggestions to its users about what might interest them. R has recently become one of the most popular programming languages for data analysis. Its structure allows you to interactively explore the data and its modules contain the most cutting-edge techniques thanks to its wide international community. This distinctive feature of the R language makes it a preferred choice for developers who are looking to build recommendation systems. The book will help you understand how to build recommender systems using R. It starts off by explaining the basics of data mining and machine learning. Next, you will be familiarized with how to build and optimize recommender models using R. Following that, you will be given an overview of the most popular recommendation techniques. Finally, you will learn to implement all the concepts you have learned throughout the book to build a recommender system.


Book Memo: “Guide to Modeling and Simulation of Systems of Systems”

This easy-to-follow textbook provides an exercise-driven guide to the use of the Discrete Event Systems Specification (DEVS) simulation modeling formalism and the System Entity Structure (SES) simulation model ontology supported with the latest advances in software architecture and design principles, methods, and tools for building and testing virtual Systems of Systems (SoS). The book examines a wide variety of SoS problems, ranging from cloud computing systems to biological systems in agricultural food crops. This enhanced and expanded second edition also features a new chapter on DEVS support for Markov modeling and simulation. Topics and features: provides an extensive set of exercises throughout the text to reinforce the concepts and encourage use of the tools, supported by introduction and summary sections; discusses how the SoS concept and supporting virtual build and test environments can overcome the limitations of current approaches; offers a step-by-step introduction to the DEVS concepts and modeling environment features required to build sophisticated SoS models; describes the capabilities and use of the tools CoSMoS/DEVS-Suite, Virtual Laboratory Environment, and MS4 Me™; reviews a range of diverse applications, from the development of new satellite design and launch technologies, to surveillance and control in animal epidemiology; examines software/hardware co-design for SoS, and activity concepts that bridge information-level requirements and energy consumption in the implementation; demonstrates how the DEVS formalism supports Markov modeling within an advanced modeling and simulation environment (NEW). This accessible and hands-on textbook/reference provides invaluable practical guidance for graduate students interested in simulation software development and cyber-systems engineering design, as well as for practitioners in these, and related areas.


Rcpp 0.12.17: More small updates

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Another bi-monthly update and the seventeenth release in the 0.12.* series of Rcpp landed on CRAN late on Friday following nine (!!) days in gestation in the incoming/ directory of CRAN. And no complaints: we just wish CRAN were a little more forthcoming with what is happening when, and/or would let us help by supplying additional test information. I do run a fairly insane amount of backtests prior to releases; to then have to wait another week or more is … not ideal. But again, we all owe CRAN an immense amount of gratitude for all they do, and do so well.

So once more, this release follows the 0.12.0 release from July 2015, the 0.12.1 release in September 2015, the 0.12.2 release in November 2015, the 0.12.3 release in January 2016, the 0.12.4 release in March 2016, the 0.12.5 release in May 2016, the 0.12.6 release in July 2016, the 0.12.7 release in September 2016, the 0.12.8 release in November 2016, the 0.12.9 release in January 2017, the 0.12.10 release in March 2017, the 0.12.11 release in May 2017, the 0.12.12 release in July 2017, the 0.12.13 release in late September 2017, the 0.12.14 release in November 2017, the 0.12.15 release in January 2018 and the 0.12.16 release in March 2018, making it the twenty-first release at the steady and predictable bi-monthly release frequency.

Rcpp has become the most popular way of enhancing GNU R with C or C++ code. As of today, 1362 packages on CRAN depend on Rcpp for making analytical code go faster and further, along with another 138 in the current BioConductor release 3.7.

Compared to other releases, this release again contains a relatively small change set, but Kevin and Romain cleaned a few things up between them. Full details are below.

Changes in Rcpp version 0.12.17 (2018-05-09)

  • Changes in Rcpp API:

    • The random number Generator class no longer inherits from RNGScope (Kevin in #837 fixing #836).

    • A spurious parenthesis was removed to please gcc8 (Dirk fixing #841).

    • The optional Timer class header now undefines FALSE which was seen to have side-effects on some platforms (Romain in #847 fixing #846).

    • Optional StoragePolicy attributes now also work for string vectors (Romain in #850 fixing #849).

Thanks to CRANberries, you can also look at a diff to the previous release. As always, details are on the Rcpp Changelog page and the Rcpp page which also leads to the downloads page, the browseable doxygen docs and zip files of doxygen output for the standard formats. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Prior distributions and the Australia principle

The Kangaroo with a feather effect

There’s an idea in philosophy called the Australia principle—I don’t know the original of this theory but here’s an example that turned up in a google search—that posits that Australia doesn’t exist; instead, they just build the parts that are needed when you visit: a little mock-up of the airport, a cityscape with a model of the Sydney Opera House in the background, some kangaroos, a bunch of desert in case you go into the outback, etc. The idea is that it would be ridiculously inefficient to build an entire continent and that it makes much more sense for them to just construct a sort of stage set for the few places you’ll ever go.

And this is the principle underlying the article, The prior can often only be understood in the context of the likelihood, by Dan Simpson, Mike Betancourt, and myself. The idea is that, for any given problem, for places in parameter space where the likelihood is strong, relative to the questions you’re asking, you won’t need to worry much about the prior; something vague will do. And in places where the likelihood is weak, relative to the questions you’re asking, you’ll need to construct more of a prior to make up the difference.

This implies:
1. The prior can often only be understood in the context of the likelihood.
2. What prior is needed can depend on the question being asked.

To follow up on item 2, consider a survey of 3000 people, each of whom is asked a binary survey response, and suppose this survey is a simple random sample of the general population. If this is a public opinion poll, N = 3000 is more than enough: the standard error of the sample proportion is something like 0.5/sqrt(3000) = 0.01; you can estimate a proportion to an accuracy of about 1 percentage point, which is fine for all practical purposes, especially considering that, realistically, nonsampling error will likely be more than that anyway. On the other hand, if the question on this survey of 3000 people is whether your baby is a boy or a girl, and if the goal is to compare sex ratios of beautiful and ugly parents, then N = 3000 is way way too small to tell you anything (see, for example, the discussion on page 645 here), and if you want any kind of reasonable posterior distribution for the difference in sex ratios you’ll need a strong prior. You need to supply the relevant scenery yourself, as it’s not coming from the likelihood.
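
The arithmetic above can be checked in a few lines of R. This is only a sketch: the even split of the 3000 respondents into two groups of 1500 is an assumption made here for illustration, and the worst-case proportion of 0.5 is used throughout.

```r
n <- 3000

# Standard error of a single sample proportion, worst case p = 0.5
se_proportion <- 0.5 / sqrt(n)
round(se_proportion, 4)

# Standard error of a difference between two group proportions,
# assuming (for illustration only) two equal groups of n / 2 each
n_group <- n / 2
se_difference <- sqrt(0.5^2 / n_group + 0.5^2 / n_group)
round(se_difference, 4)

# Under this split the difference is twice as noisy as the overall proportion
se_difference / se_proportion
```

So the same N that pins down one proportion to about a percentage point leaves a comparison between two subgroups roughly twice as uncertain, which is where the prior has to do more of the work.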

The same principle—that the prior you need depends on the other information you have and the question you’re asking—also applies to assumptions within the data model (which in turn determines the likelihood). But for simplicity here we’re following the usual convention and pretending that the likelihood is known exactly ahead of time so that all the modeling choices arise in the prior.

P.S. The funny thing is, Dan Simpson is from Australia himself. Just a coincidence, I’m sure.

The post Prior distributions and the Australia principle appeared first on Statistical Modeling, Causal Inference, and Social Science.


Statistics Sunday: Welcome to Sentiment Analysis with “Hotel California”

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

Welcome to the Hotel California

As promised in last week’s post, this week: sentiment analysis, also with song lyrics.

Sentiment analysis is a method of natural language processing that involves classifying words in a document based on whether a word is positive or negative, or whether it is related to a set of basic human emotions; the exact results differ based on the sentiment analysis method selected. The tidytext R package has 4 different sentiment analysis methods:

  • “AFINN” for Finn Årup Nielsen – which classifies words from -5 to +5 in terms of negative or positive valence
  • “bing” for Bing Liu and colleagues – which classifies words as either positive or negative
  • “loughran” for Loughran-McDonald – mostly for financial and nonfiction works, which classifies as positive or negative, as well as topics of uncertainty, litigious, modal, and constraining
  • “nrc” for the NRC lexicon – which classifies words into eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as positive or negative sentiment

Sentiment analysis works on unigrams – single words – but you can aggregate across multiple words to look at sentiment across a text.
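
To show the unigram idea in isolation, here is a minimal base R sketch. The four-word lexicon and the sample sentence are hypothetical, made up for illustration; real lexicons such as bing or nrc are far larger and are accessed through tidytext.

```r
# Hypothetical mini-lexicon mapping single words to a sentiment label
lexicon <- c(lovely = "positive", sweet = "positive",
             dark = "negative", prisoners = "negative")

# Made-up sample text (not taken from the lyrics)
text <- "A lovely place at the end of a dark, dark road"

# Tokenize into lowercase unigrams, dropping punctuation
words <- strsplit(tolower(gsub("[^A-Za-z ]", "", text)), " +")[[1]]

# Look each word up; words missing from the lexicon come back as NA
word_sentiment <- lexicon[words]
matched <- word_sentiment[!is.na(word_sentiment)]

# Aggregate across words to get a sentiment for the whole text
counts <- table(factor(matched, levels = c("negative", "positive")))
score <- counts[["positive"]] - counts[["negative"]]

counts
score
```

Most words simply drop out because they are not in the lexicon, which is the same reason the match ratios later in this post are low.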

To demonstrate sentiment analysis, I’ll use one of my favorite songs: “Hotel California” by the Eagles.

I know, I know.

Using similar code as last week, let’s pull in the lyrics of the song.

# Packages this post relies on
library(genius); library(tidyverse); library(tidytext)
hotel_calif <- genius_lyrics(artist = "Eagles", song = "Hotel California") %>%
  mutate(line = row_number())

First, we’ll chop up these 43 lines into individual words, using the tidytext package and unnest_tokens function.

tidy_hc <- hotel_calif %>%
  unnest_tokens(word, lyric)  # assumes genius returns the text in a `lyric` column

This is also probably the point I would remove stop words with anti_join. But these common words are very unlikely to have a sentiment attached to them, so I’ll leave them in, knowing they’ll be filtered out anyway by this analysis. We have 4 lexicons to choose from. Loughran is more financial and textual, but we’ll still see how well it can classify the words anyway. First, let’s create a data frame of our 4 sentiment lexicons.

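# Assumes an older tidytext in which `sentiments` bundles all four lexicons
# and carries `lexicon` and `score` columns
new_sentiments <- sentiments %>%
  mutate(sentiment = ifelse(lexicon == "AFINN" & score >= 0, "positive",
                            ifelse(lexicon == "AFINN" & score < 0,
                                   "negative", sentiment))) %>%
  group_by(lexicon) %>%
  mutate(words_in_lexicon = n_distinct(word)) %>%
  ungroup()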

Now, we’ll see how well the 4 lexicons match up with the words in the lyrics. Big thanks to Debbie Liske at Data Camp for this piece of code (and several other pieces used in this post):

# kable() comes from knitr, kable_styling() from kableExtra,
# color_bar() and color_tile() from formattable
library(knitr); library(kableExtra); library(formattable)

my_kable_styling <- function(dat, caption) {
  kable(dat, "html", escape = FALSE, caption = caption) %>%
    kable_styling(bootstrap_options = c("striped", "condensed", "bordered"),
                  full_width = FALSE)
}

tidy_hc %>%
  mutate(words_in_lyrics = n_distinct(word)) %>%
  inner_join(new_sentiments) %>%
  group_by(lexicon, words_in_lyrics, words_in_lexicon) %>%
  summarise(lex_match_words = n_distinct(word)) %>%
  ungroup() %>%
  mutate(total_match_words = sum(lex_match_words),
         match_ratio = lex_match_words/words_in_lyrics) %>%
  select(lexicon, lex_match_words, words_in_lyrics, match_ratio) %>%
  mutate(lex_match_words = color_bar("lightblue")(lex_match_words),
         lexicon = color_tile("lightgreen","lightgreen")(lexicon)) %>%
  my_kable_styling(caption = "Lyrics Found In Lexicons")
## Joining, by = "word"
Lyrics Found In Lexicons

lexicon   lex_match_words  words_in_lyrics  match_ratio
AFINN                  18              175    0.1028571
bing                   18              175    0.1028571
loughran                1              175    0.0057143
nrc                    23              175    0.1314286

NRC offers the best match, classifying about 13% of the words in the lyrics. (It’s not unusual to have such a low percentage. Not all words have a sentiment.)

hcsentiment <- tidy_hc %>%
inner_join(get_sentiments("nrc"), by = "word")

## # A tibble: 103 x 4
##    track_title       line word   sentiment
##  1 Hotel California     1 dark   sadness
##  2 Hotel California     1 desert anger
##  3 Hotel California     1 desert disgust
##  4 Hotel California     1 desert fear
##  5 Hotel California     1 desert negative
##  6 Hotel California     1 desert sadness
##  7 Hotel California     1 cool   positive
##  8 Hotel California     2 smell  anger
##  9 Hotel California     2 smell  disgust
## 10 Hotel California     2 smell  negative
## # ... with 93 more rows

Let’s visualize the counts of different emotions and sentiments in the NRC lexicon.

theme_lyrics <- function(aticks = element_blank(),
                         pgminor = element_blank(),
                         lt = element_blank(),
                         lp = "none") {
  theme(plot.title = element_text(hjust = 0.5), # Center the title
        axis.ticks = aticks,                    # Set axis ticks to on or off
        panel.grid.minor = pgminor,             # Turn the minor grid lines on or off
        legend.title = lt,                      # Turn the legend title on or off
        legend.position = lp)                   # Turn the legend on or off
}

hcsentiment %>%
  group_by(sentiment) %>%
  summarise(word_count = n()) %>%
  ungroup() %>%
  mutate(sentiment = reorder(sentiment, word_count)) %>%
  ggplot(aes(sentiment, word_count, fill = -word_count)) +
  geom_col() +
  guides(fill = FALSE) +
  theme_minimal() + theme_lyrics() +
  labs(x = NULL, y = "Word Count") +
  ggtitle("Hotel California NRC Sentiment Totals")

Most of the words appear to be positively-valenced. How do the individual words match up?


library(ggrepel)  # provides geom_label_repel()

plot_words <- hcsentiment %>%
  group_by(sentiment) %>%
  count(word, sort = TRUE) %>%
  arrange(desc(n)) %>%
  ungroup()

plot_words %>%
  ggplot(aes(word, 1, label = word, fill = sentiment)) +
  geom_point(color = "white") +
  geom_label_repel(force = 1, nudge_y = 0.5,
                   direction = "y",
                   box.padding = 0.04,
                   segment.color = "white",
                   size = 3) +
  facet_grid(~sentiment) +
  theme_lyrics() +
  theme(axis.text.y = element_blank(), axis.line.x = element_blank(),
        axis.title.x = element_blank(), axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid = element_blank(), panel.background = element_blank(),
        panel.border = element_rect("lightgray", fill = NA),
        strip.text.x = element_text(size = 9)) +
  xlab(NULL) + ylab(NULL) +
  ggtitle("Hotel California Words by NRC Sentiment")

It looks like some words are being misclassified. For instance, “smell” as in “warm smell of colitas” is being classified as anger, disgust, and negative. But that doesn’t explain the overall positive bent being applied to the song. If you listen to the song, you know it’s not really a happy song. It starts off somewhat negative – or at least, ambiguous – as the narrator is driving on a dark desert highway. He’s tired and having trouble seeing, and notices the Hotel California, a shimmering oasis on the horizon. He stops in and is greeted by a “lovely face” in a “lovely place.” At the hotel, everyone seems happy: they dance and drink, they have fancy cars, they have pretty “friends.”

But the song is in a minor key. Though not always a sign that a song is sad, it is, at the very least, a hint of something ominous, lurking below the surface. Soon, things turn bad for the narrator. The lovely-faced woman tells him they are “just prisoners here of our own device.” He tries to run away, but the night man tells him, “You can check out anytime you like, but you can never leave.”

The song seems to be a metaphor for something, perhaps fame and excess, which was also the subject of another song on the same album, “Life in the Fast Lane.” To someone seeking fame, life is dreary, dark, and deserted. Fame is like an oasis – beautiful and shimmering, an escape. But it isn’t all it appears to be. You may be surrounded by beautiful people, but you can only call them “friends.” You trust no one. And once you join that lifestyle, you might be able to check out, perhaps through farewell tour(s), but you can never leave that life – people know who you are (or were) and there’s no disappearing. And it could be about something even darker that it’s hard to escape from, like substance abuse. Whatever meaning you ascribe to the song, the overall message seems to be that things are not as wonderful as they appear on the surface.

So if we follow our own understanding of the song’s trajectory, we’d say it starts off somewhat negatively, becomes positive in the middle, then dips back into the negative at the end, when the narrator tries to escape and finds he cannot.

We can chart this, using the line number, which coincides with the location of the word in the song. We’ll stick with NRC since it offered the best match, but for simplicity, we’ll only pay attention to the positive and negative sentiment codes.

hcsentiment_index <- tidy_hc %>%
  inner_join(get_sentiments("nrc") %>%
               filter(sentiment %in% c("positive",
                                       "negative"))) %>%
  count(index = line, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"

This gives us a data frame that aggregates sentiment by line. If a line contains more positive than negative words, its overall sentiment is positive, and vice versa. Because not every word in the lyrics has a sentiment, not every line has an associated aggregate sentiment. But it gives us a sort of trajectory over the course of the song. We can visualize this trajectory like this:

hcsentiment_index %>%
ggplot(aes(index, sentiment, fill = sentiment > 0)) +
geom_col(show.legend = FALSE)

As the chart shows, the song starts somewhat positive, with a dip soon after into the negative. The middle of the song is positive, as the narrator describes the decadence of the Hotel California. But it turns dark at the end, and stays that way as the guitar solo soars in.


Thanks to this awesome post by Debbie Liske, mentioned earlier, for the code and custom functions that make my charts pretty.

Text Mining with R: A Tidy Approach by Julia Silge and David Robinson

To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Book Memo: “Data Literacy”

How to Make Your Experiments Robust and Reproducible
Data Literacy: How to Make Your Experiments Robust and Reproducible provides an overview of basic concepts and skills in handling data, which are common to diverse areas of science. Readers will get a good grasp of the steps involved in carrying out a scientific study and will understand some of the factors that make a study robust and reproducible. The book covers several major modules such as experimental design, data cleansing and preparation, statistical analysis, data management, and reporting. No specialized knowledge of statistics or computer programming is needed to fully understand the concepts presented.
This book is a valuable source for biomedical and health sciences graduate students and researchers, in general, who are interested in handling data to make their research reproducible and more efficient.
• Presents the content in an informal tone and with many examples taken from the daily routine at laboratories
• Can be used for self-studying or as an optional book for more technical courses
• Brings an interdisciplinary approach which may be applied across different areas of sciences

Continue Reading…


Read More

Book Memo: “Data Science”

Innovative Developments in Data Analysis and Clustering
This edited volume on the latest advances in data science covers a wide range of topics in the context of data analysis and classification. In particular, it includes contributions on classification methods for high-dimensional data, clustering methods, multivariate statistical methods, and various applications. The book gathers a selection of peer-reviewed contributions presented at the Fifteenth Conference of the International Federation of Classification Societies (IFCS2015), which was hosted by the Alma Mater Studiorum, University of Bologna, from July 5 to 8, 2015.

Continue Reading…


Read More

If you did not already know

Vector Field Based Neural Network
A novel Neural Network architecture is proposed using the mathematically and physically rich idea of vector fields as hidden layers to perform nonlinear transformations in the data. The data points are interpreted as particles moving along a flow defined by the vector field which intuitively represents the desired movement to enable classification. The architecture moves the data points from their original configuration to a new one following the streamlines of the vector field with the objective of achieving a final configuration where classes are separable. An optimization problem is solved through gradient descent to learn this vector field. …

Generative Adversarial Capsule Network (CapsuleGAN)
We present Generative Adversarial Capsule Network (CapsuleGAN), a framework that uses capsule networks (CapsNets) instead of the standard convolutional neural networks (CNNs) as discriminators within the generative adversarial network (GAN) setting, while modeling image data. We provide guidelines for designing CapsNet discriminators and the updated GAN objective function, which incorporates the CapsNet margin loss, for training CapsuleGAN models. We show that CapsuleGAN outperforms convolutional-GAN at modeling image data distribution on the MNIST dataset of handwritten digits, evaluated on the generative adversarial metric and at semi-supervised image classification. …

MFCMT
Discriminative Correlation Filters (DCF)-based tracking algorithms exploiting conventional handcrafted features have achieved impressive results both in terms of accuracy and robustness. Template handcrafted features have shown excellent performance, but they perform poorly when the appearance of the target changes rapidly, such as under fast motion and fast deformation. In contrast, statistical handcrafted features are insensitive to fast state changes, but they yield inferior performance under illumination variations and background clutter. In this work, to achieve efficient tracking performance, we propose a novel visual tracking algorithm, named MFCMT, based on a complementary ensemble model with multiple features, including Histogram of Oriented Gradients (HOGs), Color Names (CNs) and Color Histograms (CHs). Additionally, to improve tracking results and prevent target drift, we introduce an effective fusion method that exploits relative entropy to coalesce all basic response maps and obtain an optimal response. Furthermore, we suggest a simple but efficient update strategy to boost tracking performance. Comprehensive evaluations on two tracking benchmarks demonstrate that our method is competitive with numerous state-of-the-art trackers. Our tracker achieves impressive performance with faster speed on these benchmarks. …

Continue Reading…


Read More

May 19, 2018

R Packages worth a look

Group Sequential Design for a Clinical Trial with Censored Survival Data (SurvGSD)
Sample size calculation utilizing the information fraction and the alpha spending function in a group sequential clinical trial with censored survival data from underlying generalized gamma survival distributions or log-logistic survival distributions. Hsu, C.-H., Chen, C.-H, Hsu, K.-N. and Lu, Y.-H. (2018) A useful design utilizing the information fraction in a group sequential clinical trial with censored survival data. To appear in Biometrics.

JAR Dependencies for the ‘DatabaseConnector’ Package (DatabaseConnectorJars)
Provides external JAR dependencies for the ‘DatabaseConnector’ package.

Estimate ED50 Based on Modified Turning Point Method (modTurPoint)
The turning point method was proposed by Choi (1990) <doi:10.2307/2531453> to estimate the 50 percent effective dose (ED50) in the study of drug sensitivity. Its advantage is that it provides robust ED50 estimates. This package contains a modified version of Choi’s turning point method.

Continue Reading…


Read More

R/exams @ eRum 2018

(This article was first published on R/exams, and kindly contributed to R-bloggers)

Keynote lecture about R/exams at eRum 2018 (European R Users Meeting) in Budapest: Slides, video, e-learning, replication materials.

Keynote lecture at eRum 2018

R/exams was presented in a keynote lecture by Achim Zeileis at eRum 2018, the European R Users Meeting, this time organized by a team around Gergely Daróczi in Budapest. It was a great event with many exciting presentations, reflecting the vibrant R community in Europe (and beyond).

This blog post provides various resources accompanying the presentation which may be of interest to those who did not attend the meeting as well as those who did and who want to explore the materials in more detail.

Most importantly the presentation slides are available in PDF format (under CC-BY):



The eRum organizers did a great job in making the meeting accessible to those useRs who could not make it to Budapest. All presentations were available in a livestream on YouTube where also videos of all lectures were made available after the meeting (Standard YouTube License):



To illustrate the e-learning capabilities supported by R/exams, the presentation started with a live quiz using the audience response system ARSnova. The original version of the quiz was hosted on the ARSnova installation at Universität Innsbruck. To encourage readers to try out ARSnova for their own purposes, a copy of the quiz was also posted on the official ARSnova server at Technische Hochschule Mittelhessen (where ARSnova is developed under the General Public License, GPL):


The presentation briefly also showed an online test generated by R/exams and imported into OpenOLAT, an open-source learning management system (available under the Apache License). The online test is made available again here for anonymous guest access. (Note however, that the system only has one guest user so that when you start the test there may already be some test results from a previous guest session. In that case you can finish the test and also start it again.)


Replication code

The presentation slides show how to set up an exam using the R package and then render it into different output formats. In order to allow the same exam to be rendered into a wide range of different output formats, only single-choice and multiple-choice exercises were employed (see the choice list below). However, in the e-learning test shown in OpenOLAT all exercise types are supported (see the elearn list below). All these exercises are readily provided in the package and also introduced online: deriv/deriv2, fruit/fruit2, ttest, boxplots, cholesky, lm, function. The code below uses the R/LaTeX (.Rnw) version but the R/Markdown version (.Rmd) could also be used instead.

## package
library("exams")

## single-choice and multiple-choice only
choice <- list("deriv2.Rnw", "fruit2.Rnw", c("ttest.Rnw", "boxplots.Rnw"))

## e-learning test (all exercise types)
elearn <- c("deriv.Rnw", "fruit.Rnw", "ttest.Rnw", "boxplots.Rnw",
  "cholesky.Rnw", "lm.Rnw", "function.Rnw")

First, the exam with the choice-based questions can be easily turned into a PDF exam in NOPS format using exams2nops, here using Hungarian language for illustration. Exams in this format can be easily scanned and evaluated within R.

exams2nops(choice, institution = "eRum 2018", language = "hu")

Second, the choice-based exam version can be exported into the JSON format for ARSnova: Rexams-1.json. This contains an entire ARSnova session that can be directly imported into the ARSnova system as shown above. It employs a custom exercise set up just for eRum (conferences.Rmd) as well as a slightly tweaked exercise (fruit3.Rmd) that displays better in ARSnova.

exams2arsnova(list("conferences.Rmd", choice[[1]], "fruit3.Rmd", choice[[3]]),
  name = "R/exams", abstention = FALSE, fix_choice = TRUE)

Third, the e-learning exam can be generated in QTI 1.2 format for OpenOLAT, as shown above: The exams2openolat command below is provided starting from the current R/exams version 2.3-1. It essentially just calls exams2qti12 but slightly tweaks the MathJax output from pandoc so that it is displayed properly by OpenOLAT.

exams2openolat(elearn, name = "eRum-2018", n = 10, qti = "1.2")

What else?

In the last part of the presentation a couple of new and ongoing efforts within the R/exams project are highlighted. First, the natural language support in NOPS exams is mentioned which was recently described in more detail in this blog. Second, the relatively new “stress tester” was illustrated with the following example. (A more detailed blog post will follow soon.)

s <- stresstest_exercise("deriv2.Rnw")

Finally, a psychometric analysis illustrated how to examine exams regarding: Exercise difficulty, student performance, unidimensionality, fairness. The replication code for the results from the slides is included below (omitting some graphical details for simplicity, e.g., labeling or color).

## load data and exclude extreme scorers
library("psychotools")  # provides raschmodel() and anchortest() used below
data("MathExam14W", package = "psychotools")
mex <- subset(MathExam14W, nsolved > 0 & nsolved < 13)

## raw data

## Rasch model parameters
mr <- raschmodel(mex$solved)
plot(mr, type = "profile")

## points per student
MathExam14W <- transform(MathExam14W,
  points = 2 * nsolved - 0.5 * rowSums(credits == 1))
hist(MathExam14W$points, breaks = -4:13 * 2 + 0.5, col = "lightgray")
abline(v = 12.5, lwd = 2, col = 2)

## person-item map
plot(mr, type = "piplot")

## principal component analysis
pr <- prcomp(mex$solved, scale = TRUE)
biplot(pr, col = c("transparent", "black"),
  xlim = c(-0.065, 0.005), ylim = c(-0.04, 0.065))

## differential item functioning
mr1 <- raschmodel(subset(mex, group == 1)$solved)
mr2 <- raschmodel(subset(mex, group == 2)$solved)
ma <- anchortest(mr1, mr2, adjust = "single-step")

## anchored item difficulties
plot(mr1, parg = list(ref = ma$anchor_items), ref = FALSE, ylim = c(-2, 3), pch = 19)
plot(mr2, parg = list(ref = ma$anchor_items), ref = FALSE, add = TRUE, pch = 19, border = 4)
legend("topleft", paste("Group", 1:2), pch = 19, col = c(1, 4), bty = "n")

## simultaneous Wald test for pairwise differences

Continue Reading…


Read More

Document worth reading: “The Expressive Power of Neural Networks: A View from the Width”

The expressive power of neural networks is important for understanding deep learning. Most existing works consider this problem from the view of the depth of a network. In this paper, we study how width affects the expressiveness of neural networks. Classical results state that \emph{depth-bounded} (e.g. depth-$2$) networks with suitable activation functions are universal approximators. We show a universal approximation theorem for \emph{width-bounded} ReLU networks: width-$(n+4)$ ReLU networks, where $n$ is the input dimension, are universal approximators. Moreover, except for a measure zero set, all functions cannot be approximated by width-$n$ ReLU networks, which exhibits a phase transition. Several recent works demonstrate the benefits of depth by proving the depth-efficiency of neural networks. That is, there are classes of deep networks which cannot be realized by any shallow network whose size is no more than an \emph{exponential} bound. Here we pose the dual question on the width-efficiency of ReLU networks: Are there wide networks that cannot be realized by narrow networks whose size is not substantially larger? We show that there exist classes of wide networks which cannot be realized by any narrow network whose depth is no more than a \emph{polynomial} bound. On the other hand, we demonstrate by extensive experiments that narrow networks whose size exceed the polynomial bound by a constant factor can approximate wide and shallow network with high accuracy. Our results provide more comprehensive evidence that depth is more effective than width for the expressiveness of ReLU networks. The Expressive Power of Neural Networks: A View from the Width

Continue Reading…


Read More

RcppGSL 0.3.5

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance update of RcppGSL just brought version 0.3.5 to CRAN, a mere twelve days after the RcppGSL 0.3.4 release. Just like yesterday’s upload of inline 0.3.15 it was prompted by a CRAN request to update the per-package manual page; see the inline post for details.

The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.

No user-facing new code or features were added. The NEWS file entries follow below:

Changes in version 0.3.5 (2018-05-19)

  • Update package manual page using references to DESCRIPTION file [CRAN request].

Courtesy of CRANberries, a summary of changes to the most recent release is available.

More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Continue Reading…


Read More

openrouteservice – geodata!

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

The openrouteservice provides a new method to get geodata into R. It has an API (or a set of them) and an R package has been written to communicate with said API(s) and is available from GitHub. I’ve just been playing around with the examples on this page, in the thought of using it for a project (more on that later if I get anywhere with it).

Anyways…onto the code…which is primarily a modification from the examples page I mentioned earlier (see that page for more examples).


Load some libraries


Set the API key


Define the locations of interest and send a request to the API asking for the region that is accessible within a 15-minute drive of each coordinate.

coordinates <- list(c(8.55, 47.23424), c(8.34234, 47.23424), c(8.44, 47.4))

x <- ors_isochrones(coordinates, 
 range = 60*15, # maximum time to travel (15 mins)
 interval = 60*15, # results in bands of 60*15 seconds (15 mins)
 intersections=FALSE) # no intersection of polygons

By changing the interval to, say, 60*5, three regions per coordinate are returned, representing the areas accessible within a 5-, 10- and 15-minute drive. Setting the intersections argument to TRUE would produce a separate polygon for any overlapping regions. The information attached to the intersected polygons is limited though, so it might be better to do the intersection with other tools afterwards.
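As a quick sanity check on the range/interval arithmetic, here is a small base R sketch. It assumes, as the paragraph above describes, that the bands are cut at multiples of interval up to range; the variable names are mine and no API call is made.

```r
range_s    <- 60 * 15  # maximum travel time in seconds (15 mins)
interval_s <- 60 * 5   # band width in seconds (5 mins)

# Upper limit of each band, in seconds
breaks <- seq(interval_s, range_s, by = interval_s)

breaks / 60     # band limits in minutes
length(breaks)  # number of regions per coordinate
```

With interval equal to range, as in the request above, the same calculation yields a single 15-minute band.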

The results can be plotted with leaflet…

leaflet() %>%
 addTiles() %>%
 addGeoJSON(x)

The blue regions are the three regions accessible within 15 minutes. A few overlapping regions are evident, each of which would be saved to a unique polygon had we set intersections to TRUE.

The results from the API arrive as GeoJSON and are given a class of, in this case, ors_isochrones. Not many tools recognize that class, so you might want to convert the result to an sp object, which gives access to all of the tools for that format. That’s easy enough to do via the geojsonio package…

library(geojsonio)  # provides geojson_sp()
class(x) <- "geo_list"
y <- geojson_sp(x)


You can also derive coordinates from (partial) addresses. Here is an example for a region of Bern in Switzerland, using the postcode.

coord <- ors_geocode("3012, Switzerland")

This resulted in 10 hits, the first of which was correct…the others were in different countries…

unlist(lapply(coord$features, function(x) x$properties$label))
[1] "3012, Bern, Switzerland"                                
 [2] "A1, Bern, Switzerland"                                  
 [3] "Bremgartenstrasse, Bern, Switzerland"                   
 [4] "131 Bremgartenstrasse, Bern, Switzerland"               
 [5] "Briefeinwurf Bern, Gymnasium Neufeld, Bern, Switzerland"
 [6] "119 Bremgartenstrasse, Bern, Switzerland"               
 [7] "Gym Neufeld, Bern, Switzerland"                         
 [8] "131b Bremgartenstrasse, Bern, Switzerland"              
 [9] "Gebäude Nord, Bern, Switzerland"                        
[10] "113 Bremgartenstrasse, Bern, Switzerland"

The opposite (coordinate to address) is also possible, again returning multiple hits…

address <- ors_geocode(location = c(7.425898, 46.961598))
unlist(lapply(address$features, function(x) x$properties$label))
[1] "3012, Bern, Switzerland" 
[2] "A1, Bern, Switzerland" 
[3] "Bremgartenstrasse, Bern, Switzerland" 
[4] "131 Bremgartenstrasse, Bern, Switzerland" 
[5] "Briefeinwurf Bern, Gymnasium Neufeld, Bern, Switzerland" 
[6] "119 Bremgartenstrasse, Bern, Switzerland" 
[7] "Gym Neufeld, Bern, Switzerland" 
[8] "131b Bremgartenstrasse, Bern, Switzerland" 
[9] "Gebäude Nord, Bern, Switzerland" 
[10] "113 Bremgartenstrasse, Bern, Switzerland" 

Other options are distances/times/directions between points and places of interest (POI) near a point or within a region.

Hope that helps someone! Enjoy!



Continue Reading…


Read More

Create Code Metrics with cloc

(This article was first published on R –, and kindly contributed to R-bloggers)

The cloc Perl script (yes, Perl!) by Al Danial has been one of the go-to tools for generating code metrics. Given a single file, directory tree, archive, or git repo, cloc can speedily give you metrics on the count of blank lines, comment lines, and physical lines of source code in a vast array of programming languages.
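To give a feel for what those three counts mean, here is a deliberately naive base R sketch for R source lines. It is not how cloc works internally (cloc handles many languages and many edge cases); the count_lines() helper and the sample lines are made up for illustration.

```r
# Rough line classifier for R source; a simplification, not cloc's actual rules
count_lines <- function(lines) {
  trimmed <- trimws(lines)
  blank   <- trimmed == ""               # nothing but whitespace
  comment <- startsWith(trimmed, "#")    # whole-line comments only
  c(blank = sum(blank), comment = sum(comment), code = sum(!blank & !comment))
}

src <- c("# add two numbers", "add <- function(x, y) {", "  x + y  # sum", "}", "")
res <- count_lines(src)
res
```

Note that a line with code followed by a trailing comment is counted as code here.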

I don’t remember the full context but someone in the R community asked about this type of functionality and I had tossed together a small script-turned-package to thinly wrap the Perl cloc utility. Said package was and is unimaginatively named cloc. Thanks to some collaborative input from @ma_salmon, the package gained more features. Recently I added the ability to process R markdown (Rmd) files (i.e. only count lines in code chunks) to the main cloc Perl script and was performing some general cleanup when the idea to create some RStudio addins hit me.

cloc Basics

As noted, you can cloc just about anything. Here are some metrics for dplyr::group_by:

## # A tibble: 1 x 10
##   source language file_count file_count_pct   loc loc_pct blank_lines blank_line_pct comment_lines comment_line_pct
## 1 group… R                 1             1.    44      1.          13             1.           110               1.

and, here’s a similar set of metrics for the whole dplyr package:

## # A tibble: 7 x 11
##   source language file_count file_count_pct   loc loc_pct blank_lines blank_line_pct comment_lines comment_line_pct
## 1 dplyr… R               148        0.454   13216 0.442          2671       0.380             3876          0.673  
## 2 dplyr… C/C++ H…        125        0.383    6687 0.223          1836       0.261              267          0.0464 
## 3 dplyr… C++              33        0.101    4724 0.158           915       0.130              336          0.0583 
## 4 dplyr… HTML             11        0.0337   3602 0.120           367       0.0522              11          0.00191
## 5 dplyr… Markdown          2        0.00613  1251 0.0418          619       0.0880               0          0.     
## 6 dplyr… Rmd               6        0.0184    421 0.0141          622       0.0884            1270          0.220  
## 7 dplyr… C                 1        0.00307    30 0.00100           7       0.000995             0          0.     
## # ... with 1 more variable: pkg 

We can also measure (in bulk) from afar, such as measuring the dplyr git repo:

## # A tibble: 12 x 10
##    source    language     file_count file_count_pct   loc  loc_pct blank_lines blank_line_pct comment_lines
##  1 dplyr.git HTML                108        0.236   21467 0.335           3829       0.270             1114
##  2 dplyr.git R                   156        0.341   13648 0.213           2682       0.189             3736
##  3 dplyr.git Markdown             12        0.0263  10100 0.158           3012       0.212                0
##  4 dplyr.git C/C++ Header        126        0.276    6891 0.107           1883       0.133              271
##  5 dplyr.git CSS                   2        0.00438  5684 0.0887          1009       0.0711              39
##  6 dplyr.git C++                  33        0.0722   5267 0.0821          1056       0.0744             393
##  7 dplyr.git Rmd                   7        0.0153    447 0.00697          647       0.0456            1309
##  8 dplyr.git XML                   1        0.00219   291 0.00454            0       0.                   0
##  9 dplyr.git YAML                  6        0.0131    212 0.00331           35       0.00247             12
## 10 dplyr.git JavaScript            2        0.00438    44 0.000686          10       0.000705             4
## 11 dplyr.git Bourne Shell          3        0.00656    34 0.000530          15       0.00106             10
## 12 dplyr.git C                     1        0.00219    30 0.000468           7       0.000493             0
## # ... with 1 more variable: comment_line_pct

All in on Addins

The Rmd functionality made me realize that some interactive capabilities might be handy, so I threw together three of them.

Two of them extract code chunks from Rmd documents: one uses cloc, the other uses knitr::purl() (h/t @yoniceedee). The knitr one adds in some very nice functionality if you want to preserve chunk options and have “eval=FALSE” chunks commented out.

The final one will gather up code metrics for all the sources in an active project.


If you’d like additional features or want to contribute, give the repo a visit and drop an issue or PR.

Continue Reading…


Read More

Magister Dixit

“We didn’t know in the past that strawberry Pop-Tarts increase in sales, like seven times their normal sales rate, ahead of a hurricane, and the pre-hurricane top-selling item was beer.” Mrs. Dillman (2013)

Continue Reading…


Read More

An East-West less divided?

(This article was first published on R – thinkr, and kindly contributed to R-bloggers)

With tensions heightened recently at the United Nations, one might wonder whether we’ve drawn closer, or farther apart, over the decades since the UN was established in 1945.

We’ll see if we can garner a clue by performing cluster analysis on the General Assembly voting of five of the founding members. We’ll focus on the five permanent members of the Security Council. Then later on we can look at whether Security Council vetoes corroborate our findings.

A prior article, entitled “cluster of six”, employed unsupervised machine learning to discover the underlying structure of voting data. We’ll use related techniques here to explore the voting history of the General Assembly, the only organ of the United Nations in which all 193 member states have equal representation.

By dividing the voting history into two equal parts, which we’ll label as the “early years” and the “later years”, we can assess how our five nations cluster in the two eras.
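A minimal sketch of the split-and-cluster idea, using only base R. The vote matrix is invented, the country labels A/B/C are placeholders, and hierarchical clustering (hclust) stands in for the pam and hcut functions used in the actual analysis.

```r
# Hypothetical roll-call votes: rows are resolutions in date order, columns are
# countries; 1 = yes, 0 = abstain, -1 = no
votes <- cbind(
  A = c( 1,  1, -1,  1,  1,  1, -1,  1),
  B = c( 1,  1, -1,  1,  1,  1, -1,  0),
  C = c(-1, -1,  1, -1,  1,  1, -1,  1)
)

# Split the history into two equal halves
half  <- nrow(votes) %/% 2
early <- votes[seq_len(half), ]
later <- votes[(half + 1):nrow(votes), ]

# Distance between countries = Euclidean distance between their voting columns
d_early <- dist(t(early))
d_later <- dist(t(later))

# Cut a hierarchical clustering into two groups for each era
groups_early <- cutree(hclust(d_early), k = 2)
groups_later <- cutree(hclust(d_later), k = 2)
groups_early
groups_later
```

In this made-up data, C sits far from A and B in the early half and votes identically to A in the later half, so the cluster memberships shift between the two eras.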

During the early years, France, the UK and the US formed one cluster, whilst Russia stood apart.

Although the Republic of China (ROC) joined the UN at its founding in 1945, it’s worth noting that the People’s Republic of China (PRC), commonly called China today, was admitted into the UN in 1971. Hence its greater distance in the clustering.

Through the later years, France and the UK remained close. Not surprising given our EU ties. Will Brexit have an impact going forward?

The US is slightly separated from its European allies, but what’s more striking is the shorter distance between these three and China / Russia. Will globalization continue to bring us closer together, or is the tide about to turn?

The cluster analysis above focused on General Assembly voting. By web-scraping the UN’s Security Council Veto List, we can acquire further insights on the voting patterns of our five nations.

Russia dominated the early vetoes before these dissipated in the late 60s. Vetoes picked up again in the 70s with the US dominating through to the 80s. China has been the most restrained throughout.

Since the 90s, there would appear to be less dividing us, supporting our finding from the General Assembly voting. But do the vetoes in 2017, and so far in 2018, suggest a turning of the tide? Or just a temporary divergence?

R toolkit

R packages and functions (excluding base) used throughout this analysis.

Package | Functions
purrr | map_dbl[3]; map[1]; map2_df[1]; possibly[1]; set_names[1]
XML | readHTMLTable[1]
dplyr | if_else[15]; mutate[9]; filter[6]; select[5]; group_by[3]; summarize[3]; distinct[2]; inner_join[2]; slice[2]; arrange[1]; as_data_frame[1]; as_tibble[1]; data_frame[1]; desc[1]; rename[1]
tibble | as_data_frame[1]; as_tibble[1]; data_frame[1]; enframe[1]; rowid_to_column[1]
stringr | str_c[8]; str_detect[6]; str_replace[3]; fixed[2]; str_remove[2]; str_count[1]
rebus | dgt[1]; literal[1]; lookahead[1]; lookbehind[1]
lubridate | year[7]; dmy[1]; today[1]; ymd[1]
tidyr | spread[3]; gather[2]; unnest[1]
cluster | pam[3]
ggplot2 | aes[6]; ggplot[5]; ggtitle[5]; scale_x_continuous[5]; element_blank[4]; geom_text[4]; geom_line[3]; geom_point[3]; ylim[3]; element_rect[2]; geom_col[2]; labs[2]; scale_fill_manual[2]; theme[2]; coord_flip[1]
factoextra | fviz_cluster[3]; fviz_dend[1]; fviz_silhouette[1]; hcut[1]
cowplot | draw_plot[2]; ggdraw[1]
ggthemes | theme_economist[1]
kableExtra | kable[1]; kable_styling[1]
knitr | kable[1]

View the code here.

Citations / Attributions

R Development Core Team (2008). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, URL

Erik Voeten “Data and Analyses of Voting in the UN General Assembly” Routledge Handbook of International Organization, edited by Bob Reinalda (published May 27, 2013)

The post An East-West less divided? appeared first on thinkr.

Continue Reading…


Read More

MIT Statistics and Data Science MicroMasters

MIT has recently launched the Statistics and Data Science MicroMasters program. The program is a series of online MIT graduate courses offered via EdX. It officially starts in the fall of 2018.


Continue Reading…


Read More

Regularized Prediction and Poststratification (the generalization of Mister P)

This came up in comments recently so I thought I’d clarify the point.

Mister P is MRP, multilevel regression and poststratification. The idea goes like this:

1. You want to adjust for differences between sample and population. Let y be your outcome of interest and X be your demographic and geographic variables you’d like to adjust for. Assume X is discrete so you can define a set of poststratification cells, j=1,…,J (for example, if you’re poststratifying on 4 age categories, 5 education categories, 4 ethnicity categories, and 50 states, then J=4*5*4*50, and the cells might go from 18-29-year-old no-high-school-education whites in Alabama, to over-65-year-old, post-graduate-education latinos in Wyoming). Each cell j has a population N_j from the census.

2. You fit a regression model y | X to data, to get a predicted average response for each person in the population, conditional on their demographic and geographic variables. You’re thus estimating theta_j, for j=1,…,J. The {\em regression} part of MRP comes in because you need to make these predictions.

3. Given point estimates of theta, you can estimate the population average as sum_j (N_j*theta_j) / sum_j (N_j). Or you can estimate various intermediate-level averages (for example, state-level results) using partial sums over the relevant subsets of the poststratification cells.

4. In the Bayesian version (for example, using Stan), you get a matrix of posterior simulations, with each row of the matrix representing one simulation draw of the vector theta; this then propagates to uncertainties in any poststrat averages.

5. The {\em multilevel} part of MRP comes because you want to adjust for lots of cells j in your poststrat, so you’ll need to estimate lots of parameters theta_j in your regression, and multilevel regression is one way to get stable estimates with good predictive accuracy.
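Steps 3 and 4 can be sketched in a few lines of R. This is only an illustration under stated assumptions: the three cells, their counts N, the estimates theta_hat, and the helper name poststratify are all made up, and the simulated matrix theta_sims is a stand-in for real posterior draws (for example from Stan), not the output of any fitted model.

```r
# Hypothetical inputs: 3 poststratification cells with made-up numbers.
N <- c(cell1 = 1000, cell2 = 3000, cell3 = 6000)  # census counts N_j
theta_hat <- c(0.20, 0.50, 0.70)                  # point estimates of theta_j

# Step 3: population average = sum_j(N_j * theta_j) / sum_j(N_j)
poststratify <- function(theta, N) sum(N * theta) / sum(N)
pop_mean <- poststratify(theta_hat, N)

# Intermediate-level average: a partial sum over a subset of cells
# (for example, the cells that belong to one state).
in_subgroup <- c(TRUE, TRUE, FALSE)
sub_mean <- poststratify(theta_hat[in_subgroup], N[in_subgroup])

# Step 4: a matrix of simulation draws, one row per draw of the vector theta.
# These draws are fake (normal noise around theta_hat), standing in for a posterior.
set.seed(1)
n_draws <- 1000
theta_sims <- matrix(
  rnorm(n_draws * length(N), mean = rep(theta_hat, each = n_draws), sd = 0.02),
  nrow = n_draws
)

# Poststratify each draw; the spread of pop_sims is the propagated uncertainty.
pop_sims <- as.vector(theta_sims %*% N) / sum(N)
pop_interval <- quantile(pop_sims, c(0.025, 0.5, 0.975))
```

Because the poststratification is applied row by row, the same theta_sims matrix can be reused for any subgroup average by changing which cells enter the weighted sum.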

OK, fine. The point is: poststratification is key. It’s all about (a) adjusting for many ways in which your sample isn’t representative of the population, and (b) getting estimates for population subgroups of interest.

But it’s not crucial that the theta_j’s be estimated using multilevel regression. More generally, we can use any {\em regularized prediction} method that gives reasonable and stable estimates while including a potentially large number of predictors.

Hence, regularized prediction and poststratification. RPP. It doesn’t sound quite as good as MRP but it’s the more general idea.

The post Regularized Prediction and Poststratification (the generalization of Mister P) appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

Do Clustering by “Dimensional Collapse”

(This article was first published on, and kindly contributed to R-bloggers)


Imagine that someone in a bank wants to find out whether some of the bank’s credit card holders are actually the same person. Based on his experience, he sets a rule: people who share either the same address or the same phone number can reasonably be regarded as the same person. For example:

library(dplyr)    # tibble(), mutate(), the pipe, ...
library(ggplot2)  # used for the plots further down

# tibble() replaces the deprecated data_frame().
a <- tibble(id = 1:16,
            addr = c("a", "a", "a", "b", "b", "c", "d", "d", "d", "e", "e", "f", "f", "g", "g", "h"),
            phone = c(130L, 131L, 132L, 133L, 134L, 132L, 135L, 136L, 137L, 136L, 138L, 138L, 139L, 140L, 141L, 139L),
            flag = c(1L, 1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 3L))


## id   addr    phone   flag
## 1    a   130 1
## 2    a   131 1
## 3    a   132 1
## 4    b   133 2
## 5    b   134 2
## 6    c   132 1 

In the data frame a, the letters in column addr stand for addresses, the numbers in column phone stand for phone numbers, and the integers in column flag are what he wants: the CLUSTER flag, which marks “really” different persons.

The records in a bank

In the above plot, each point stands for an “identity” with an address, shown on the horizontal axis, and a phone number, shown on the vertical axis. The red dotted lines represent the “connections” between identities, that is, a shared address or phone number. The wanted result is the blue rectangles, which circle out the different flags that represent really different persons.

Finding “the same person” is typically a clustering process, and I am sure there are many ways to do it, such as a disjoint-set data structure. But I cannot help thinking that maybe we can do it in a simple way with R. That’s my goal.
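For comparison, here is a minimal base-R sketch of the disjoint-set (union-find) idea mentioned above. It is not the method of this post: the function name cluster_by_links and its internals are assumptions made for illustration, and it skips the usual optimizations (path compression, union by rank).

```r
# Minimal union-find sketch; `cluster_by_links` is a hypothetical helper name.
cluster_by_links <- function(addr, phone) {
  n <- length(addr)
  parent <- seq_len(n)  # every record starts as its own root

  # Follow parent pointers until reaching a root (a record that is its own parent).
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  # Merge every record with the first record that has the same key value.
  link_by <- function(key) {
    first_seen <- match(key, key)  # index of the first record sharing this key
    for (i in seq_len(n)) {
      ra <- find_root(i)
      rb <- find_root(first_seen[i])
      if (ra != rb) parent[max(ra, rb)] <<- min(ra, rb)
    }
  }
  link_by(addr)
  link_by(phone)

  roots <- vapply(seq_len(n), find_root, integer(1))
  match(roots, unique(roots))  # relabel roots as 1, 2, 3, ... in order of appearance
}

# The example records from above.
addr  <- c("a", "a", "a", "b", "b", "c", "d", "d", "d", "e", "e", "f", "f", "g", "g", "h")
phone <- c(130L, 131L, 132L, 133L, 134L, 132L, 135L, 136L, 137L, 136L, 138L, 138L, 139L, 140L, 141L, 139L)
flag_uf <- cluster_by_links(addr, phone)
```

On the example records this should reproduce the target flag column, which makes it a handy cross-check for any other clustering rule.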

“Dimensional Collapse”

When I stared at the plot, I asked myself: why not map the x-axis information of the points to the very first one according to the y-axis “connections”? If everything goes well, all the grey points should be mapped along the red arrows to the first marks of their groups, and only 4 marks should be left on the x-axis (a, b, d and g) instead of the original 8. The y-axis information, having contributed all the “connection rules”, can then be put away, since the remaining x-axis marks are exactly what I want: the final flags. That is why I like to call it “Dimensional Collapse”.

Furthermore, in order to take advantage of R properties, I also:
1. Treat both dimensions as integers by factoring them.
2. Use “integer subsetting” to map and collapse.

Dimensional Collapse

axis_collapse <- function(df, .x, .y) {
    .x <- enquo(.x)
    .y <- enquo(.y)
    # Turn the address and phone number into integer codes.
    # as.integer() is used instead of c(), because c() keeps factors as factors from R 4.1.0 on.
    df <- mutate(df,
                 axis_x = as.integer(factor(!!.x)),
                 axis_y = as.integer(factor(!!.y)))
    oldRule <- seq_len(max(df$axis_x))
    # For every x value, find the smallest x value that shares a y value with it.
    mapRule <- df %>%
      select(axis_x, axis_y) %>%
      group_by(axis_y) %>%
      arrange(axis_x, .by_group = TRUE) %>%
      mutate(collapse = axis_x[1]) %>%
      ungroup() %>%
      select(-axis_y) %>%
      distinct() %>%
      group_by(axis_x) %>%
      arrange(collapse, .by_group = TRUE) %>%
      slice(1) %>%
      ungroup() %>%
      arrange(axis_x) %>%
      pull(collapse)
    # Use integer subsetting to collapse x-axis.
    # In case of indirect "connections", repeat until the mapping stops changing.
    while (TRUE) {
        newRule <- mapRule[oldRule]
        if (identical(newRule, oldRule)) {
            break
        } else {
            oldRule <- newRule
        }
    }
    # Attach the final flags, renumbered 1, 2, 3, ..., and drop the helper columns.
    df <- df %>%
      mutate(flag = newRule[axis_x],
             flag = as.integer(factor(flag))) %>%
      select(-axis_x, -axis_y)
    df
}

Let’s see the result.

a %>%
  rename(flag_t = flag) %>% 
  axis_collapse(addr, phone) %>%
  mutate_at(.vars = vars(addr:flag), factor) %>%
  ggplot(aes(factor(addr), factor(phone), shape = flag_t, color = flag)) +
  geom_point(size = 3) +
  labs(x = "Address", y = "Phone Number", shape = "Target Flag:", color = "Cluster Flag:")

The Clustering result compared to the target

Not bad so far.

Calculation Complexity

Let’s run a simple test of time complexity.

test1 <- tibble(addr = sample(1:1e4, 1e4), phone = sample(1:1e4, 1e4))
test2 <- tibble(addr = sample(1:1e5, 1e5), phone = sample(1:1e5, 1e5))

bm <- microbenchmark::microbenchmark(n10k = axis_collapse(test1, addr, phone),
                                     n100k = axis_collapse(test2, addr, phone),
                                     times = 30)


## expr min lq  mean    median  uq  max neval   cld
## n10k     249.2172    259.918     277.0333    266.9297    279.505     379.4292    30  a
## n100k    2489.1834   2581.731    2640.9394   2624.5741   2723.390    2839.5180   30  b 

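The time consumed appears to grow linearly with the amount of data, holding other conditions unchanged: ten times the rows take about ten times as long (a median of about 267 versus 2625, in the units reported by microbenchmark). That is acceptable.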

More Dimensions?

To me, since this method collapses one dimension by transferring its clustering information to the other dimension, it should be usable recursively on more than 2 dimensions. But I am not 100% sure. Let’s do a simple test.

a %>%
  # I deliberately add a column which connect group 2 and 4 only.
  mutate(other = c(LETTERS[1:14], "D", "O")) %>%
  # use axis_collapse recursively
  axis_collapse(other, phone) %>%
  axis_collapse(flag, addr) %>%
  ggplot(aes(x = factor(addr), y = factor(phone), color = factor(flag))) +
  geom_point(size = 3) +
  labs(x = "Address", y = "Phone Number", color = "Cluster Flag:")

Dimensional Collapse when more than 2 dimensions


Continue Reading…


Read More

Decision Modelling in R Workshop in The Netherlands!

(This article was first published on, and kindly contributed to R-bloggers)

The Decision Analysis in R for Technologies in Health (DARTH) workgroup is hosting a two-day workshop on decision analysis in R in Leiden, The Netherlands from June 7-8, 2018. A one-day introduction to R course will also be offered the day before the workshop, on June 6th.

Decision models are mathematical simulation models that are increasingly being used in health sciences to simulate the impact of policy decisions on population health. New methodological techniques around decision modeling are being developed that rely heavily on statistical and mathematical techniques. R is becoming increasingly popular in decision analysis as it provides a flexible environment where advanced statistical methods can be combined with decision models of varying complexity. Also, the fact that R is freely available improves model transparency and reproducibility.

The workshop will guide participants on building probabilistic decision trees, Markov models and microsimulations, creating publication-quality tabular and graphical output, and will provide a basic introduction to value of information methods and model calibration using R.

For more information and to register, please visit:


Continue Reading…


Read More

Why expensive weddings are a bad idea

GETTING hitched is not cheap. Various estimates put the cost of a typical British wedding at anywhere between £18,000 and £25,000 ($24,200 and $33,700), roughly eight to eleven months of disposable income for the median household.

Continue Reading…


Read More

Book Memo: “Mathematical Optimization Terminology”

A Comprehensive Glossary of Terms
Mathematical Optimization Terminology: A Comprehensive Glossary of Terms is a practical book with the essential formulations, illustrative examples, real-world applications and main references on the topic. This book helps readers gain a more practical understanding of optimization, enabling them to apply it to their algorithms. This book also addresses the need for a practical publication that introduces these concepts and techniques.

Continue Reading…


Read More

Book Memo: “Deep Learning From First Principles”

In vectorized Python, R and Octave
This book derives and builds a multi-layer, multi-unit Deep Learning network from the basics. The first chapter starts with the derivation and implementation of Logistic Regression as a Neural Network. This is followed by building a generic L-Layer Deep Learning Network which performs binary classification. This Deep Learning network is then enhanced to handle multi-class classification, along with the necessary derivations for the Jacobian of softmax and cross-entropy loss. Further chapters cover different initialization types and regularization methods (L2, dropout), followed by gradient descent optimization techniques like Momentum, Rmsprop and Adam. Finally, the technique of gradient checking is elaborated and implemented. All the chapters include implementations in vectorized Python, R and Octave. Detailed derivations are included for each critical enhancement to the Deep Learning network. By the time you reach the last chapter, the implementation includes a fully functional L-Layer Deep Learning network with all the bells and whistles in vectorized Python, R and Octave. The code for all the chapters has been included in the Appendix section.

Continue Reading…


Read More

R Packages worth a look

Advanced ‘tryCatch()’ and ‘try()’ Functions (tryCatchLog)
Advanced tryCatch() and try() functions for better error handling (logging, stack trace with source code references and support for post-mortem analysis).

‘Dat’ Protocol Interface (datr)
Interface with the ‘Dat’ p2p network protocol. Clone archives from the network, share your own files, and install packages from the network.

Sequence Clustering with Discrete-Output HMMs (DBHC)
Provides an implementation of a mixture of hidden Markov models (HMMs) for discrete sequence data in the Discrete Bayesian HMM Clustering (DBHC) algorithm. The DBHC algorithm is an HMM Clustering algorithm that finds a mixture of discrete-output HMMs while using heuristics based on Bayesian Information Criterion (BIC) to search for the optimal number of HMM states and the optimal number of clusters.

Continue Reading…


Read More

wrapr 1.4.1 now up on CRAN

wrapr 1.4.1 is now available on CRAN. wrapr is a really neat R package for organizing, meta-programming, and debugging R code. This update generalizes the dot-pipe feature’s dot S3 features.

Please give it a try!

wrapr is an R package that supplies powerful tools for writing and debugging R code.


Primary wrapr services include:

  • let() (let block)
  • %.>% (dot arrow pipe)
  • build_frame()/draw_frame()
  • qc() (quoting concatenate)
  • := (named map builder)
  • DebugFnW() (function debug wrappers)
  • λ() (anonymous function builder)


let() allows execution of arbitrary code with substituted variable names (note this is subtly different than binding values for names as with base::substitute() or base::with()).

The function is simple and powerful. It treats strings as variable names and re-writes expressions as if you had used the denoted variables. For example the following block of code is equivalent to having written "a + a".


library("wrapr")
a <- 7

let(
  c(VAR = 'a'),
  VAR + VAR
)
 #  [1] 14

This is useful in re-adapting non-standard evaluation interfaces (NSE interfaces) so one can script or program over them.
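To make the substitution idea concrete, here is a rough base-R sketch of what "treat strings as variable names and re-write the expression" can look like. It is an illustration only, not wrapr's implementation: let_sketch is a hypothetical name, and it has none of the checks or options of the real let().

```r
# Base-R illustration only; `let_sketch` is a hypothetical name, not part of wrapr.
let_sketch <- function(alias, expr, envir = parent.frame()) {
  # Turn c(VAR = 'a') into list(VAR = as.name('a')).
  replacements <- lapply(alias, as.name)
  # Capture the expression unevaluated, then swap in the replacement symbols.
  code <- substitute(expr)
  rewritten <- do.call(substitute, list(code, replacements))
  # Evaluate the re-written code (here: a + a) where the caller's variables live.
  eval(rewritten, envir)
}

a <- 7
result <- let_sketch(c(VAR = 'a'), VAR + VAR)
```

The design point is that the name to use ('a') arrives as an ordinary string, so it can come from a function argument or a loop, while the code body is written once against the placeholder VAR.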

We are trying to make let() self teaching and self documenting (to the extent that makes sense). For example, try the argument eval=FALSE to prevent execution and see what would have been executed, or debugPrint=TRUE to have the replaced code printed in addition to being executed:

let(
  c(VAR = 'a'),
  eval = FALSE,
  {
    VAR + VAR
  }
)
 #  {
 #      a + a
 #  }

let(
  c(VAR = 'a'),
  debugPrint = TRUE,
  {
    VAR + VAR
  }
)
 #  $VAR
 #  [1] "a"
 #  {
 #      a + a
 #  }
 #  [1] 14

Please see vignette('let', package='wrapr') for more examples. Some formal documentation can be found here.

For working with dplyr 0.7.* we strongly suggest wrapr::let() (or even an alternate approach called seplyr).

%.>% (dot pipe or dot arrow)

%.>% dot arrow pipe is a pipe with intended semantics:

"a %.>% b" is to be treated approximately as if the user had written "{ . <- a; b };" with "%.>%" being treated as left-associative.

Other R pipes include magrittr and pipeR.
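The intended semantics "{ . <- a; b }" can be imitated in a few lines of base R. The sketch below is an approximation under stated assumptions: %dot% is a hypothetical operator name chosen to avoid clashing with wrapr's %.>%, it binds "." in a fresh child environment rather than in the caller's, and it only handles right-hand sides written as expressions in "." (it does not apply bare function names or perform any of wrapr's checks).

```r
# `%dot%` is a hypothetical operator for illustration; it is not wrapr's %.>%.
`%dot%` <- function(lhs, rhs_expr) {
  rhs_code <- substitute(rhs_expr)         # keep the right-hand side unevaluated
  env <- new.env(parent = parent.frame())  # fresh scope so "." does not leak out
  assign(".", lhs, envir = env)            # the ". <- a" part
  eval(rhs_code, env)                      # the "b" part
}

# User-defined %...% operators are left-associative in R, so stages chain naturally.
res_pipe   <- 4 %dot% sin(.) %dot% exp(.) %dot% cos(.)
res_nested <- cos(exp(sin(4)))

# A stage can be any expression in ".", with no function wrapper needed.
res_square <- 1:4 %dot% .^2
```

Evaluating each stage as an expression in "." is what lets a stage such as .^2 stand on its own.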

The following two expressions should be equivalent:

cos(exp(sin(4)))
 #  [1] 0.8919465

4 %.>% sin(.) %.>% exp(.) %.>% cos(.)
 #  [1] 0.8919465

The notation is quite powerful as it treats pipe stages as expressions parameterized over the variable ".". This means you do not need to introduce functions to express stages. The following is a valid dot-pipe:

1:4 %.>% .^2 
 #  [1]  1  4  9 16

The notation is also very regular as we show below.

1:4 %.>% sin
 #  [1]  0.8414710  0.9092974  0.1411200 -0.7568025
1:4 %.>% sin(.)
 #  [1]  0.8414710  0.9092974  0.1411200 -0.7568025
1:4 %.>% base::sin
 #  [1]  0.8414710  0.9092974  0.1411200 -0.7568025
1:4 %.>% base::sin(.)
 #  [1]  0.8414710  0.9092974  0.1411200 -0.7568025

1:4 %.>% function(x) { x + 1 }
 #  [1] 2 3 4 5
1:4 %.>% (function(x) { x + 1 })
 #  [1] 2 3 4 5

1:4 %.>% { .^2 } 
 #  [1]  1  4  9 16
1:4 %.>% ( .^2 )
 #  [1]  1  4  9 16

Regularity can be a big advantage in teaching and comprehension. Please see "In Praise of Syntactic Sugar" for more details. Some formal documentation can be found here.

  • Some obvious "dot-free" right-hand sides are rejected. Pipelines are meant to move values through a sequence of transforms, and not just for side-effects. Example: 5 %.>% 6 deliberately stops as 6 is a right-hand side that obviously does not use its incoming value. This check is only applied to values, not functions on the right-hand side.
  • Trying to pipe into a "zero argument function evaluation expression" such as sin() is prohibited, as it looks too much like the user declaring that sin() takes no arguments. One must pipe into either a function, a function name, or a non-trivial expression (such as sin(.)). A useful error message is returned to the user: wrapr::pipe does not allow direct piping into a no-argument function call expression (such as "sin()" please use sin(.)).
  • Some reserved words cannot be piped into. For example, 5 %.>% return(.) is prohibited, as the obvious pipe implementation would not actually escape from user functions as users may intend.
  • Obvious de-references (such as $, ::, @, and a few more) on the right-hand side are performed (example: 5 %.>% base::sin(.)).
  • Outer parentheses on the right-hand side are removed (example: 5 %.>% (sin(.))).
  • Anonymous function constructions are evaluated so the function can be applied (example: 5 %.>% function(x) {x+1} returns 6, just as 5 %.>% (function(x) {x+1})(.) does).
  • Checks and transforms are not performed on items inside braces (example: 5 %.>% { function(x) {x+1} } returns function(x) {x+1}, not 6).


build_frame() is a convenient way to type in a small example data.frame in natural row order. This can be very legible and saves having to perform a transpose in one’s head. draw_frame() is the complementary function that formats a given data.frame (and is a great way to produce neatened examples).

x <- build_frame(
   "measure"                   , "training", "validation" |
   "minus binary cross entropy", 5         , -7           |
   "accuracy"                  , 0.8       , 0.6          )
print(x)
 #                       measure training validation
 #  1 minus binary cross entropy      5.0       -7.0
 #  2                   accuracy      0.8        0.6
str(x)
 #  'data.frame':   2 obs. of  3 variables:
 #   $ measure   : chr  "minus binary cross entropy" "accuracy"
 #   $ training  : num  5 0.8
 #   $ validation: num  -7 0.6
cat(draw_frame(x))
 #  build_frame(
 #     "measure"                   , "training", "validation" |
 #     "minus binary cross entropy", 5         , -7           |
 #     "accuracy"                  , 0.8       , 0.6          )

qc() (quoting concatenate)

qc() is a quoting variation on R’s concatenate operator c(). It allows code such as the following:

qc(a = x, b = y)
 #    a   b 
 #  "x" "y"

qc(one, two, three)
 #  [1] "one"   "two"   "three"
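The quoting behaviour shown above can be approximated in base R by capturing the arguments unevaluated and deparsing them. This is a sketch under stated assumptions: qc_sketch is a hypothetical name, and the real qc() covers more cases than this version does.

```r
# `qc_sketch` is a hypothetical name; it only imitates the two calls shown above.
qc_sketch <- function(...) {
  # Capture the arguments as unevaluated expressions (drop the leading `list` symbol).
  args <- as.list(substitute(list(...)))[-1]
  # Turn each expression into its source text; argument names are carried over.
  vapply(args, function(arg) paste(deparse(arg), collapse = ""), character(1))
}

named   <- qc_sketch(a = x, b = y)
unnamed <- qc_sketch(one, two, three)
```

Note that x, y, one, two and three never need to exist as variables, because the arguments are never evaluated.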

:= (named map builder)

:= is the "named map builder". It allows code such as the following:

'a' := 'x'
 #    a 
 #  "x"

The important property of named map builder is it accepts values on the left-hand side allowing the following:

name <- 'variableNameFromElsewhere'
name := 'newBinding'
 #  variableNameFromElsewhere 
 #               "newBinding"

A nice property is that := commutes (in the sense of algebra or category theory) with R’s concatenation function c(). That is, the following two statements are equivalent:

c('a', 'b') := c('x', 'y')
 #    a   b 
 #  "x" "y"

c('a' := 'x', 'b' := 'y')
 #    a   b 
 #  "x" "y"

The named map builder is designed to synergize with seplyr.
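For readers without wrapr installed, the same named vectors can be built in base R with stats::setNames(). The helper name named_map below is hypothetical and exists only to mirror the shape of the := examples above; it also lets one check the "commutes with c()" property directly.

```r
# `named_map` is a hypothetical base-R stand-in for the := examples above.
named_map <- function(names, values) stats::setNames(values, names)

# Names can come from a variable, as with := above.
name <- 'variableNameFromElsewhere'
from_variable <- named_map(name, 'newBinding')

# Build-then-concatenate and concatenate-then-build give the same result.
all_at_once <- named_map(c('a', 'b'), c('x', 'y'))
one_by_one  <- c(named_map('a', 'x'), named_map('b', 'y'))
same_result <- identical(all_at_once, one_by_one)
```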


DebugFnW() wraps a function for debugging. If the function throws an exception the execution context (function arguments, function name, and more) is captured and stored for the user. The function call can then be reconstituted, inspected and even re-run with a step-debugger. Please see our free debugging video series and vignette('DebugFnW', package='wrapr') for examples.
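The capture-and-replay idea can be illustrated with base R's tryCatch(). The sketch below is not DebugFnW(): debug_wrap_sketch, checked_log and the last_failure slot are hypothetical names, and only the function, its name and its arguments are recorded.

```r
# `debug_wrap_sketch` is a hypothetical name; this is not wrapr's implementation.
debug_wrap_sketch <- function(store, fn) {
  fn_name <- deparse(substitute(fn))  # remember how the function was referred to
  force(fn)
  function(...) {
    args <- list(...)
    tryCatch(
      do.call(fn, args),
      error = function(e) {
        # Record what is needed to replay the failing call, then re-raise the error.
        store$last_failure <- list(fn = fn, fn_name = fn_name, args = args)
        stop(e)
      }
    )
  }
}

# A toy function that fails on negative input.
checked_log <- function(x) {
  if (x < 0) stop("negative input")
  log(x)
}

store <- new.env()  # environments are mutable, so the wrapper can write into it
safe_log <- debug_wrap_sketch(store, checked_log)

ok <- safe_log(exp(1))                      # normal calls pass straight through
failed <- try(safe_log(-1), silent = TRUE)  # the error still reaches the caller
captured <- store$last_failure              # the failing call was recorded
```

With the call captured, do.call(captured$fn, captured$args) reproduces the failure on demand, for example after calling debug() on the function.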

λ() (anonymous function builder)

λ() is a concise abstract function creator or "lambda abstraction". It is a placeholder that allows the use of the λ-character for very concise function abstraction.


# Make sure lambda function builder is in our environment.
wrapr::defineLambda()

# square numbers 1 through 4
sapply(1:4, λ(x, x^2))
 #  [1]  1  4  9 16


Install with either:

install.packages("wrapr")

or

# install.packages("devtools")
devtools::install_github("WinVector/wrapr")

More Information

More details on wrapr capabilities can be found in the following two technical articles:


Note: wrapr is meant only for "tame names", that is: variables and column names that are also valid simple (without quotes) R variable names.

Continue Reading…


Read More

What’s new on arXiv

Generalized Structural Causal Models

Structural causal models are a popular tool to describe causal relations in systems in many fields such as economy, the social sciences, and biology. In this work, we show that these models are not flexible enough in general to give a complete causal representation of equilibrium states in dynamical systems that do not have a unique stable equilibrium independent of initial conditions. We prove that our proposed generalized structural causal models do capture the essential causal semantics that characterize these systems. We illustrate the power and flexibility of this extension on a dynamical system corresponding to a basic enzymatic reaction. We motivate our approach further by showing that it also efficiently describes the effects of interventions on functional laws such as the ideal gas law.

Efficient compilation of array probabilistic programs

Probabilistic programming languages are valuable because they allow us to build, run, and change concise probabilistic models that elide irrelevant details. However, current systems are either inexpressive, failing to support basic features needed to write realistic models, or inefficient, taking orders of magnitude more time to run than hand-coded inference. Without resolving this dilemma, model developers are still required to manually rewrite their high-level models into low-level code to get the needed performance. We tackle this dilemma by presenting an approach for efficient probabilistic programming with arrays. Arrays are a key element of almost any realistic model. Our system extends previous compilation techniques from scalars to arrays. These extensions allow the transformation of high-level programs into known efficient algorithms. We then optimize the resulting code by taking advantage of the domain-specificity of our language. We further JIT-compile the final product using LLVM on a per-execution basis. These steps combined lead to significant new opportunities for specialization. The resulting performance is competitive with manual implementations of the desired algorithms, even though the original program is as concise and expressive as the initial model.

Causal Inference from Strip-Plot Designs in a Potential Outcomes Framework

Strip-plot designs are very useful when the treatments have a factorial structure and the factors levels are hard-to-change. We develop a randomization-based theory of causal inference from such designs in a potential outcomes framework. For any treatment contrast, an unbiased estimator is proposed, an expression for its sampling variance is worked out, and a conservative estimator of the sampling variance is obtained. This conservative estimator has a nonnegative bias, and becomes unbiased under between-block additivity, a condition milder than Neymannian strict additivity. A minimaxity property of this variance estimator is also established. Simulation results on the coverage of resulting confidence intervals lend support to theoretical considerations.

GANomaly: Semi-Supervised Anomaly Detection via Adversarial Training

Anomaly detection is a classical problem in computer vision, namely the determination of the normal from the abnormal when datasets are highly biased towards one class (normal) due to the insufficient sample size of the other class (abnormal). While this can be addressed as a supervised learning problem, a significantly more challenging problem is that of detecting the unknown/unseen anomaly case that takes us instead into the space of a one-class, semi-supervised learning paradigm. We introduce such a novel anomaly detection model, by using a conditional generative adversarial network that jointly learns the generation of high-dimensional image space and the inference of latent space. Employing encoder-decoder-encoder sub-networks in the generator network enables the model to map the input image to a lower dimension vector, which is then used to reconstruct the generated output image. The use of the additional encoder network maps this generated image to its latent representation. Minimizing the distance between these images and the latent vectors during training aids in learning the data distribution for the normal samples. As a result, a larger distance metric from this learned data distribution at inference time is indicative of an outlier from that distribution – an anomaly. Experimentation over several benchmark datasets, from varying domains, shows the model efficacy and superiority over previous state-of-the-art approaches.

Matching Consecutive Subpatterns Over Streaming Time Series

Pattern matching of streaming time series with lower latency under limited computing resource comes to a critical problem, especially as the growth of Industry 4.0 and Industry Internet of Things. However, against traditional single pattern matching model, a pattern may contain multiple subpatterns representing different physical meanings in the real world. Hence, we formulate a new problem, called ‘consecutive subpatterns matching’, which allows users to specify a pattern containing several consecutive subpatterns with various specified thresholds. We propose a novel representation Equal-Length Block (ELB) together with two efficient implementations, which work very well under all Lp-Norms without false dismissals. Extensive experiments are performed on synthetic and real-world datasets to illustrate that our approach outperforms the brute-force method and MSM, a multi-step filter mechanism over the multi-scaled representation by orders of magnitude.

The Blessings of Multiple Causes

Causal inference from observation data often assumes ‘strong ignorability,’ that all confounders are observed. This assumption is standard yet untestable. However, many scientific studies involve multiple causes, different variables whose effects are simultaneously of interest. We propose the deconfounder, an algorithm that combines unsupervised machine learning and predictive model checking to perform causal inference in multiple-cause settings. The deconfounder infers a latent variable as a substitute for unobserved confounders and then uses that substitute to perform causal inference. We develop theory for when the deconfounder leads to unbiased causal estimates, and show that it requires weaker assumptions than classical causal inference. We analyze its performance in three types of studies: semi-simulated data around smoking and lung cancer, semi-simulated data around genomewide association studies, and a real dataset about actors and movie revenue. The deconfounder provides a checkable approach to estimating close-to-truth causal effects.

Text classification based on ensemble extreme learning machine

In this paper, we propose a novel approach based on cost-sensitive ensemble weighted extreme learning machine; we call this approach AE1-WELM. We apply this approach to text classification. AE1-WELM is an algorithm including balanced and imbalanced multiclassification for text classification. Weighted ELM assigning the different weights to the different samples improves the classification accuracy to a certain extent, but weighted ELM considers the differences between samples in the different categories only and ignores the differences between samples within the same categories. We measure the importance of the documents by the sample information entropy, and generate cost-sensitive matrix and factor based on the document importance, then embed the cost-sensitive weighted ELM into the AdaBoost.M1 framework seamlessly. Vector space model(VSM) text representation produces the high dimensions and sparse features which increase the burden of ELM. To overcome this problem, we develop a text classification framework combining the word vector and AE1-WELM. The experimental results show that our method provides an accurate, reliable and effective solution for text classification.

Hybrid Adaptive Fuzzy Extreme Learning Machine for text classification

Traditional ELM and its improved versions suffer from the problems of outliers or noise due to overfitting, and of imbalance due to distribution. We propose a novel hybrid adaptive fuzzy ELM (HA-FELM), which introduces a fuzzy membership function to the traditional ELM method to deal with the above problems. We define the fuzzy membership function based not only on the distance between each sample and the center of the class but also on the density among samples, which is based on the quantum harmonic oscillator model. The proposed fuzzy membership function overcomes the shortcoming of the traditional fuzzy membership function and can adjust itself adaptively according to the specific distribution of different samples. Experiments show the proposed HA-FELM can produce better performance than SVM, ELM, and RELM in text classification.

First Experiments with Neural Translation of Informal to Formal Mathematics

We report on our first experiments to train deep neural networks that automatically translate informalized LaTeX-written Mizar texts into the formal Mizar language. Using Luong et al.’s neural machine translation model (NMT), we tested our aligned informal-formal corpora against various hyperparameters and evaluated their results. Our experiments show that NMT is able to generate correct Mizar statements on more than 60 percent of the inference data, indicating that formalization through artificial neural network is a promising approach for automated formalization of mathematics. We present several case studies to illustrate our results.

Weight Initialization in Neural Language Models

Semantic Similarity is an important application which finds its use in many downstream NLP applications. Though the task is mathematically defined, semantic similarity’s essence is to capture the notions of similarity impregnated in humans. Machines use some heuristics to calculate the similarity between words, but these are typically corpus dependent or are useful for specific domains. The difference between Semantic Similarity and Semantic Relatedness motivates the development of new algorithms. For a human, the word car and road are probably as related as car and bus. But this may not be the case for computational methods. Ontological methods are good at encoding Semantic Similarity and Vector Space models are better at encoding Semantic Relatedness. There is a dearth of methods which leverage ontologies to create better vector representations. The aim of this proposal is to explore in the direction of a hybrid method which combines statistical/vector space methods like Word2Vec and Ontological methods like WordNet to leverage the advantages provided by both.

End-to-end Learning of a Convolutional Neural Network via Deep Tensor Decomposition

In this paper we study the problem of learning the weights of a deep convolutional neural network. We consider a network where convolutions are carried out over non-overlapping patches with a single kernel in each layer. We develop an algorithm for simultaneously learning all the kernels from the training data. Our approach dubbed Deep Tensor Decomposition (DeepTD) is based on a rank-1 tensor decomposition. We theoretically investigate DeepTD under a realizable model for the training data where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to planted convolutional kernels. We show that DeepTD is data-efficient and provably works as soon as the sample size exceeds the total number of convolutional weights in the network. We carry out a variety of numerical experiments to investigate the effectiveness of DeepTD and verify our theoretical findings.

Career Transitions and Trajectories: A Case Study in Computing

From artificial intelligence to network security to hardware design, it is well-known that computing research drives many important technological and societal advancements. However, less is known about the long-term career paths of the people behind these innovations. What do their careers reveal about the evolution of computing research? Which institutions were and are the most important in this field, and for what reasons? Can insights into computing career trajectories help predict employer retention? In this paper we analyze several decades of post-PhD computing careers using a large new dataset rich with professional information, and propose a versatile career network model, R^3, that captures temporal career dynamics. With R^3 we track important organizations in computing research history, analyze career movement between industry, academia, and government, and build a powerful predictive model for individual career transitions. Our study, the first of its kind, is a starting point for understanding computing research careers, and may inform employer recruitment and retention mechanisms at a time when the demand for specialized computational expertise far exceeds supply.

A Spline Theory of Deep Networks (Extended Version)

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. Going further, we propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal with each other; this leads to significantly improved classification performance and reduced overfitting with no change to the DN architecture. The spline partition of the input signal space that is implicitly induced by a MASO directly links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up a new geometric avenue to study how DNs organize signals in a hierarchical fashion. To validate the utility of the VQ interpretation, we develop and validate a new distance metric for signals and images that quantifies the difference between their VQ encodings. (This paper is a significantly expanded version of a paper with the same title that will appear at ICML 2018.)
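
The claim that the output is an affine function of the input once the input is fixed can be checked on a toy example. The sketch below is not the paper's code: it assumes a one-hidden-layer ReLU network with made-up weights (W1, b1, W2, b2 are purely illustrative), reads off the activation pattern at a given input, and builds the affine map (A, c) that reproduces the network on that region.

```python
# Toy one-hidden-layer ReLU network. All weights are hypothetical values
# chosen for illustration; they do not come from the paper.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]   # hidden x input
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, -1.0, 2.0]]                        # output x hidden
b2 = [0.25]


def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def forward(x):
    """Ordinary forward pass: W2 * relu(W1 x + b1) + b2."""
    pre = [p + b for p, b in zip(matvec(W1, x), b1)]
    hidden = [max(0.0, p) for p in pre]
    return [o + b for o, b in zip(matvec(W2, hidden), b2)]


def local_affine_map(x):
    """Return (A, c) with forward(x') == A x' + c for every x' that
    activates the same hidden units as x."""
    pre = [p + b for p, b in zip(matvec(W1, x), b1)]
    mask = [1.0 if p > 0 else 0.0 for p in pre]   # activation pattern at x
    n_hidden, n_in, n_out = len(mask), len(x), len(W2)
    # A = W2 diag(mask) W1
    A = [[sum(W2[o][h] * mask[h] * W1[h][i] for h in range(n_hidden))
          for i in range(n_in)] for o in range(n_out)]
    # c = W2 diag(mask) b1 + b2
    c = [sum(W2[o][h] * mask[h] * b1[h] for h in range(n_hidden)) + b2[o]
         for o in range(n_out)]
    return A, c


x = [2.0, 0.5]
A, c = local_affine_map(x)
affine_out = [a + b for a, b in zip(matvec(A, x), c)]
print(forward(x), affine_out)   # the two lists should agree
```

The rows of A play the role of the signal-dependent templates mentioned in the abstract: for a fixed activation pattern, each output is an inner product of the input with one row, plus an offset.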

Accelerating Nonnegative Matrix Factorization Algorithms using Extrapolation

In this paper, we propose a general framework to accelerate significantly the algorithms for nonnegative matrix factorization (NMF). This framework is inspired from the extrapolation scheme used to accelerate gradient methods in convex optimization and from the method of parallel tangents. However, the use of extrapolation in the context of the two-block coordinate descent algorithms tackling the non-convex NMF problems is novel. We illustrate the performance of this approach on two state-of-the-art NMF algorithms, namely, accelerated hierarchical alternating least squares (A-HALS) and alternating nonnegative least squares (ANLS), using synthetic, image and document data sets.

Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Generative Models

In recent years, deep neural network approaches have been widely adopted for machine learning tasks, including classification. However, they were shown to be vulnerable to adversarial perturbations: carefully crafted small perturbations can cause misclassification of legitimate images. We propose Defense-GAN, a new framework leveraging the expressive capability of generative models to defend deep neural networks against such attacks. Defense-GAN is trained to model the distribution of unperturbed images. At inference time, it finds a close output to a given image which does not contain the adversarial changes. This output is then fed to the classifier. Our proposed method can be used with any classification model and does not modify the classifier structure or training procedure. It can also be used as a defense against any attack as it does not assume knowledge of the process for generating the adversarial examples. We empirically show that Defense-GAN is consistently effective against different attack methods and improves on existing defense strategies. Our code has been made publicly available at https://…/defensegan.

Probabilistic Embedding of Knowledge Graphs with Box Lattice Measures

Embedding methods which enforce a partial order or lattice structure over the concept space, such as Order Embeddings (OE) (Vendrov et al., 2016), are a natural way to model transitive relational data (e.g. entailment graphs). However, OE learns a deterministic knowledge base, limiting expressiveness of queries and the ability to use uncertainty for both prediction and learning (e.g. learning from expectations). Probabilistic extensions of OE (Lai and Hockenmaier, 2017) have provided the ability to somewhat calibrate these denotational probabilities while retaining the consistency and inductive bias of ordered models, but lack the ability to model the negative correlations found in real-world knowledge. In this work we show that a broad class of models that assign probability measures to OE can never capture negative correlation, which motivates our construction of a novel box lattice and accompanying probability measure to capture anticorrelation and even disjoint concepts, while still providing the benefits of probabilistic modeling, such as the ability to perform rich joint and conditional queries over arbitrary sets of concepts, and both learning from and predicting calibrated uncertainty. We show improvements over previous approaches in modeling the Flickr and WordNet entailment graphs, and investigate the power of the model.

Birnbaum-Saunders Distribution: A Review of Models, Analysis and Applications

Birnbaum and Saunders introduced a two-parameter lifetime distribution to model fatigue life of a metal, subject to cyclic stress. Since then, extensive work has been done on this model providing different interpretations, constructions, generalizations, inferential methods, and extensions to bivariate, multivariate and matrix-variate cases. More than two hundred papers and one research monograph have already appeared describing all these aspects and developments. In this paper, we provide a detailed review of all these developments and at the same time indicate several open problems that could be considered for further research.

DNN or $k$-NN: That is the Generalize vs. Memorize Question

This paper studies the relationship between the classification performed by deep neural networks and the k-NN decision at the embedding space of these networks. This simple important connection shown here provides a better understanding of the relationship between the ability of neural networks to generalize and their tendency to memorize the training data, which are traditionally considered to be contradicting to each other and here shown to be compatible and complementary. Our results support the conjecture that deep neural networks approach Bayes optimal error rates.
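
For readers who have not seen it spelled out, the k-NN decision in an embedding space is just a majority vote among the nearest training embeddings. The sketch below assumes the embeddings have already been extracted from some layer of a trained network; the vectors, labels, and the function name knn_predict are hypothetical and only illustrate the voting rule, not the paper's experiments.

```python
import math
from collections import Counter


def knn_predict(query, train_embeddings, train_labels, k=3):
    """Majority vote among the k training embeddings closest to `query`
    (Euclidean distance)."""
    by_distance = sorted(
        (math.dist(query, emb), label)
        for emb, label in zip(train_embeddings, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D embeddings forming two well-separated clusters.
train_embeddings = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict((0.5, 0.5), train_embeddings, train_labels))
print(knn_predict((5.5, 5.5), train_embeddings, train_labels))
```

Comparing this vote with the network's own prediction on the same inputs is one way to probe how closely the two decisions agree.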

Revisiting the tree edit distance and its backtracing: A tutorial

Almost 30 years ago, Zhang and Shasha published a seminal paper describing an efficient dynamic programming algorithm computing the tree edit distance, that is, the minimum number of node deletions, insertions, and replacements that are necessary to transform one tree into another. Since then, the tree edit distance has had widespread applications, for example in bioinformatics and intelligent tutoring systems. However, the original paper of Zhang and Shasha can be challenging to read for newcomers and it does not describe how to efficiently infer the optimal edit script. In this contribution, we provide a comprehensive tutorial to the tree edit distance algorithm of Zhang and Shasha. We further prove metric properties of the tree edit distance, and describe efficient algorithms to infer the cheapest edit script, as well as a summary of all cheapest edit scripts between two trees.
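
To make the definition concrete, here is a minimal sketch of the tree edit distance for ordered, labeled trees. It is a naive memoized recursion over forests with unit costs for deletion, insertion, and replacement, not the efficient dynamic program of Zhang and Shasha, and it returns only the distance, not an edit script. The tuple representation and the function names are assumptions made for this example.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples keep everything hashable for the cache.


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f and not g:
        return 0
    if not g:
        # Delete the rightmost root of f; its children take its place.
        _, children = f[-1]
        return forest_distance(f[:-1] + children, g) + 1
    if not f:
        # Insert the rightmost root of g.
        _, children = g[-1]
        return forest_distance(f, g[:-1] + children) + 1
    (f_label, f_children), (g_label, g_children) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + f_children, g) + 1
    insert = forest_distance(f, g[:-1] + g_children) + 1
    # Match the two rightmost roots: align their children, align the rest.
    replace = (forest_distance(f_children, g_children)
               + forest_distance(f[:-1], g[:-1])
               + (0 if f_label == g_label else 1))
    return min(delete, insert, replace)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def leaf(label):
    return (label, ())


# f(d(a, c(b)), e) versus f(c(d(a, b)), e)
t1 = ("f", (("d", (leaf("a"), ("c", (leaf("b"),)))), leaf("e")))
t2 = ("f", (("c", (("d", (leaf("a"), leaf("b"))),)), leaf("e")))
print(tree_edit_distance(t1, t2))
```

For this pair the cheapest script deletes the node c from the first tree and inserts a new c above d, so the distance is 2.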

Neural language representations predict outcomes of scientific research

Many research fields codify their findings in standard formats, often by reporting correlations between quantities of interest. But the space of all testable correlates is far larger than scientific resources can currently address, so the ability to accurately predict correlations would be useful to plan research and allocate resources. Using a dataset of approximately 170,000 correlational findings extracted from leading social science journals, we show that a trained neural network can accurately predict the reported correlations using only the text descriptions of the correlates. Accurate predictive models such as these can guide scientists towards promising untested correlates, better quantify the information gained from new findings, and have implications for moving artificial intelligence systems from predicting structures to predicting relationships in the real world.

Improving End-of-turn Detection in Spoken Dialogues by Detecting Speaker Intentions as a Secondary Task
Analogical Reasoning on Chinese Morphological and Semantic Relations
Convolutional Social Pooling for Vehicle Trajectory Prediction
R2-based hypervolume contribution approximation in multi-objective optimization
Emergence of Benford’s Law in Classical Music
Market Self-Learning of Signals, Impact and Optimal Trading: Invisible Hand Inference with Free Energy
The Crossing Number of Single-Pair-Seq-Shellable Drawings of Complete Graphs
Survival probability in Generalized Rosenzweig-Porter random matrix ensemble
Understanding Federation: An Analytical Framework for the Interoperability of Social Networking Sites
Semi-parametric Bayesian change-point model based on the Dirichlet process
A new convexity-based inequality, characterization of probability distributions and some free-of-distribution tests
QuaterNet: A Quaternion-based Recurrent Model for Human Motion
Valid and Approximately Valid Confidence Intervals for Current Status Data
Deconvolution of dust mixtures by latent Dirichlet allocation in forensic science
Utility maximization with proportional transaction costs under model uncertainty
QoE-Aware Beamforming Design for Massive MIMO Heterogeneous Networks
Remote Source Coding under Gaussian Noise : Dueling Roles of Power and Entropy Power
Composite Semantic Relation Classification
Semantic Relatedness for All (Languages): A Comparative Analysis of Multilingual Semantic Relatedness Using Machine Translation
Improving the Gaussian Mechanism for Differential Privacy: Analytical Calibration and Optimal Denoising
Beyond 5G with UAVs: Foundations of a 3D Wireless Cellular Network
Modeling Naive Psychology of Characters in Simple Commonsense Stories
Are BLEU and Meaning Representation in Opposition?
Direct transcription methods based on fractional integral approximation formulas for solving nonlinear fractional optimal control problems
Dancing Pigs or Externalities? Measuring the Rationality of Security Decisions
Joint Classification and Prediction CNN Framework for Automatic Sleep Stage Classification
Defoiling Foiled Image Captions
Graph-Based Resource Allocation with Conflict Avoidance for V2V Broadcast Communications
A Deep Ensemble Model with Slot Alignment for Sequence-to-Sequence Natural Language Generation
Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples
Exponential Integrators with Parallel-in-Time Rational Approximations for Climate and Weather Simulations
Recurrent Neural Network for Learning DenseDepth and Ego-Motion from Video
Extensions of Ramanujan’s reciprocity theorem and the Andrews–Askey integral
DeepGlobe 2018: A Challenge to Parse the Earth through Satellite Images
NPE: Neural Personalized Embedding for Collaborative Filtering
Content-based Popularity Prediction of Online Petitions Using a Deep Regression Model
Identification of the source of an interferer by comparison with known carriers using a single satellite
Gauss summation and Ramanujan type series for $1/π$
Caching With Time-Varying Popularity Profiles: A Learning-Theoretic Perspective
Some inequalities for Garvan’s bicrank function of 2-colored partitions
On the edge Szeged index of unicyclic graphs with given diameter
ADMM and Accelerated ADMM as Continuous Dynamical Systems
Classification of Coxeter groups with finitely many elements of $\mathbf{a}$-value 2
Cooperative Limited Feedback Design for Massive Machine-Type Communications
$W^{2,p}$-solutions of parabolic SPDEs in general domains
Deep Reinforcement Learning for Network Slicing
Cross-Target Stance Classification with Self-Attention Networks
Leveraging Social Signal to Improve Item Recommendation for Matrix Factorization
Covariance-Insured Screening
ARUM: Polar Coded HARQ Scheme based on Incremental Channel Polarization
Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data
A Formulation of Recursive Self-Improvement and Its Possible Efficiency
Antenna Switching Sequence Design for Channel Sounding in a Fast Time-varying Channel
Optimization of Transfer Learning for Sign Language Recognition Targeting Mobile Platform
Taxi demand forecasting: A HEDGE based tessellation strategy for improved accuracy
Generative networks as inverse problems with Scattering transforms
Implementation of True Random Number Generator based on Double-Scroll Attractor circuit with GST memristor emulator
Structure-preserving Guided Retinal Image Filtering and Its Application for Optic Disc Analysis
Analysis of Noise in Current Mirrors with memristive Device
UAV-Aided 5G Communications with Deep Reinforcement Learning Against Jamming
Fast Entropy Estimation for Natural Sequences
Widlar Current Mirror Design Using BJT-Memristor Circuits
How to Dimension 5G Network When Users Are Distributed on Roads Modeled by Poisson Line Process
Independent Component Analysis via Energy-based and Kernel-based Mutual Dependence Measures
Testing for Conditional Mean Independence with Covariates through Martingale Difference Divergence
Joint direct estimation of 3D geometry and 3D motion using spatio temporal gradients
Memristor-based Approximation of Gaussian Filter
Performance Analysis and Optimization of Cooperative Full-Duplex D2D Communication Underlaying Cellular Networks
Implementation of Memristor in Bessel filter with RLC components
Spontaneous synchronization and nonequilibrium statistical mechanics of coupled phase oscillators
Extrapolation in NLP
Day-ahead electricity price forecasting with high-dimensional structures: Univariate vs. multivariate modeling frameworks
Deep-learning Based Modeling of Fault Detachment Stability for Power Grid
Single Shot Active Learning using Pseudo Annotators
Evolutionary RL for Container Loading
Classifying medical relations in clinical text via convolutional neural networks
LQ-optimal Sample-data Control under Stochastic Delays: Gridding Approach for Stabilizability and Detectability
Realizing Wireless Communication through Software-defined HyperSurface Environments
Detecting cyber threats through social network analysis: short survey
Analyzing order flows in limit order books with ratios of Cox-type intensities
Happy family of stable marriages
A Note on Polynomial Identity Testing for Depth-3 Circuits
Hierarchical Beamforming: Resource Allocation, Fairness and Flow Level Performance
Disentangling $α$ and $β$ relaxation in orientationally disordered crystals with theory and experiments
Fuzzy Membership Function Implementation with Memristor
Dual parameterization of Weighted Coloring
Data-Driven Nonlinear Identification of Li-Ion Battery Based on a Frequency Domain Nonparametric Analysis
Super Ricci flows for weighted graphs
Systematic encoders for generalized Gabidulin codes and the $q$-analogue of Cauchy matrices
Brownian Motions on Metric Graphs with Non-Local Boundary Conditions I: Characterization
Bounds for the smallest $k$-chromatic graphs of given girth
High-dimensional doubly robust tests for regression parameters
Density for solutions to stochastic differential equations with unbounded drift
Fréchet differentiable drift dependence of Perron–Frobenius and Koopman operators for non-deterministic dynamics
Exploiting the Superposition Property of Wireless Communication for Max-Consensus Problems in Multi-Agent Systems
A Distributed Algorithm for Finding Hamiltonian Cycles in Random Graphs in O(log n) Time
Data-Driven Chance Constrained Optimization under Wasserstein Ambiguity Sets
On a probabilistic Nyman-Beurling criterion for the Riemann hypothesis
A Robust Background Initialization Algorithm with Superpixel Motion Detection
An experiment-oriented analysis of 2D spin-glass dynamics: a twelve time-decades scaling study
Minimum Margin Loss for Deep Face Recognition
Action Completion: A Temporal Model for Moment Detection
Interpolatron: Interpolation or Extrapolation Schemes to Accelerate Optimization for Deep Neural Networks
Circularly Pulse-Shaped Precoding for OFDM: A New Waveform and Its Optimization Design for 5G New Radio
Situation Assessment for Planning Lane Changes: Combining Recurrent Models and Prediction
Faster Rates for Convex-Concave Games
Adaptive Discrete Second Order Sliding Mode Control with Application to Nonlinear Automotive Systems
Dependability in a Multi-tenant Multi-framework Deep Learning as-a-Service Platform
Supplier Cooperation in Drone Delivery
Counting Gallai 3-colorings of complete graphs
Pattern Recognition on Oriented Matroids: Symmetric Cycles in the Hypercube Graphs. III
Recursive parameter estimation in a Riemannian manifold
Annotating Electronic Medical Records for Question Answering
External memory BWT and LCP computation for sequence collections with applications
Learning Time-Sensitive Strategies in Space Fortress
On two consequences of Berge-Fulkerson conjecture
Disparity Sliding Window: Object Proposals From Disparity Images
An extension of the Plancherel measure
Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis
Deleting edges to restrict the size of an epidemic in temporal networks
A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics
Methods for the inclusion of real world evidence in network meta-analysis
RotDCF: Decomposition of Convolutional Filters for Rotation-Equivariant Deep Networks
Quantitative structure of stable sets in finite abelian groups
Edge-statistics on large graphs
Mixed integer linear programming: a new approach for instrumental variable quantile regressions and related problems
Answer Set Programming Modulo ‘Space-Time’
Design Identification of Curve Patterns on Cultural Heritage Objects: Combining Template Matching and CNN-based Re-Ranking
Resource allocation under uncertainty: an algebraic and qualitative treatment
Optimal Scheduling and Exact Response Time Analysis for Multistage Jobs
Coding for Interactive Communication with Small Memory and Applications to Robust Circuits
NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning
It’s all Relative: Monocular 3D Human Pose Estimation from Weakly Supervised Data
Changing Observations in Epistemic Temporal Logic

inline 0.3.15

(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)

A maintenance release of the inline package arrived on CRAN today. inline facilitates writing code in-line in simple string expressions or short files. The package is mature and in maintenance mode: Rcpp used it greatly for several years but then moved on to Rcpp Attributes so we have a much more limited need for extensions to inline. But a number of other packages have a hard dependence on it, so we do of course look after it as part of the open source social contract (which is a name I just made up, but you get the idea…)

This release was triggered by a (as usual very reasonable) CRAN request to update the per-package manual page which had become stale. We now use Rd macros; you can see the diff for just that file at GitHub, and I also include it below. My pkgKitten package-creation helper uses the same scheme, and I wholeheartedly recommend it: as the diff shows, it makes things a lot simpler.

Some other changes reflect two user-contributed pull requests, as well as standard minor package update issues. See below for a detailed list of changes extracted from the NEWS file.

Changes in inline version 0.3.15 (2018-05-18)

  • Correct requireNamespace() call thanks (Alexander Grueneberg in #5).

  • Small simplification to .travis.yml; also switch to https.

  • Use seq_along instead of seq(along=...) (Watal M. Iwasaki in #6).

  • Update package manual page using references to DESCRIPTION file [CRAN request].

  • Minor packaging updates.

Courtesy of CRANberries, there is a comparison to the previous release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Magister Dixit

“It’s impossible for the improbable to never occur.” Emil Gumbel

May 18, 2018

Document worth reading: “Sequence-Aware Recommender Systems”

Recommender systems are one of the most successful applications of data mining and machine learning technology in practice. Academic research in the field is historically often based on the matrix completion problem formulation, where for each user-item-pair only one interaction (e.g., a rating) is considered. In many application domains, however, multiple user-item interactions of different types can be recorded over time. And, a number of recent works have shown that this information can be used to build richer individual user models and to discover additional behavioral patterns that can be leveraged in the recommendation process. In this work we review existing works that consider information from such sequentially-ordered user-item interaction logs in the recommendation process. Based on this review, we propose a categorization of the corresponding recommendation tasks and goals, summarize existing algorithmic solutions, discuss methodological approaches when benchmarking what we call sequence-aware recommender systems, and outline open challenges in the area. Sequence-Aware Recommender Systems

Because it's Friday: Laurel or Yanny

I can only assume you've heard about this already: it's gone wildly viral in the USA, and I assume elsewhere in the world. But it's a lovely example of an auditory illusion, and as regular readers know I like to collect such things (like this one and this one).

Kottke has a useful roundup of the best scientific coverage of why some people hear "laurel" and others hear "yanny". Linguist Rachel Gutman also provides a detailed analysis for The Atlantic. For the record though, all I can hear is "laurel", and that indeed is what the actor said for a recording for

That's all from us for this week. Next week I'll be at the ROpenSci unconference in Seattle, but we'll be back with more on the blog on Monday. In the meantime have a great weekend!

Distilled News

How to install R on Windows, Mac OS X and Ubuntu

I am a professor and I don’t like to spend one hour of the class or workshop installing R. To tackle that problem I polished these instructions over time by doing experiments with different machines in order to maximize the in-classroom experience. What you’ll find here is a collection of lines of code that work when installing R, and of course I tried a lot of things that didn’t work before obtaining the actual result. This tutorial covers the installation of R on Microsoft Windows, Mac OS X and Ubuntu Linux.

Advances in Semantic Textual Similarity

The recent rapid progress of neural network-based natural language understanding research, especially on learning semantic text representations, can enable truly novel products such as Smart Compose and Talk to Books. It can also help improve performance on a variety of natural language tasks which have limited amounts of training data, such as building strong text classifiers from as few as 100 labeled examples. Below, we discuss two papers reporting recent progress on semantic representation research at Google, as well as two new models available for download on TensorFlow Hub that we hope developers will use to build new and exciting applications.

Optimization Using R

Optimization is a technique for finding the best solution to a given problem from among all possible solutions. It uses a rigorous mathematical model to identify the most efficient solution to that problem.

9 Must-have skills you need to become a Data Scientist

1. Education
2. R Programming
3. Python Coding
4. Hadoop Platform
5. SQL Database/Coding
6. Apache Spark
7. Machine Learning and AI
8. Data Visualization
9. Unstructured data
10. Intellectual curiosity
11. Business acumen
12. Communication skills
13. Teamwork

Kernel Machine Learning (KernelML) – Generalized Machine Learning Algorithm

This article introduces a pip Python package called KernelML, created to give analysts and data scientists a generalized machine learning algorithm for complex loss functions and non-linear coefficients.

Automated Feature Selection using bounceR

From a very philosophical point of view, as humans evolve we tend to automate repetitive tasks in order to waste our time with more pleasant matters. The same holds true for the field of data science as a whole, as much as for many tasks at STATWORX. What started as a super fancy and fun profession quickly became tedious work. When the first data science prophets faced their first projects, they realized that rather than coding fancy models the entire day, you are stuck cleaning data, building and selecting features, selecting algorithms and so on. As data scientists as a species become more and more evolved, no wonder they are trying to automate boring tasks. Working in data science for quite a while now, we see that trend as well. Not only because of companies like, the automation of data science is progressing with an incredible pace.

If you did not already know

Distributional Adversarial Networks
We propose a framework for adversarial training that relies on a sample rather than a single sample point as the fundamental unit of discrimination. Inspired by discrepancy measures and two-sample tests between probability distributions, we propose two such distributional adversaries that operate and predict on samples, and show how they can be easily implemented on top of existing models. Various experimental results show that generators trained with our distributional adversaries are much more stable and are remarkably less prone to mode collapse than traditional models trained with pointwise prediction discriminators. The application of our framework to domain adaptation also results in considerable improvement over recent state-of-the-art. …

Online Generative Discriminative Restricted Boltzmann Machine (OGD-RBM)
We propose a novel online learning algorithm for Restricted Boltzmann Machines (RBM), namely, the Online Generative Discriminative Restricted Boltzmann Machine (OGD-RBM), that provides the ability to build and adapt the network architecture of RBM according to the statistics of streaming data. The OGD-RBM is trained in two phases: (1) an online generative phase for unsupervised feature representation at the hidden layer and (2) a discriminative phase for classification. The online generative training begins with zero neurons in the hidden layer, adds and updates the neurons to adapt to statistics of streaming data in a single pass unsupervised manner, resulting in a feature representation best suited to the data. The discriminative phase is based on stochastic gradient descent and associates the represented features to the class labels. We demonstrate the OGD-RBM on a set of multi-category and binary classification problems for data sets having varying degrees of class-imbalance. We first apply the OGD-RBM algorithm on the multi-class MNIST dataset to characterize the network evolution. We demonstrate that the online generative phase converges to a stable, concise network architecture, wherein individual neurons are inherently discriminative to the class labels despite unsupervised training. We then benchmark OGD-RBM performance to other machine learning, neural network and ClassRBM techniques for credit scoring applications using 3 public non-stationary two-class credit datasets with varying degrees of class-imbalance. We report that OGD-RBM improves accuracy by 2.5-3% over batch learning techniques while requiring at least 24%-70% fewer neurons and fewer training samples. This online generative training approach can be extended greedily to multiple layers for training Deep Belief Networks in non-stationary data mining applications without the need for a priori fixed architectures. …

Nonlinear Variable Selection based on Derivatives (NVSD)
We investigate structured sparsity methods for variable selection in regression problems where the target depends nonlinearly on the inputs. We focus on general nonlinear functions not limiting a priori the function space to additive models. We propose two new regularizers based on partial derivatives as nonlinear equivalents of group lasso and elastic net. We formulate the problem within the framework of learning in reproducing kernel Hilbert spaces and show how the variational problem can be reformulated into a more practical finite dimensional equivalent. We develop a new algorithm derived from the ADMM principles that relies solely on closed forms of the proximal operators. We explore the empirical properties of our new algorithm for Nonlinear Variable Selection based on Derivatives (NVSD) on a set of experiments and confirm favourable properties of our structured-sparsity models and the algorithm in terms of both prediction and variable selection accuracy. …

What’s in a food truck

Food trucks are the real deal these days. The best ones serve a specialized menu really well, in a small, focused space. The Washington Post delves into the insides of several of these trucks and how they make the food with very specific equipment.

Pursue a Stanford Data Science Certificate

With our online graduate courses and certificates, you can earn a higher education credential from Stanford while still maintaining your career.

Science and Technology links (May 18th, 2018)

  1. How is memory encoded in your brain? If you are like me, you assume that it is encoded in the manner in which your brain cells are connected together. Strong and weak connections between brain cells create memories. Some people think that it is not how memories are encoded.

    To prove that it is otherwise, scientists have recently transferred memories between snails by injections of small molecules taken from a trained snail. Maybe one day you could receive new memories through an injection. If true, this result is probably worth a Nobel prize. It is probably not true.

  2. Inflammation is a crude and undiscerning immune response that your body uses when it has nothing better to offer. One of the reasons aspirin is so useful is that it tends to reduce inflammation. There are many autoimmune diseases that can be described as “uncontrolled inflammation”. For example, many people suffer from psoriasis: their skin peels off and becomes sensitive. Richards et al. believe that most neurodegenerative diseases (such as Alzheimer’s, Parkinson’s, ALS) are of a similar nature:

    it is now clear that inflammation is (…) likely, a principal underlying cause in most if not all neurodegenerative diseases

  3. Scientists are sounding the alarm about the genetic tinkering carried out in garages and living rooms.
  4. The more intelligent you are, the less connected your brain cells are:

    (…)individuals with higher intelligence are more likely to have larger gray matter volume (…) intelligent individuals, despite their larger brains, tend to exhibit lower rates of brain activity during reasoning (…) higher intelligence in healthy individuals is related to lower values of dendritic density and arborization. These results suggest that the neuronal circuitry associated with higher intelligence is organized in a sparse and efficient manner, fostering more directed information processing and less cortical activity during reasoning.

  5. It is known that alcohol consumption has a protective effect on your heart. What about people who drink too much? A recent study found that patients with a troublesome alcohol history have a significantly lower prevalence of cardiovascular disease events, even after adjusting for demographic and traditional risk factors. Please note that it does not imply that drinking alcohol will result in a healthier or longer life.
  6. A third of us have high blood pressure. And most of us are not treated for it.
  7. Eating lots of eggs every day is safe. Don’t be scared of their cholesterol. (Credit: Mangan.)
  8. According to data collected by NASA, global temperatures have fallen for the last two years. This is probably due to the El Nino effect that caused record temperatures two years ago. What is interesting to me is that these low global temperatures get no mention at all in the press whereas a single high temperature record (like what happened two years ago) gets the front page.

    That’s a problem in my opinion. You might think that by pushing aside data that could be misinterpreted, you are protecting the public. I don’t think it works that way. People are less stupid and more organized than you might think. They will find the data, they will talk among themselves, and they will lose confidence in you (rightly so). The press and the governments should report that the temperatures are decreasing… and then explain why it does not mean that the Earth is not warming anymore.

    The Earth is definitely getting warmer, at a rate of about 0.15 degrees per decade. Your best bet is to report the facts.

  9. Low-carbohydrate, high-fat diets might prevent cancer progression.
  10. Participating in the recommended 150 minutes of moderate to vigorous activity each week, such as brisk walking or biking, in middle age may be enough to reduce your heart failure risk by 31 percent. (There is no strong evidence currently that people who exercise live longer. It does seem that they are more fit, however.)


Microsoft Weekly Data Science News for May 18, 2018

Here are the latest articles from Microsoft regarding cloud data science products and updates.

  • Azure Content Spotlight – What’s New with Cognitive Services: This week’s content spotlight is all about Azure Cognitive Services. Seth Juarez’s AI Show on Channel 9 provides regular updates on all the new AI features on the Azure platform, including Cognitive Services. See below a collection of the latest videos …
  • A Scalable End-to-End Anomaly Detection System using Azure Batch AI: This post is authored by Said Bleik, Senior Data Scientist at Microsoft. In a previous post I showed how Batch AI can be used to train many anomaly detection models in parallel for IoT scenarios … several Azure cloud services and Python code that …
  • Azure.Source – Volume 31: In addition, Cognitive Services add pre-built, cloud-hosted APIs for developers to add AI capabilities, including new services announced at Build. This post also covers Cognitive Search and Azure Machine Learning (ML) advancements. The Microsoft data …
  • Azure Stack: the last mile in Hybrid Cloud: These include Microsoft Azure Cognitive Services, exceptionally large HDInsight environments, and Microsoft Azure Data Lake Store. Services which are best consumed in a Hyperscale Cloud will run on Azure, while services that best fit an enterprise …
  • Using Azure for Machine Learning: I’m interested in learning more about AI, Data Science, and Machine Learning to improve … other interesting and useful products such as Microsoft IoT Hub, SQL Database, and Cognitive Services which I use a lot for Pantrylogs. You can really play …
  • Use AU Analyzer for faster, lower cost Data Lake Analytics: Do you use Data Lake Analytics and wonder how many Analytics Units your jobs should have been assigned? Do you want to see if your job could consume a little less time or money? The recently-announced AU Analyzer tool can help you today! See our recent …
  • Simple and robust way to operationalise Spark models on Azure: It gives you everything that Open Source Spark does and then some. I’ve been especially enjoying the effortless ways to move large datasets around and the ease of MLlib for my AI-projects. One of the questions with the simpler models like regressions and …
  • New AI Services in Azure for students and academics announced at Build 2018: 1. Object Detection update to custom vision (preview) 2. Video Indexer (Paid Preview) 1. Bot Builder SDK v4 (preview) Bot Builder homepage or the Bot Builder …
  • How Azure IoT helped me buy a new house – Part 1: … shares a personal story on how he used Azure IoT to figure out a solution to a problem that many of us face – high electric bills. In the series, Steve shares the process and code that he used to implement this solution. Telemetry data is an important …



Kernel Machine Learning (KernelML) - Generalized Machine Learning Algorithm

This article introduces a pip Python package called KernelML, created to give analysts and data scientists a generalized machine learning algorithm for complex loss functions and non-linear coefficients.


YouTube videos on database management, SQL, Datawarehousing, Business Intelligence, OLAP, Big Data, NoSQL databases, data quality, data governance and Analytics – free

Watch over 20 hours of YouTube videos on databases and database design, Physical Data Storage, Transaction Management and Database Access, and Data Warehousing, Data Governance and (Big) Data Analytics - all free.


What Makes a Song (More) Popular

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

Earlier this week, the Association for Psychological Science sent out a press release about a study examining what makes a song popular:

Researchers Jonah Berger of the University of Pennsylvania and Grant Packard of Wilfrid Laurier University were interested in understanding the relationship between similarity and success. In a recent study published in Psychological Science, the authors describe how a person’s drive for stimulation can be satisfied by novelty. Cultural items that are atypical, therefore, may be more liked and become more popular.

“Although some researchers have argued that cultural success is impossible to predict,” they explain, “textual analysis of thousands of songs suggests that those whose lyrics are more differentiated from their genres are more popular.”

The study, which was published online ahead of print, used a method of topic modeling called latent Dirichlet allocation. (Side note, this analysis is available in the R topicmodels package, as function LDA. It requires a document term matrix, which can be created in R. Perhaps a future post!) The LDA extracted 10 topics from the lyrics of songs spanning seven genres (Christian, country, dance, pop, rap, rock, and rhythm and blues):

  • Anger and violence
  • Body movement
  • Dance moves
  • Family
  • Fiery love
  • Girls and cars
  • Positivity
  • Spiritual
  • Street cred
  • Uncertain love
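
As a rough illustration of the side note above, here is a minimal sketch of building a document term matrix and fitting an LDA in R. It assumes the tm and topicmodels packages are installed; the four "lyrics" are invented toy strings, not data from the study, and two topics are used only because the toy corpus is tiny.

```r
library(tm)           # corpus handling and DocumentTermMatrix()
library(topicmodels)  # LDA()

# Invented toy lyrics (hypothetical, NOT the Billboard data used in the study)
lyrics <- c(
  "dance all night move your body on the floor",
  "move your feet and dance until the morning light",
  "lord hear my prayer and lift my spirit higher",
  "pray for grace and let my spirit sing"
)

# One row per song, one column per term
corpus <- VCorpus(VectorSource(lyrics))
dtm <- DocumentTermMatrix(
  corpus,
  control = list(tolower = TRUE, removePunctuation = TRUE, stopwords = TRUE)
)

# Fit a 2-topic model; the study itself extracted 10 topics
fit <- LDA(dtm, k = 2, control = list(seed = 1))

terms(fit, 3)  # top 3 terms per topic
topics(fit)    # most likely topic for each song
```

With real lyrics you would want far more documents and some thought about preprocessing (stemming, rare-term removal) before trusting the topics.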

Overall, they found that songs with lyrics that differentiated them from other songs in their genre were more popular. However, this wasn’t the case for the specific genres of pop and dance, where lyrical differentiation appeared to be harmful to popularity. Finally, being lyrically different by being more similar to a different genre (a genre to which the song wasn’t assigned) had no impact. So it isn’t about writing a rock song that sounds like a rap song to gain popularity; it’s about writing a rock song that sounds different from other rock songs.

I love this study idea, especially since I’ve started doing some text and lyric analysis on my own. (Look for another one Sunday, tackling the concept of sentiment analysis!) But I do have a criticism. This research used songs listed in the Billboard Top 50 by genre. While it would be impossible to analyze every single song that comes out at a given time, this study doesn’t really answer the question of what makes a song popular, but what determines how popular an already popular song is. The advice in the press release (To Climb the Charts, Write Lyrics That Stand Out) may be true for established artists who are already popular, but it doesn’t help that young artist trying to break onto the scene. They’re probably already writing lyrics to try to stand out. They just haven’t been noticed yet.


Bacon Bytes for 18-May

Bacon Bytes


Welcome to the latest edition of Bacon Bytes. This week we talk a bit about Amazon, Alexa, and why you shouldn’t photocopy your behind at office Christmas parties.


Alexa and Siri Can Hear This Hidden Command. You Can’t.

As if we didn’t have enough to worry about when it comes to Alexa commands and security, now we have to worry about dog-whistle commands being sent to our devices. Fun fact uncovered in this report: It is not illegal to send subliminal messages to another person, or device. Broadcast networks discourage the practice, but there is no law on the books. Yet another example of how technology moves too fast for lawmakers. Well, in this case, it’s only been 70+ years for television. Makes me wonder what the networks have been doing that we don’t know about. Time to watch “They Live” again, I guess.


EFAIL describes vulnerabilities in the end-to-end encryption technologies OpenPGP and S/MIME

Nice write up describing the flaw in email encryption along with possible mitigations for you to try right now. Here’s another idea: stop thinking that anything you do online is protected. When you send an email, a copy of that email is likely sitting on every server involved in routing the message. Encrypted or not, that’s a lot more exposure than you wanted to know about today.


Digital Photocopiers Loaded With Secrets

And with pictures from the office Christmas party, I’m certain.


You can find whereabouts of any cellphone within seconds

There are many legitimate cases where you would want law enforcement to be able to track a person to a specific location through their cellphone. The trouble with this service is that it can be abused by law enforcement officials, as is the case in this story. But it’s a good reminder that no matter how smart you are with your smartphone privacy, it is likely reporting your location in ways you are not allowed to disable.


Amazon has finished visiting the top 20 contenders for its new HQ

Worst episode of The Bachelor, ever.


Amazon Prime customers to get discounts at Whole Foods

Looking to justify their $20 annual membership fee, Amazon is adding services and benefits for Prime members. I’d be excited for this perk if there is a 10% discount on meat but I suspect it’s going to be on items that aren’t selling and nobody wants.


The Entire Economy Is MoviePass Now. Enjoy It While You Can

This article outlines the business model for 98% of every idea that comes out of Silicon Valley. In an effort to build up as many customers as possible, they give away their product at a loss. Once they show how many new customers are signing up, they get more funding. They then attract more customers, allowing them to collect more data, which they can then sell in order to recover some of their losses. Then, they alter their product offerings, due to “rising costs”, and hope they get more funding, or just get bought.

See you next week!

The post Bacon Bytes for 18-May appeared first on Thomas LaRock.


How heavy use of social media is linked to mental illness

MAY 20th will mark the end of “mental-health awareness week”, a campaign run by the Mental Health Foundation, a British charity. Roughly a quarter of British adults have been diagnosed at some point with a psychiatric disorder, costing the economy an estimated 4.5% of GDP per year.


How To Plot With Dygraphs: Exercises

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


The dygraphs package is an R interface to the dygraphs JavaScript charting library. It provides rich facilities for charting time-series data in R, including:

1. Automatically plots xts time-series objects (or any object convertible to xts.)

2. Highly configurable axis and series display (including optional second Y-axis.)

3. Rich interactive features, including zoom/pan and series/point highlighting.

4. Display upper/lower bars (ex. prediction intervals) around the series.

5. Various graph overlays, including shaded regions, event lines, and point annotations.

6. Use at the R console, just like conventional R plots (via RStudio Viewer.)

7. Seamless embedding within R Markdown documents and Shiny web applications.

Before proceeding, please follow our short tutorial.

Look at the examples given and try to understand the logic behind them. Then, try to solve the exercises below by using R without looking at the answers. Then, check the solutions to check your answers.
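
To warm up before the exercises, here is a minimal sketch of the basic pattern, deliberately using a different data-set so it does not give away any answers. It assumes the dygraphs package is installed; nhtemp is a built-in R time series, and the axis labels are my own wording.

```r
library(dygraphs)

# nhtemp ships with R: mean annual temperature in New Haven, 1912-1971
temp_plot <- dygraph(nhtemp, xlab = "Year", ylab = "Temperature (F)")

# Every dy*() function takes a dygraph and returns a modified one,
# so options can be layered step by step (or chained with %>%)
temp_plot <- dyOptions(temp_plot, drawPoints = TRUE, pointSize = 2)

# Print temp_plot in an interactive session to see it in the Viewer
```

The exercises below follow the same shape: start from dygraph() and add one dy*() call per requirement.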

Exercise 1

Unite the two time series data-sets mdeaths and fdeaths and create a time-series dygraph of the new data-set.

Exercise 2

Insert a date range selector into the dygraph you just created.

Exercise 3

Change the label names of “mdeaths” and “fdeaths” to “Male” and “Female.”

Exercise 4

Make the graph stacked.

Exercise 5

Set the date range selector height to 20.

Exercise 6

Add a main title to your graph.

Exercise 7

Use the tutorial’s predicted data-set to create a dygraph of “lwr”, “fit”, and “upr”, but display the label as the summary of them.

Exercise 8

Set the colors to red.

Exercise 9

Remove the x-axis grid lines from your graph.

Exercise 10

Remove the y-axis grid lines from your graph.


Optimization Using R

Optimization is a technique for finding the best possible solution to a given problem from among all the possible solutions. Optimization uses a rigorous mathematical model to find the most efficient solution to the given problem.
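
As a small illustration of the idea (not taken from the linked article), base R already ships two general-purpose optimizers: optimize() for one variable and optim() for several. The cost functions below are toy examples chosen because their true minima are known.

```r
# One variable: the minimum of (x - 2)^2 + 1 is at x = 2
cost <- function(x) (x - 2)^2 + 1
best_1d <- optimize(cost, interval = c(-10, 10))
best_1d$minimum    # close to 2
best_1d$objective  # close to 1

# Two variables: the Rosenbrock function has its minimum at (1, 1)
rosenbrock <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
best_2d <- optim(par = c(-1.2, 1), fn = rosenbrock, method = "BFGS")
best_2d$par          # close to c(1, 1)
best_2d$convergence  # 0 means the optimizer reports success
```

Both functions only search numerically from the interval or starting point you give them, so for harder problems the answer can depend on where you start.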


Awesome data visualization tool for brain research

When I was visiting the University of Washington the other day, Ariel Rokem showed me this cool data visualization and exploration tool produced by Jason Yeatman, Adam Richie-Halford, Josh Smith, and himself. The above image gives a sense of the dashboard but the real thing is much more impressive because it’s interactive. You can rotate that brain image.

And here’s a research paper describing what they did.

The post Awesome data visualization tool for brain research appeared first on Statistical Modeling, Causal Inference, and Social Science.


When Comparing Blockchains, Decentralization Comes in Degrees

Word clouds are an easy way to see which topics within any subject are most prominent. Generate a word cloud from the Reddit cryptocurrency forum or an array of blockchain whitepapers, and one will quickly see that a single term – “decentralized” – is the biggest and boldest. It’s no

The post When Comparing Blockchains, Decentralization Comes in Degrees appeared first on Dataconomy.


‘LMX ot NOSJ!’ Interchanging Classic Data Formats With Single `blackmagic` Incantations

(This article was first published on R –, and kindly contributed to R-bloggers)

The D.C. Universe magic hero Zatanna used spells (i.e. incantations) to battle foes and said spells were just sentences said backwards, hence the mixed up jumble in the title. But, now I’m regretting not naming the package zatanna and reversing the function names to help ensure they’re only used deliberately & carefully. You’ll see why in a bit.

Just like their ore-seeking speleological counterparts, workers in our modern day data mines process a multitude of mineralic data formats to achieve our goals of world domination (strike that: finding meaning, insight & solutions to hard problems).

Two formats in particular are common occurrences in many of our $DAYJOBs: XML and JSON. The rest of this (hopefully short-ish) post is going to assume you have at least a passing familiarity with — if not full-on battle scars from working with — them.

XML and JSON are, in many ways, very similar. This similarity is on purpose since JSON was originally created to make it easier to process data in browsers and help make data more human-readable. If your $DAYJOB involves processing small or large streams of nested data, you likely prefer JSON over XML.

There are times, though, that — even if one generally works with only JSON data — one comes across a need to ingest some XML and turn it into JSON. This was the case for a question-poser on Stack Overflow this week (I won’t point-shill with a direct link but it’ll be easy to find if you are interested in this latest SODD package).

Rather than take on the potentially painful task of performing the XML to JSON transformation on their own the OP wished for a simple incantation to transform the entirety of the incoming XML into JSON.

We’ll switch comic universes for a moment to issue a warning that all magic comes with a price. And, the cost for automatic XML<->JSON conversion can be quite high. XML has namespaces, attributes, tags and values and requires schemas to convey data types and help validate content structure. JSON has no attributes, implicitly conveys types and is generally schema-less (though folks have bolted on that concept).

If one is going to use magic for automatic data conversion there must be rules (no, not those kind of Magic rules), otherwise how various aspects of XML become encoded into JSON (and the reverse) will generate inconsistency and may even result in significant data corruption. Generally speaking, you are always better off writing your own conversion utility vs. relying on specific settings in a general conversion script/function. However, if your need is a one-off (which anyone who has been doing this type of work for a while knows is also generally never the case) you may have cause to throw caution to the wind, get your quick data fix, and move on. If that is the case, the blackmagic package may be of use to you.

gnitrevnoC eht ANAI sserddA ecapS yrtsigeR ot NOSJ

One file that’s in XML that I only occasionally have to process is the IANA IPv4 Address Space Registry. If you visited that link you may have uttered “Hey! That’s not XML it’s HTML!”, to wit — I would respond — “Well, HTML is really XML anyway, but use the View Source, Luke! and see that it is indeed XML with some clever XSL style sheet processing being applied in-browser to make the gosh awful XML human readable.”.

With blackmagic we can make quick work of converting this monstrosity into JSON.

The blackmagic package itself uses even darker magic to accomplish its goals. The package is just a thin V8 wrapper around the xml-js JavaScript library. Because of this, it is recommended that you do not try to process gigabytes of XML with it as there is a round trip of data marshalling between R and the embedded V8 engine.

library(purrr) # provides %>%, compact() and map(), all used below
requireNamespace("jsonlite") # jsonlite::flatten clobbers purrr::flatten in the wrong order so I generally fully-qualify what I need
## Loading required namespace: jsonlite
library(blackmagic) # devtools::install_github("hrbrmstr/blackmagic")
requireNamespace("dplyr") # I'm going to fully qualify use of dplyr::data_frame() below
## Loading required namespace: dplyr

You can thank @yoniceedee for the URL processing capability in blackmagic:

source_url <- ""

iana_json <- blackmagic::xml_to_json(source_url)

# NOTE: cat the whole iana_json locally to see it — perhaps to file="..." vs clutter your console
cat(substr(iana_json, 1800, 2300))
## me":"prefix","elements":[{"type":"text","text":"000/8"}]},{"type":"element","name":"designation","elements":[{"type":"text","text":"IANA - Local Identification"}]},{"type":"element","name":"date","elements":[{"type":"text","text":"1981-09"}]},{"type":"element","name":"status","elements":[{"type":"text","text":"RESERVED"}]},{"type":"element","name":"xref","attributes":{"type":"note","data":"2"}}]},{"type":"element","name":"record","elements":[{"type":"element","name":"prefix","elements":[{"type":"

By the hoary hosts of Hoggoth that's not very "human readable"! And, it looks super-verbose. Thankfully, Yousuf Almarzooqi knew we'd want to fine-tune the output and we can use those options to make this a bit better:

blackmagic::xml_to_json(
  doc = source_url, 
  spaces = 2,                # Number of spaces to be used for indenting the output
  compact = FALSE,           # Whether to produce detailed object or compact object
  ignoreDeclaration = TRUE   # No declaration property will be generated.
) -> iana_json

# NOTE: cat the whole iana_json locally to see it — perhaps to file="..." vs clutter your console
cat(substr(iana_json, 3000, 3300))
## pe": "element",
##               "name": "prefix",
##               "elements": [
##                 {
##                   "type": "text",
##                   "text": "000/8"
##                 }
##               ]
##             },
##             {
##               "type": "element",
##               "name": "designation",

One "plus side" for doing the mass-conversion is that we don't really need to do much processing to have it be "usable" data in R:

blackmagic::xml_to_json(
  doc = source_url, 
  compact = FALSE,        
  ignoreDeclaration = TRUE
) -> iana_json

# NOTE: consider taking some more time to explore this monstrosity than this
str(processed <- jsonlite::fromJSON(iana_json), 3)
## List of 1
##  $ elements:'data.frame':    3 obs. of  5 variables:
##   ..$ type       : chr [1:3] "instruction" "instruction" "element"
##   ..$ name       : chr [1:3] "xml-stylesheet" "oxygen" "registry"
##   ..$ instruction: chr [1:3] "type=\"text/xsl\" href=\"ipv4-address-space.xsl\"" "RNGSchema=\"ipv4-address-space.rng\" type=\"xml\"" NA
##   ..$ attributes :'data.frame':  3 obs. of  2 variables:
##   .. ..$ xmlns: chr [1:3] NA NA ""
##   .. ..$ id   : chr [1:3] NA NA "ipv4-address-space"
##   ..$ elements   :List of 3
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ :'data.frame':  280 obs. of  4 variables:

# drop NULL entries, keep the first six, and show the structure three levels deep
compact(processed$elements$elements[[3]]$elements) %>% 
  head(6) %>% 
  str(3)
## List of 6
##  $ :'data.frame':    1 obs. of  2 variables:
##   ..$ type: chr "text"
##   ..$ text: chr "IANA IPv4 Address Space Registry"
##  $ :'data.frame':    1 obs. of  2 variables:
##   ..$ type: chr "text"
##   ..$ text: chr "Internet Protocol version 4 (IPv4) Address Space"
##  $ :'data.frame':    1 obs. of  2 variables:
##   ..$ type: chr "text"
##   ..$ text: chr "2018-04-23"
##  $ :'data.frame':    3 obs. of  4 variables:
##   ..$ type      : chr [1:3] "text" "element" "text"
##   ..$ text      : chr [1:3] "Allocations to RIRs are made in line with the Global Policy published at " NA ". \nAll other assignments require IETF Review."
##   ..$ name      : chr [1:3] NA "xref" NA
##   ..$ attributes:'data.frame':   3 obs. of  2 variables:
##   .. ..$ type: chr [1:3] NA "uri" NA
##   .. ..$ data: chr [1:3] NA "" NA
##  $ :'data.frame':    3 obs. of  4 variables:
##   ..$ type      : chr [1:3] "text" "element" "text"
##   ..$ text      : chr [1:3] "The allocation of Internet Protocol version 4 (IPv4) address space to various registries is listed\nhere. Origi"| __truncated__ NA " documents most of these allocations."
##   ..$ name      : chr [1:3] NA "xref" NA
##   ..$ attributes:'data.frame':   3 obs. of  2 variables:
##   .. ..$ type: chr [1:3] NA "rfc" NA
##   .. ..$ data: chr [1:3] NA "rfc1466" NA
##  $ :'data.frame':    5 obs. of  4 variables:
##   ..$ type      : chr [1:5] "element" "element" "element" "element" ...
##   ..$ name      : chr [1:5] "prefix" "designation" "date" "status" ...
##   ..$ elements  :List of 5
##   .. ..$ :'data.frame':  1 obs. of  2 variables:
##   .. ..$ :'data.frame':  1 obs. of  2 variables:
##   .. ..$ :'data.frame':  1 obs. of  2 variables:
##   .. ..$ :'data.frame':  1 obs. of  2 variables:
##   .. ..$ : NULL
##   ..$ attributes:'data.frame':   5 obs. of  2 variables:
##   .. ..$ type: chr [1:5] NA NA NA NA ...
##   .. ..$ data: chr [1:5] NA NA NA NA ...

As noted previously, all magic comes with a price and we just traded XML processing for some gnarly list processing. This isn't the case for all XML files and you can try to tweak the parameters to xml_to_json() to make the output more usable (NOTE: key name transformation parameters still need to be implemented in the package), but this seems a whole lot easier (to me):

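library(xml2)

doc <- read_xml(source_url)

xml_ns_strip(doc) # drop the default namespace so the plain XPath queries below match

dplyr::data_frame(
  prefix = xml_find_all(doc, ".//record/prefix") %>% xml_text(),
  designation = xml_find_all(doc, ".//record/designation") %>% xml_text(),
  date = xml_find_all(doc, ".//record/date") %>% 
    xml_text() %>% 
    sprintf("%s-01", .) %>% # registry dates are YYYY-MM; pad to a full date
    as.Date(),
  whois = xml_find_all(doc, ".//record") %>% 
    map(xml_find_first, "./whois") %>% # not every record has a whois element
    map_chr(xml_text),
  status = xml_find_all(doc, ".//record/status") %>% xml_text()
)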
## # A tibble: 256 x 5
##    prefix designation                     date       whois        status  
##  1 000/8  IANA - Local Identification     1981-09-01          RESERVED
##  2 001/8  APNIC                           2010-01-01 whois.apnic… ALLOCAT…
##  3 002/8  RIPE NCC                        2009-09-01 whois.ripe.… ALLOCAT…
##  4 003/8  Administered by ARIN            1994-05-01 whois.arin.… LEGACY  
##  5 004/8  Level 3 Parent, LLC             1992-12-01 whois.arin.… LEGACY  
##  6 005/8  RIPE NCC                        2010-11-01 whois.ripe.… ALLOCAT…
##  7 006/8  Army Information Systems Center 1994-02-01 whois.arin.… LEGACY  
##  8 007/8  Administered by ARIN            1995-04-01 whois.arin.… LEGACY  
##  9 008/8  Administered by ARIN            1992-12-01 whois.arin.… LEGACY  
## 10 009/8  Administered by ARIN            1992-08-01 whois.arin.… LEGACY  
## # ... with 246 more rows


xml_to_json() has a sibling function --- json_to_xml() for the reverse operation and you're invited to fill in the missing parameters with a PR as there is a fairly consistent and straightforward way to do that. Note that a small parameter tweak can radically change the output, which is one of the aforementioned potentially costly pitfalls of this automagic conversion.

Before using either function, seriously consider taking the time to write a dedicated, small package that exposes a function or two to perform the necessary conversions.
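
To make that advice a little more concrete, here is a minimal sketch of a hand-rolled conversion, assuming the xml2 and jsonlite packages. The XML snippet is a made-up, two-record imitation of the registry format (not the real file), and the choice to emit a flat array of prefix/status objects is mine, not something blackmagic does.

```r
library(xml2)
library(jsonlite)

# Hypothetical miniature document shaped like the registry records above
xml_txt <- paste0(
  "<registry>",
  "<record><prefix>000/8</prefix><status>RESERVED</status></record>",
  "<record><prefix>003/8</prefix><status>LEGACY</status></record>",
  "</registry>"
)

doc <- read_xml(xml_txt)
records <- xml_find_all(doc, ".//record")

# Decide explicitly which XML pieces become which JSON fields
rows <- data.frame(
  prefix = xml_text(xml_find_first(records, "./prefix")),
  status = xml_text(xml_find_first(records, "./status")),
  stringsAsFactors = FALSE
)

json <- toJSON(rows, pretty = TRUE)
cat(json)
```

Because every field is named in the code, a change in the source XML shows up as a missing value you can test for, rather than as a silently reshaped JSON tree.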


R Improvements for Bio7 2.8

(This article was first published on R – Bio7 Website, and kindly contributed to R-bloggers)


The next release of Bio7 adds a lot of new R features and improvements. One minor change is that the default perspective after the startup of Bio7 now is the R perspective to emphasize the importance of R within this software.

The R-Shell view has been simplified and the R templates have been moved into their own simple view for improved usability (see screenshot from R perspective below).

In addition the context menu has been enhanced to allow the creation of submenus from scripts found in folders and subfolders (recursively added) which you can create for a menu structure.
Scripts can be created in R, JavaScript, Groovy, Jython, BeanShell, and ImageJ Macros.
Java (with dependent classes) can be dynamically compiled and executed like a script, too.

Several improvements have also been added to the R-Shell and the R editor for easier generation of valid R code. The R-Shell and the R editor now display R workspace objects with their class and structure in the code completion dialog (marked with a new workspace icon – see below).


R editor:

In the R editor a new quick fix function has been added to detect and install missing packages (from scanned default packages folder of an R installation – has to be enabled in the Bio7 R code analysis preferences).

Missing package imports can also be detected and fixed (when a function is called and the package that provides it is installed, but the package declaration is missing from the code).

The code assistance in the R-Shell and in the R editor now offers completions for, e.g., dataframes (columns) in the %>% operator of piped function calls:

In addition code assistance is available for list, vectors, dataframes and arrays of named rows and columns, etc., when available in the current R environment.

Code completion for package functions can now easily be added with the R-Shell or the R editor, which loads the package function help for both interfaces. The editor will automatically be updated (see updated editor marking unknown functions in screencast below).

Numerous other features, improvements and bugfixes have been added, too.

Bio7 2.8 will hopefully be available soon at:

Overview videos on YouTube




Four short links: 18 May 2018

Efficient Meetings, Mixed Reality in Unity, Design Power, and AI's Exponential Curve of Cost

  1. Reaching Peak Meeting Efficiency -- solid advice for business meetings, including a taxonomy with some firm opinions.
  2. United Mixed Reality Toolkit -- a collection of scripts and components intended to accelerate development of applications targeting Microsoft HoloLens and Windows Mixed Reality headsets in Unity. See blog post.
  3. Reddit's New Design Increases Power Consumption by 68GW/Month -- this Reddit user shows their working.
  4. AI and Compute (OpenAI) -- since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.5 month-doubling time (by comparison, Moore’s Law had an 18-month doubling period).
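
As a rough check on the doubling-time comparison in item 4, the arithmetic is easy to sketch: a quantity that doubles every d months grows by a factor of 2^(12/d) per year. The helper name below is made up for illustration.

```r
# Yearly growth factor implied by a doubling time given in months
# (hypothetical helper, not from the OpenAI post)
growth_per_year <- function(doubling_months) 2^(12 / doubling_months)

growth_per_year(3.5)  # roughly 10.8x per year
growth_per_year(18)   # roughly 1.6x per year
```

So a 3.5-month doubling time means about an order of magnitude more compute each year, against well under a doubling per year for an 18-month period.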


Automated Feature Selection using bounceR

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

Automated Data Science

From a very philosophical point of view, as humans evolve we tend to automate repetitive tasks in order to waste our time with more pleasant matters. The same holds true for the field of data science as a whole, as much as for many tasks at STATWORX. What started as a super fancy and fun profession quickly became tedious work. When the first data science prophets faced their first projects, they realized that rather than coding fancy models the entire day, you are stuck cleaning data, building and selecting features, selecting algorithms and so on. As data scientists as a species become more and more evolved, no wonder they are trying to automate boring tasks. Working in data science for quite a while now, we see that trend as well. Not only because of companies like, the automation of data science is progressing at an incredible pace.

Automation at STATWORX

Most data scientists do what they do out of passion for the entire idea of gathering information from messy data. The same applies to us. Thus, when we are not busy working on projects, we are reading up on the latest developments in the industry. This inspires us to keep automating certain tasks that we find to be extremely time consuming every single time. For one, there is data prep. Everyone who has had to clean up and merge data to make it somewhat interpretable to a machine knows what I am talking about. The next task is feature engineering. Man, that takes time, you know what I mean. Building all kinds of interaction terms and mathematical transformations by hand is quite an effort. The next step, the selection of relevant features, is extremely time consuming. Besides taking a lot of time, your feature selection procedure is one of the most important aspects of a solid anti-overfitting campaign. The next task is of course the selection of a feasible algorithm. This, again, is very tedious and will probably make it into one of our other articles. This one, though, is devoted to our approach to solving feature selection in a fully automated fashion.

Feature Selection at STATWORX

Most of the time when we face new data, we are, let's say, "charmingly uninformed" about the actual meaning of the data. Thus, we talk to our business partners (they are the ones with the expertise), visualize information, and find statistical relationships to somehow make sense of the data. However, this knowledge often does not suffice to select all relevant features to solve a forecasting or prediction problem. Thus, we developed an automated way to help us solve our feature selection issues. Our selection approach relies on the cleverness of componentwise boosting and the genius learning procedure of backpropagation. We put everything together in a nice little R package, so that the community can challenge our approach. Sure, it is not the only way to select features, and sure, it is probably not the one solution to select them all. However, for all our use cases in which we have a lot of features and few observations, it is working exceedingly well. In fact, in a controlled simulation environment, we can see that our algorithm is much better at selecting relevant features than methods such as correlation-covariance filters, maximum relevance – minimum redundancy filters, random forest, penalized linear models, etc. We are currently working on a more generalizable simulation study, so stay tuned and check this blog from time to time, because I will be getting back to this.

bounceR for real now!

Before I start talking about all the stuff we are going to do, I'd rather show you what we have done so far. The algorithm is quite simple, really. By the way, I gave a talk on this recently, so you can check that out on YouTube. So, how does the algorithm work? I am putting some pseudo code below, so you guys can check it out.

[Figure: bounceR algorithm pseudo code]

Looks legit, right? In principle, what it does is split the feature space into small chunks of features with bootstrapped observations. And it does so very often, to cover as many combinations as possible. Then, it evaluates the subsets and selects the most relevant features in every subset. The outcome of each subset is then aggregated into a global distribution. What we are essentially interested in is this aggregated distribution. So basically we ask the question: if we simulate little datasets with randomly drawn features and bootstrapped observations, which features will survive in this setting? Features that survive many of these little simulations are likely to serve in the final model. If you want to have a close look at the code, you should check out our GitHub repo.
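
The resampling idea described above can be sketched in a few lines. This is not the bounceR implementation: the package is written in R and evaluates subsets with componentwise boosting, whereas this illustration swaps in a plain absolute-correlation ranking as the scoring step. The function names, parameters (`rounds`, `subset_size`, `keep`), and the toy data are all assumptions made for the example.

```python
import random
from collections import Counter

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def survival_rates(X, y, rounds=300, subset_size=5, keep=2, seed=0):
    """Share of rounds each feature survived, among the rounds it was drawn in.

    X is a dict {feature_name: list of values}; y is the list of targets.
    """
    rng = random.Random(seed)
    names = list(X)
    n = len(y)
    drawn, survived = Counter(), Counter()
    for _ in range(rounds):
        subset = rng.sample(names, subset_size)        # random chunk of features
        rows = [rng.randrange(n) for _ in range(n)]    # bootstrap: rows with replacement
        y_boot = [y[i] for i in rows]
        # Stand-in scoring step (the package uses componentwise boosting here).
        scores = {f: abs(pearson([X[f][i] for i in rows], y_boot)) for f in subset}
        winners = sorted(subset, key=scores.get, reverse=True)[:keep]
        drawn.update(subset)
        survived.update(winners)
    # Aggregate the per-subset outcomes into one global distribution.
    return {f: survived[f] / drawn[f] for f in names if drawn[f]}

# Toy data (made up): 20 features, only x0 and x1 drive the target.
data_rng = random.Random(42)
n_obs, n_feat = 200, 20
X = {f"x{j}": [data_rng.gauss(0, 1) for _ in range(n_obs)] for j in range(n_feat)}
y = [2.0 * a - 2.0 * b + data_rng.gauss(0, 0.5) for a, b in zip(X["x0"], X["x1"])]

rates = survival_rates(X, y)
top = sorted(rates, key=rates.get, reverse=True)[:2]
print(top)
```

On this toy data the two informative features should survive almost every round they are drawn in, while noise features only survive when they happen to land in a subset without a real signal.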


A fully automated feature selection, of course, is just one module in our stack of automated data science tools. Writing about automated data science and about automating my job, I cannot help but wonder about my job security. So, if you are looking for someone with the brightness to make his or her own job obsolete, give me a call… 😉

About the author

Lukas Strömsdörfer

Lukas is part of the Data Science team and is currently pursuing an external doctorate at the University of Göttingen. In his free time he is a passionate cyclist and enjoys watching TV series.

The post Automated Feature Selection using bounceR first appeared on STATWORX.


If you did not already know

Distance Metric Learning (DML)
Distance metric learning (DML), which learns a distance metric from labeled ‘similar’ and ‘dissimilar’ data pairs, is widely utilized. Recently, several works investigate orthogonality-promoting regularization (OPR), which encourages the projection vectors in DML to be close to being orthogonal, to achieve three effects: (1) high balancedness — achieving comparable performance on both frequent and infrequent classes; (2) high compactness — using a small number of projection vectors to achieve a ‘good’ metric; (3) good generalizability — alleviating overfitting to training data. While showing promising results, these approaches suffer three problems. First, they involve solving non-convex optimization problems where achieving the global optimal is NP-hard. Second, it lacks a theoretical understanding why OPR can lead to balancedness. Third, the current generalization error analysis of OPR is not directly on the regularizer. In this paper, we address these three issues by (1) seeking convex relaxations of the original nonconvex problems so that the global optimal is guaranteed to be achievable; (2) providing a formal analysis on OPR’s capability of promoting balancedness; (3) providing a theoretical analysis that directly reveals the relationship between OPR and generalization performance. Experiments on various datasets demonstrate that our convex methods are more effective in promoting balancedness, compactness, and generalization, and are computationally more efficient, compared with the nonconvex methods. …

Deep Learning Library (DLL)
Deep Learning Library (DLL) is a new library for machine learning with deep neural networks that focuses on speed. It supports feed-forward neural networks such as fully-connected Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs). It also has very comprehensive support for Restricted Boltzmann Machines (RBMs) and Convolutional RBMs. Our main motivation for this work was to propose and evaluate novel software engineering strategies with potential to accelerate runtime for training and inference. Such strategies are mostly independent of the underlying deep learning algorithms. On three different datasets and for four different neural network models, we compared DLL to five popular deep learning frameworks. Experimentally, it is shown that the proposed framework is systematically and significantly faster on CPU and GPU. In terms of classification performance, similar accuracies as the other frameworks are reported. …

Method of Codifferential Descent (MCD)
The method of codifferential descent (MCD) was developed by Professor V.F. Demyanov for solving a large class of nonsmooth nonconvex optimization problems.
“Generalised Method of Codifferential Descent”


Distilled News

Oracle acquires machine learning platform!!

Oracle announced today that it has acquired, a privately held cloud workspace platform for data science projects and workloads. Financial terms of the deal were not disclosed. In the near term, not much will change for customers of — it will continue to offer the same products and services to partners post-acquisition. But Oracle envisions combining its Cloud Infrastructure service with’s tools for a single, unified machine learning solution. “Every organization is now exploring data science and machine learning as a key way to proactively develop competitive advantage, but the lack of comprehensive tooling and integrated machine learning capabilities can cause these projects to fall short,” Amit Zavery, vice president of Oracle’s Cloud Platform, said in a statement. “With the combination of Oracle and, customers will be able to harness a single data science platform to more effectively leverage machine learning and big data for predictive analysis and improved business results.”

Introduction to Loss Functions

The loss function is the bread and butter of modern Machine Learning; it takes your algorithm from theoretical to practical and transforms neural networks from glorified matrix multiplication into Deep Learning. This post will explain the role of loss functions and how they work, while surveying a few of the most popular of the past decade.

Deep Learning Scaling is Predictable, Empirically

Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art. This paper presents a large scale empirical characterization of generalization error and model size growth as training sets grow. We introduce a methodology for this measurement and test four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents—the ‘steepness’ of the learning curve—yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications on deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.
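
As a rough illustration of what a power-law learning curve means in practice, the sketch below assumes the form error ≈ alpha * m^beta (m being the training set size) and recovers alpha and beta with a least-squares line in log-log space. The data points are synthetic and invented for the example, not taken from the paper, and the function name is hypothetical.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(alpha) + beta * log(size)."""
    xs = [math.log(m) for m in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    alpha = math.exp(my - beta * mx)
    return alpha, beta

# Synthetic, noise-free learning curve: error = 5 * m^(-0.3)
sizes = [1_000, 10_000, 100_000, 1_000_000]
errors = [5.0 * m ** -0.3 for m in sizes]

alpha, beta = fit_power_law(sizes, errors)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
```

In this picture, beta is the 'steepness' of the learning curve, and a change that only shifts the error changes alpha while leaving beta alone.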

An Introduction to Deep Learning for Tabular Data

There is a powerful technique that is winning Kaggle competitions and is widely used at Google (according to Jeff Dean), Pinterest, and Instacart, yet that many people don’t even realize is possible: the use of deep learning for tabular data, and in particular, the creation of embeddings for categorical variables. Despite what you may have heard, you can use deep learning for the type of data you might keep in a SQL database, a Pandas DataFrame, or an Excel spreadsheet (including time-series data). I will refer to this as tabular data, although it can also be known as relational data, structured data, or other terms (see my twitter poll and comments for more discussion).

Recommendation System in R

Recommender systems are used to predict the best products to offer to customers. These babies have become extremely popular in virtually every single industry, helping customers find products they’ll like. Most people are familiar with the idea, and nearly everyone is exposed to several forms of personalized offers and recommendations each day (Google search ads being among the biggest sources). Building recommendation systems is part science, part art, and many have become extremely sophisticated. Such a system might seem daunting to the uninitiated, but it’s actually fairly straightforward to get started if you’re using the right tools. This is a post about building recommender systems in R.

Frequencies in Pandas — and a Little R Magic for Python

I’ve got a big digital mouth. Last time, I wrote on frequencies using R, noting cavalierly that I’d done similar development in Python/Pandas. I wasn’t lying, but the pertinent work I dug up from two years ago was less proof and more concept. Of course, R and Python are the two current language leaders for data science computing, while Pandas is to Python as data.table and tidyverse are to R for data management: everything. So I took on the challenge of extending the work I’d started in Pandas to replicate the frequencies functionality I’d developed in R. I was able to demonstrate to my satisfaction how it might be done, but not before running into several pitfalls.

Smart Compose: Using Neural Networks to Help Write Emails

How to Organize Data Labeling for Machine Learning: Approaches and Tools

If there was a data science hall of fame, it would have a section dedicated to labeling. The labelers’ monument could be Atlas holding that large rock symbolizing their arduous, detail-laden responsibilities. ImageNet, an image database, would deserve its own stele. For nine years, its contributors manually annotated more than 14 million images. Just thinking about it makes you tired. While labeling is not launching a rocket into space, it’s still serious business. Labeling is an indispensable stage of data preprocessing in supervised learning. Historical data with predefined target attributes (values) is used for this model training style. An algorithm can only find target attributes if a human mapped them. Labelers must be extremely attentive because each mistake or inaccuracy negatively affects a dataset’s quality and the overall performance of a predictive model. How do you get a high-quality labeled dataset without getting grey hair? The main challenge is to decide who will be responsible for labeling, estimate how much time it will take, and choose which tools to use. We briefly described labeling in the article about the general structure of a machine learning project. Here we will talk more about labeling approaches, techniques, and tools.

Enterprise Dashboards with R Markdown

We have been living with spreadsheets for so long that most office workers think it is obvious that spreadsheets generated with programs like Microsoft Excel make it easy to understand data and communicate insights. Everyone in a business, from the newest intern to the CEO, has had some experience with spreadsheets. But using Excel as the de facto analytic standard is problematic. Relying exclusively on Excel produces environments where it is almost impossible to organize and maintain efficient operational workflows. In addition to fostering low productivity, organizations risk profits and reputations in an age where insightful analyses and process control translate to a competitive advantage. Most organizations want better control over accessing, distributing, and processing data. You can use the R programming language, along with R Markdown reports and RStudio Connect, to build enterprise dashboards that are robust, secure, and manageable.

Advances in Machine Learning and Data Science – Recent Achievements and Research Directives

• Optimization of Adaptive Resonance Theory Neural Network Using Particle Swarm Optimization Technique
• Accelerating Airline Delay Prediction-Based P-CUDA Computing Environment
• IDPC-XML: Integrated Data Provenance Capture in XML
• Learning to Classify Marathi Questions and Identify Answer Type Using Machine Learning Technique
• A Dynamic Clustering Algorithm for Context Change Detection in Sensor-Based Data Stream System
• Predicting High Blood Pressure Using Decision Tree-Based Algorithm
• Design of Low-Power Area-Efficient Shift Register Using Transmission Gate
• Prediction and Analysis of Liver Patient Data Using Linear Regression Technique
• Image Manipulation Detection Using Harris Corner and ANMS
• Spatial Co-location Pattern Mining Using Delaunay Triangulation
• Review on RBFNN Design Approaches: A Case Study on Diabetes Data
• Keyphrase and Relation Extraction from Scientific Publications
• Mixing and Entrainment Characteristics of Jet Control with Crosswire
• GCV-Based Regularized Extreme Learning Machine for Facial Expression Recognition
• Prediction of Social Dimensions in a Heterogeneous Social Network
• Game Theory-Based Defense Mechanisms of Cyber Warfare
• Challenges Inherent in Building an Intelligent Paradigm for Tumor Detection Using Machine Learning Algorithms
• Segmentation Techniques for Computer-Aided Diagnosis of Glaucoma: A Review
• Performance Analysis of Information Retrieval Models on Word Pair Index Structure
• Fast Fingerprint Retrieval Using Minutiae Neighbor Structure
• Key Leader Analysis in Scientific Collaboration Network Using H-Type Hybrid Measures
• A Graph-Based Method for Clustering of Gene Expression Data with Detection of Functionally Inactive Genes and Noise
• OTAWE-Optimized Topic-Adaptive Word Expansion for Cross Domain Sentiment Classification on Tweets
• DCaP—Data Confidentiality and Privacy in Cloud Computing: Strategies and Challenges
• Design and Development of a Knowledge-Based System for Diagnosing Diseases in Banana Plants
• A Review on Methods Applied on P300-Based Lie Detectors
• Implementation of Spectral Subtraction Using Sub-band Filtering in DSP C6748 Processor for Enhancing Speech Signal
• In-silico Analysis of LncRNA-mRNA Target Prediction
• Energy Aware GSA-Based Load Balancing Method in Cloud Computing Environment
• Relative Performance Evaluation of Ensemble Classification with Feature Reduction in Credit Scoring Datasets
• Family-Based Algorithm for Recovering from Node Failure in WSN
• Classification-Based Clustering Approach with Localized Sensor Nodes in Heterogeneous WSN (CCL)
• Multimodal Biometric Authentication System Using Local Hand Features
• Automatic Semantic Segmentation for Change Detection in Remote Sensing Images
• A Model for Determining Personality by Analyzing Off-line Handwriting
• Wavelength-Convertible Optical Switch Based on Cross-Gain Modulation Effect of SOA
• Contrast Enhancement Algorithm for IR Thermograms Using Optimal Temperature Thresholding and Contrast Stretching
• Data Deduplication and Fine-Grained Auditing on Big Data in Cloud Storage

Probability and Statistics – Cookbook

This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature and in-class material from courses of the statistics department at the University of California in Berkeley but also influenced by other sources.


R Packages worth a look

Apply Mapping Functions in Parallel using Futures (furrr)
Implementations of the family of map() functions from ‘purrr’ that can be resolved using any ‘future’-supported backend, e.g. parallel on the local machine or distributed on a compute cluster.

The Free Group (freegroup)
Provides functionality for manipulating elements of the free group (juxtaposition is represented by a plus) including inversion, multiplication by a scalar, group-theoretic power operation, and Tietze forms. The package is fully vectorized.

K-Distribution and Weibull Paper (kdist)
Density, distribution function, quantile function and random generation for the K-distribution. A plotting function that plots data on Weibull paper and another function to draw additional lines. See results from package in T Lamont-Smith (2018), submitted J. R. Stat. Soc.


NYC restaurants reviews and inspection scores

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)


If you ever pass a restaurant in New York City, you’ll notice a prominently displayed letter grade. Since July 2010, the Health Department has required restaurants to post letter grades showing sanitary inspection results.

An A grade attests to top marks for health and safety, so you can feel secure about eating there. But it doesn’t tell you whether you will enjoy the food or experience courteous service. To find that out, you’d refer to the restaurant reviews. For this project, I carried out a simple data analysis and visualization of NYC restaurant reviews and inspection scores to find out if there is any correlation between the two. The data also show which types of cuisine and which NYC locations tend to attract more ratings.

Nowadays, reviews, ratings, and grades are key measures of a business's quality, popularity, and future success. For the restaurant business, ratings, hygiene, and cleanliness are essential. A popular site for reviews, Yelp, offers many individual ratings for restaurants. The New York City Department of Health and Mental Hygiene (DOHMH) conducts unannounced restaurant inspections annually. They check whether the food handling, food temperature, personal hygiene of workers, and vermin control of the restaurants are in compliance with hygiene standards. The scoring and grading process can be found here.

The restaurant ratings and location information used in this project come from Yelp’s API. The inspection data was downloaded from the NYC Open Data website. I merged the Yelp review data with the inspection data and removed rows with NA values, i.e. rows that lacked either an inspection score or reviews. I also mapped the inspection scores to the grade categories A, B, and C, as this measure is widely used and is the label posted on restaurants. There were other grades, primarily P or Z, or some version of grade pending, which we ignore in this analysis. Restaurants with a score between 0 and 13 points earn an A, those with 14 to 27 points receive a B, and those with 28 or more a C.
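
The cleaning and grading step described above can be sketched as follows. The thresholds come from the paragraph above; the row layout, restaurant names, and values are invented for illustration, and this is not the code behind the original analysis (which was done in R).

```python
def inspection_grade(score):
    """Map a numeric inspection score to a letter grade: 0-13 A, 14-27 B, 28+ C."""
    if score < 0:
        raise ValueError("inspection scores are non-negative")
    if score <= 13:
        return "A"
    if score <= 27:
        return "B"
    return "C"

# Hypothetical merged rows: (restaurant, inspection score, review rating); None marks NA.
rows = [
    ("Cafe One", 9, 4.0),
    ("Diner Two", 21, 4.5),
    ("Grill Three", None, 3.5),   # no inspection score: dropped
    ("Bistro Four", 35, None),    # no rating: dropped
    ("Deli Five", 28, 2.0),
]

# Keep only rows that have both an inspection score and a rating, then grade them.
complete = [r for r in rows if r[1] is not None and r[2] is not None]
graded = [(name, inspection_grade(score), rating) for name, score, rating in complete]
print(graded)
```

Note that a lower score is better here: points are assigned for violations, so 0 to 13 is the cleanest band.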


The data shows that an A is the most commonly assigned inspection grade for restaurants of all types in all locations. I plotted various bar plots to visualize the inspection scores and ratings by borough and cuisine type.

With respect to location, the borough bar plot shows that Manhattan has the highest number of restaurants in every grade compared to the other boroughs. This is expected, as it has the highest number of restaurants in general. Staten Island has the lowest number of restaurants with grades A, B, and C.

As for cuisine types, the cuisine plot shows the 15 cuisines with the highest restaurant counts. It indicates that American cuisine has the highest number of A grades compared to the others. This suggests that American restaurants focus more on hygiene and cleanliness than other types of restaurants.


The review plot indicates that most restaurants do achieve the top rating of 4 stars. Again, Manhattan has the highest number of restaurants with four-star ratings, while Staten Island has the lowest number of restaurants with high ratings. It also shows that almost all boroughs have a low number of 2-star restaurants. Moreover, the cuisine review plot indicates that American cuisine tends to have the highest ratings compared to other cuisines. The reason could be that there are more American restaurants than restaurants in other categories.


The scatter plots show the relationship between inspection score and rating. They indicate that there is no clear, direct correlation between the two variables. It is fairly common for a restaurant with a C grade inspection score to achieve a 4-5 star rating in reviews. It is also possible to find a number of A grade restaurants that only have 1-2 stars. This could be because, so long as the food is tasty, people will rate the restaurant well, as they do not pay very much attention to cleanliness and hygiene issues. The scatter plots also show that although some restaurants maintain a very high level of cleanliness and hygienic food conditions, they fail to get good ratings, which could be due to bad service or less than tasty food. We could do further analysis on both groups of restaurants by analyzing review comments to find out why some restaurants have good reviews but poor inspection scores and vice versa. This would require further data on review comments and further analysis using NLP.



The cluster map of NYC restaurants helps visualize locations and filter the restaurants by cuisine type. The color of each point indicates the rating, and the points include descriptions of the featured restaurants. The heat map shows the density of restaurants for the selected borough or cuisine. It indicates which areas have a greater number of restaurants. This could help business people make informed decisions about where to open new restaurants based on the types of restaurants already in place.

Finally, this app can be useful for people who want to filter the data by borough, cuisine, rating, and inspection grade. People who want to eat out with specific criteria in mind can filter the restaurants and pick favorites with top marks for both ratings and inspection grades. The shiny app link is here.



Thanks for reading!