My Data Science Blogs

October 20, 2018

R Packages worth a look

WHOIS Server Querying (Rwhois)
Queries data from WHOIS servers.

A Shiny Application for Automatic Measurements of Tree-Ring Widths on Digital Images (MtreeRing)
Use morphological image processing and edge detection algorithms to automatically identify tree-ring boundaries on digital images. Tree-ring boundaries …

Functions for the Lognormal Distribution (lognorm)
The lognormal distribution (Limpert et al. (2001) <doi:10.1641/0006-3568(2001)051[0341:lndats];2>) can characterize uncertainty that is bounde …


Book Memo: “Asymmetric Kernel Smoothing”

Theory and Applications in Economics and Finance
This is the first book to provide an accessible and comprehensive introduction to a newly developed smoothing technique using asymmetric kernel functions. Further, it discusses the statistical properties of estimators and test statistics using asymmetric kernels. The topics addressed include the bias-variance tradeoff, smoothing parameter choices, achieving rate improvements with bias reduction techniques, and estimation with weakly dependent data. Further, the large- and finite-sample properties of estimators and test statistics smoothed by asymmetric kernels are compared with those smoothed by symmetric kernels. Lastly, the book addresses the applications of asymmetric kernel estimation and testing to various forms of nonnegative economic and financial data. Until recently, the most popularly chosen nonparametric methods used symmetric kernel functions to estimate probability density functions of symmetric distributions with unbounded support. Yet many types of economic and financial data are nonnegative and violate the presumed conditions of conventional methods. Examples include incomes, wages, short-term interest rates, and insurance claims. Such observations are often concentrated near the boundary and have long tails with sparse data. Smoothing with asymmetric kernel functions has increasingly gained attention, because the approach successfully addresses the issues arising from distributions that have natural boundaries at the origin and heavy positive skewness. Offering an overview of recently developed kernel methods, complemented by intuitive explanations and mathematical proofs, this book is highly recommended to all readers seeking an in-depth and up-to-date guide to nonparametric estimation methods employing asymmetric kernel smoothing.


Magister Dixit

“I basically can’t hire people who don’t know Git.” Eric Jonas


The space race is dominated by new contenders

Private businesses and rising powers are replacing the cold-war duopoly


Supreme Court justices are increasingly political

Donald Trump’s nominee is likely to accelerate the pace


Python could become the world’s most popular coding language

But its rivals are unlikely to disappear


Science and Technology links (October 20th, 2018)

  1. Should we stop eating meat to combat climate change? Maybe not. White and Hall worked out what happened if the US stopped using farm animals:

    The modeled system without animals (…) only reduced total US greenhouse gas emissions by 2.6 percentage units. Compared with systems with animals, diets formulated for the US population in the plants-only systems (…) resulted in a greater number of deficiencies in essential nutrients. (source: PNAS)

    Of concern when considering farm animals are methane emissions. Methane is a potent greenhouse gas, with the caveat that, unlike CO2, it is short-lived in the atmosphere. Should we be worried about methane despite its short life? According to the American EPA (Environmental Protection Agency), total methane emissions have been falling consistently for the last 20 years. That should not surprise us: greenhouse gas emissions in most developed countries (including the US) peaked some time ago. Not emissions per capita, but total emissions.

    So beef, at least in the US, is not a major contributor to climate change. But we could do even better. Several studies like Stanley et al. report that well managed grazing can lead to carbon sequestration in the grassland.

    There are certainly countries where animal grazing is an environmental disaster. Many industries throughout the world are a disaster and we should definitely put pressure on the guilty parties. But, in case you were wondering, if you live in a country like Canada, McDonald’s not only serves only locally produced beef, but also requires that it be produced in a sustainable manner.

    In any case, there are good reasons to stop eating meat, but in developed countries like the US and Canada, climate change seems like a bogus one.

    (Special thanks to professor Leroy for providing many useful pointers.)

  2. News agencies reported this week that climate change could bring back the plague and the black death that wiped out Europe. The widely reported prediction was made by Professor Peter Frankopan while at the Cheltenham Literary Festival. Frankopan is a history professor at Oxford.
  3. There is a reverse correlation between funding and scientific output, meaning that beyond a certain point, you start getting less science for your dollars.

    (…) prestigious institutions had on average 65% higher grant application success rates and 50% larger award sizes, whereas less-prestigious institutions produced 65% more publications and had a 35% higher citation impact per dollar of funding. These findings suggest that implicit biases and social prestige mechanisms (…) have a powerful impact on where (…) grant dollars go and the net return on taxpayers investments.

    It is well documented that there are diminishing returns in research funding. Concentrating your research dollars into too few individuals is wasteful. My own explanation for this phenomenon is that, Elon Musk aside, we all have cognitive bottlenecks. One researcher might fruitfully carry two or three major projects at the same time, but once they supervise too many students and assistants, they become a “negative manager”, meaning that they make other researchers no more productive and often less productive. They spend less and less time optimizing the tools and instruments.

    If you talk with graduate students who work in lavishly funded laboratories, you will often hear (when the door is closed) about how poorly managed the projects are. People are forced into stupid directions, they do boring and useless work to satisfy project objectives that no longer make sense. Currently, “success” is often defined by how quickly you can acquire and spend money.

    But how do you optimally distribute research dollars? It is tricky because, almost by definition, almost all research is worthless. You are mining for rare events. So it is akin to venture capital investing. You want to invest into many start ups that have a high potential.

  4. A Nature column tries to define what makes a good PhD student:

    the key attributes needed to produce a worthy PhD thesis are a readiness to accept failure; resilience; persistence; the ability to troubleshoot; dedication; independence; and a willingness to commit to very hard work — together with curiosity and a passion for research. The two most common causes of hardship in PhD students are an inability to accept failure and choosing this career path for the prestige, rather than out of any real interest in research.


Table of Contents for PIM

I am down to the home stretch for publishing my upcoming book, “A Programmer’s Introduction to Mathematics.” I don’t have an exact publication date—I’m self publishing—but after months of editing, I’ve only got two chapters left in which to apply edits that I’ve already marked up in my physical copy. That and some notes from external reviewers, and adding jokes and anecdotes and fun exercises as time allows.

I’m committing to publishing by the end of the year. When that happens I’ll post here and also on the book’s mailing list. Here’s a sneak preview of the table of contents, and a shot of the cover design (still a work in progress).



A Lazy Function

(This article was first published on CillianMacAodh, and kindly contributed to R-bloggers)


Dr. Data Show Video: How Can You Trust AI?

This new web series breaks the mold for data science infotainment, captivating the planet with short webisodes that cover the very best of machine learning and predictive analytics.


He’s a history teacher and he has a statistics question

Someone named Ian writes:

I am a History teacher who has become interested in statistics! The main reason for this is that I’m reading research papers about teaching practices to find out what actually “works.”

I’ve taught myself the basics of null hypothesis significance testing, though I confess I am no expert (Maths was never my strong point at school!). But I also came across your blog after I heard about this “replication crisis” thing.

I wanted to ask you a question, if I may.

Suppose a randomised controlled experiment is conducted with two groups and the mean difference turns out to be statistically significant at the .05 level. I’ve learnt from my self-study that this means:

“If there were genuinely no difference in the population, the probability of getting a result this big or bigger is less than 5%.”

So far, so good (or so I thought).

But from my recent reading, I’ve gathered that many people criticise studies for using “small samples.” What was interesting to me is that they criticise this even after a significant result has been found.

So they’re not saying “Your sample size was small so that may be why you didn’t find a significant result.” They’re saying: “Even though you did find a significant result, your sample size was small so your result can’t be trusted.”

I was just wondering whether you could explain why one should distrust significant results with small samples? Some people seem to be saying it’s because it may have been a chance finding. But isn’t that what the p-value is supposed to tell you? If p is less than 0.05, doesn’t that mean I can assume it (probably) wasn’t a “chance finding”?

My reply: See my paper, “The failure of null hypothesis significance testing when studying incremental changes, and what to do about it,” recently published in the Personality and Social Psychology Bulletin. The short answer is that (a) it’s not hard to get p less than 0.05 just from chance, via forking paths, and (b) when effect sizes are small and a study is noisy, any estimate that reaches “statistical significance” is likely to be an overestimate, perhaps a huge overestimate.
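
Point (b) can be illustrated with a small simulation. This is only a sketch under made-up assumptions (a true effect of 0.1 standard deviations, 20 people per group, a known standard deviation); none of these numbers come from the paper. It keeps only the simulated studies that reach p < 0.05 and compares their estimates with the truth.

```python
import random
import statistics

def simulate_significant_estimates(true_effect, sd, n, trials, seed=0):
    """Return the estimated effects from simulated studies that reached p < 0.05."""
    rng = random.Random(seed)
    # Standard error of a difference in means with n per group (sd treated as known)
    se = sd * (2 / n) ** 0.5
    significant = []
    for _ in range(trials):
        estimate = rng.gauss(true_effect, se)  # sampling distribution of the estimate
        if abs(estimate) > 1.96 * se:          # two-sided test at the .05 level
            significant.append(estimate)
    return significant

def exaggeration_ratio(estimates, true_effect):
    """Average magnitude of the significant estimates relative to the truth."""
    return statistics.mean(abs(e) for e in estimates) / true_effect

sig = simulate_significant_estimates(true_effect=0.1, sd=1.0, n=20, trials=5000)
print(len(sig), "of 5000 simulated studies were significant")
print("average exaggeration factor:", round(exaggeration_ratio(sig, 0.1), 1))
```

With these assumed numbers the standard error is about 0.32, so an estimate has to exceed roughly 0.62 in magnitude to be significant. Every significant result is therefore more than six times the true effect, and some have the wrong sign.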

The post He’s a history teacher and he has a statistics question appeared first on Statistical Modeling, Causal Inference, and Social Science.


Document worth reading: “Deep learning: Technical introduction”

This note presents in a technical though hopefully pedagogical way the three most common forms of neural network architectures: Feedforward, Convolutional and Recurrent. For each network, their fundamental building blocks are detailed. The forward pass and the update rules for the backpropagation algorithm are then derived in full. Deep learning: Technical introduction


Magister Dixit

“Regardless, it’s clear that Spark is a technology you can’t afford to ignore if you’re looking into modern processing of big datasets.” Donnie Berkholz (March 13, 2015)


Basics of Entity Resolution

Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism.

Unfortunately, the problems associated with entity resolution are equally big — as the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes increasingly difficult. Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further. When combined, such challenges amount to a substantial barrier to organizations’ ability to fully understand their data, let alone make effective use of predictive analytics to optimize targeting, thresholding, and resource management.

Naming Your Problem

Let us first consider what an entity is. Much as the key step in machine learning is to determine what an instance is, the key step in entity resolution is to determine what an entity is. Let's define an entity as a unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.). That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites. If we want to ask questions about all the unique people, or businesses, or products in a dataset, we must find a method for producing an annotated version of that dataset that contains unique entities.

How can we tell that these multiple references point to the same entity? What if the attributes for each entity aren't the same across references? What happens when there are more than two or three or ten references to the same entity? Which one is the main (canonical) version? Do we just throw the duplicates away?

Each question points to a single problem, albeit one that frequently goes unnamed. Ironically, one of the problems in entity resolution is that even though it goes by a lot of different names, many people who struggle with entity resolution do not know the name of their problem.

The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

  1. Deduplication: eliminating duplicate (exact) copies of repeated data.
  2. Record linkage: identifying records that reference the same entity across different sources.
  3. Canonicalization: converting data with more than one possible representation into a standard form.
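
To make the three tasks concrete, here is a toy sketch in plain Python. The records, field names, and the choice of phone number as the linking key are all made up for illustration; real entity resolution (and Dedupe in particular) relies on fuzzy, learned comparisons rather than exact keys.

```python
import re

def canonical_phone(raw):
    """Canonicalization: reduce a phone number to its digits only."""
    return re.sub(r"\D", "", raw)

def deduplicate(records):
    """Deduplication: drop exact copies while preserving the original order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def link(source_a, source_b):
    """Record linkage: pair records across sources that share a canonical phone."""
    index = {}
    for rec in source_a:
        index.setdefault(canonical_phone(rec["phone"]), []).append(rec)
    return [(a, b)
            for b in source_b
            for a in index.get(canonical_phone(b["phone"]), [])]

# Hypothetical data: two sources describing the same person differently
crm = [{"name": "Ann Lee", "phone": "(202) 555-0101"},
       {"name": "Ann Lee", "phone": "(202) 555-0101"}]
orders = [{"name": "A. Lee", "phone": "202.555.0101"}]

for a, b in link(deduplicate(crm), orders):
    print(a["name"], "<->", b["name"])
```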

Entity resolution is not a new problem, but thanks to Python and new machine learning libraries, it is an increasingly achievable objective. This post explores some basic approaches to entity resolution using one of those tools, the Python Dedupe library: its basic functionality, how the library works under the hood, and a demonstration on two different datasets.

About Dedupe

Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn't the only tool available in Python for doing entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as its primary task. In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets. Dedupe also scales fairly well: in this post we demonstrate using the library with a relatively small dataset of a few thousand records and a very large dataset of several million.

How Dedupe Works

Effective deduplication relies largely on domain expertise. This is for two main reasons: first, because domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they've never seen it in practice. Second, domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look. As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.

Testing Out Dedupe

Getting started with Dedupe is easy, and the developers have provided a convenient repo with examples that you can use and iterate on. Let's start by walking through the csv_example from the dedupe-examples repo. To get Dedupe running, we'll need to install unidecode, future, and dedupe.

In your terminal (we recommend doing so inside a virtual environment):

git clone <URL of the dedupe-examples repository>
cd dedupe-examples

pip install unidecode
pip install future
pip install dedupe

Then we'll run the file to see what dedupe can do:


Blocking and Affine Gap Distance

Let's imagine we own an online retail business, and we are developing a new recommendation engine that mines our existing customer data to come up with good recommendations for products that our existing and new customers might like to buy. Our dataset is a purchase history log where customer information is represented by attributes like name, telephone number, address, and order history. The database we've been using to log purchases assigns a new unique ID for every customer interaction.

But it turns out we're a great business, so we have a lot of repeat customers! We'd like to be able to aggregate the order history information by customer so that we can build a good recommender system with the data we have. That aggregation is easy if every customer's information is duplicated exactly in every purchase log. But what if it looks something like the table below?

Silvrback blog image

How can we aggregate the data so that it is unique to the customer rather than the purchase? Features in the data set like names, phone numbers, and addresses will probably be useful. What is notable is that there are numerous variations for those attributes, particularly in how names appear — sometimes as nicknames, sometimes even misspellings. What we need is an intelligent and mostly automated way to create a new dataset for our recommender system. Enter Dedupe.

When comparing records, rather than treating each record as a single long string, Dedupe cleverly exploits the structure of the input data to instead compare the records field by field. The advantage of this approach is more pronounced when certain feature vectors of records are much more likely to assist in identifying matches than other attributes. Dedupe lets the user nominate the features they believe will be most useful:

fields = [
    {'field' : 'Name', 'type': 'String'},
    {'field' : 'Phone', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Address', 'type': 'String', 'has missing' : True},
    {'field' : 'Purchases', 'type': 'String'},
]
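
As a rough illustration of what comparing field by field means, the sketch below scores two records one field at a time. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; that is an assumption made for the example, not the comparator Dedupe uses, and the two records are invented.

```python
from difflib import SequenceMatcher

def compare_fields(rec_a, rec_b, fields):
    """Return one similarity score per field instead of one score per record."""
    scores = {}
    for field in fields:
        a, b = rec_a.get(field), rec_b.get(field)
        if not a or not b:
            scores[field] = None  # missing value: no evidence either way
        else:
            scores[field] = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return scores

# Hypothetical purchase-log entries for what may be the same customer
first = {"Name": "Robert Smith", "Phone": "2025550101", "Address": ""}
second = {"Name": "Rob Smith", "Phone": "2025550101", "Address": "12 Main St"}

print(compare_fields(first, second, ["Name", "Phone", "Address"]))
```

Keeping the scores separate is what lets a model later learn that, say, a matching phone number is stronger evidence than a similar name.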

Dedupe scans the data to create tuples of records that it will propose to the user to label as being either matches, not matches, or possible matches. These uncertainPairs are identified using a combination of blocking, affine gap distance, and active learning.

Blocking is used to reduce the number of overall record comparisons that need to be made. Dedupe's method of blocking involves engineering subsets of feature vectors (these are called 'predicates') that can be compared across records. In the case of our people dataset above, the predicates might be things like:

  • the first three digits of the phone number
  • the full name
  • the first five characters of the name
  • a random 4-gram within the city name

Records are then grouped, or blocked, by matching predicates so that only records with matching predicates will be compared to each other during the active learning phase. The blocks are developed by computing the edit distance between predicates across records. Dedupe uses a distance metric called affine gap distance, a variation on edit distance in which each consecutive deletion or insertion after the first is cheaper.
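
To make the idea of an affine gap concrete, here is a minimal sketch using the standard three-table dynamic program (often attributed to Gotoh). The cost values (`mismatch`, `gap_open`, `gap_extend`) are illustrative assumptions, not Dedupe's actual weights, and Dedupe ships its own optimized implementation.

```python
def affine_gap_distance(a, b, mismatch=1.0, gap_open=1.0, gap_extend=0.5):
    """Edit distance where the first character of a gap costs gap_open
    and every further consecutive gap character costs gap_extend."""
    INF = float("inf")
    n, m = len(a), len(b)
    # M: alignment ends in a match/mismatch
    # X: alignment ends with a character of `a` deleted
    # Y: alignment ends with a character of `b` inserted
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            X[i][j] = min(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = min(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return min(M[n][m], X[n][m], Y[n][m])

# A nickname is a prefix of the full name, so the difference is one long gap
print(affine_gap_distance("Rob", "Robert"))
print(affine_gap_distance("Rob", "Robert", gap_extend=1.0))  # plain edit distance
```

Setting `gap_extend` equal to `gap_open` recovers ordinary edit distance, which is a handy sanity check. The cheaper extension is what makes truncations and nicknames such as Rob/Robert look closer than three unrelated typos would.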


Therefore, we might have one blocking method that groups all of the records that have the same area code of the phone number. This would result in three predicate blocks: one with a 202 area code, one with a 334, and one with NULL. There would be two records in the 202 block (IDs 452 and 821), two records in the 334 block (IDs 233 and 699), and one record in the NULL area code block (ID 720).
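
The area-code example can be sketched in a few lines. The record IDs are the ones mentioned above, but the phone numbers themselves are invented for illustration, and `area_code` and `block_by` are hypothetical helper names rather than part of Dedupe.

```python
from collections import defaultdict
from itertools import combinations

def area_code(record):
    """Predicate: the first three digits of the phone number, or None if missing."""
    phone = record.get("phone")
    return phone[:3] if phone else None

def block_by(records, predicate):
    """Group record IDs by the value the predicate returns for each record."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[predicate(record)].append(record_id)
    return dict(blocks)

def candidate_pairs(blocks):
    """Only records that share a block are ever compared with each other."""
    return [pair for ids in blocks.values() for pair in combinations(ids, 2)]

# Made-up phone numbers attached to the IDs from the example
records = {
    452: {"phone": "2025550143"},
    821: {"phone": "2025550178"},
    233: {"phone": "3345550112"},
    699: {"phone": "3345550199"},
    720: {"phone": None},
}

blocks = block_by(records, area_code)
print(blocks)
print(candidate_pairs(blocks))
```

Without blocking, five records would need ten pairwise comparisons; with this single predicate only two pairs remain. That saving is what makes the approach workable on millions of rows.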


The relative weight of these different feature vectors can be learned during the active learning process and expressed numerically to ensure that features that will be most predictive of matches will be heavier in the overall matching schema. As the user labels more and more tuples, Dedupe gradually relearns the weights, recalculates the edit distances between records, and updates its list of the most uncertain pairs to propose to the user for labeling.

Once the user has generated enough labels, the learned weights are used to calculate the probability that each pair of records within a block is a duplicate or not. In order to scale the pairwise matching up to larger tuples of matched records (in the case that entities may appear more than twice within a document), Dedupe uses hierarchical clustering with centroidal linkage. Records within some threshold distance of a centroid will be grouped together. The final result is an annotated version of the original dataset that now includes a centroid label for each record.

Active Learning

You can see that dedupe is a command line application that will prompt the user to engage in active learning by showing pairs of entities and asking if they are the same or different.

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished

Active learning is the so-called special sauce behind Dedupe. As in most supervised machine learning tasks, the challenge is to get labeled data that the model can learn from. The active learning phase in Dedupe is essentially an extended user-labeling session, which can be short if you have a small dataset and can take longer if your dataset is large. You are presented with four options:

  • (y)es: confirms that the two references are to the same entity
  • (n)o: labels the two references as not the same entity
  • (u)nsure: does not label the two references as the same entity or as different entities
  • (f)inished: ends the active learning session and triggers the supervised learning phase

You can experiment with typing the y, n, and u keys to flag duplicates for active learning. When you are finished, enter f to quit.
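
The core idea of active learning, asking the user about the pairs the model is least sure of, can be sketched as below. This is a simplified illustration of uncertainty sampling and not Dedupe's internal logic; the pair IDs and probabilities are made up.

```python
def most_uncertain_pairs(scored_pairs, k=1):
    """Return the k pairs whose predicted match probability is closest to 0.5.

    scored_pairs is an iterable of ((id_a, id_b), probability_of_match).
    """
    ranked = sorted(scored_pairs, key=lambda item: abs(item[1] - 0.5))
    return [pair for pair, _ in ranked[:k]]

# Hypothetical model output for three candidate pairs
scores = [((452, 821), 0.97),   # almost certainly a match
          ((233, 699), 0.52),   # the model cannot decide: worth asking about
          ((452, 233), 0.08)]   # almost certainly distinct

print(most_uncertain_pairs(scores))
```

A label on the 0.52 pair teaches the model far more than confirming a pair it already scores at 0.97, which is why a fairly short labeling session can be enough.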

Silvrback blog image

As you can see in the example above, some comparison decisions are very easy. The first pair matches on none of the four attributes being examined, so the verdict is most certainly a non-match. In the second, three of the four attributes match exactly, and the fourth is a fuzzy match in that one entity contains a piece of the other: Ryerson vs. Chicago Public Schools Ryerson. A human would be able to discern these as two references to the same entity, and we can label it as such to enable the supervised learning that comes after the active learning.

The csv_example also includes an evaluation script that will enable you to determine how successfully you were able to resolve the entities. It's important to note that the blocking, active learning and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection. In the csv_example, the script nominates the following four attributes:

fields = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Phone', 'type': 'String', 'has missing' : True},
]

A different combination of attributes would result in a different blocking, a different set of uncertainPairs, a different set of features to use in the active learning phase, and almost certainly a different result. In other words, user experience and domain knowledge factor in heavily at multiple phases of the deduplication process.

Something a Bit More Challenging

In order to try out Dedupe with a more challenging project, we decided to try out deduplicating the White House visitors' log. Our hypothesis was that it would be interesting to be able to answer questions such as "How many times has person X visited the White House during administration Y?" However, in order to do that, it would be necessary to generate a version of the list that contained unique entities. We guessed that there would be many cases where there were multiple references to a single entity, potentially with slight variations in how they appeared in the dataset. We also expected to find a lot of names that seemed similar but in fact referenced different entities. In other words, a good challenge!

The data set we used was pulled from the website, a part of the executive initiative to make federal data more open to the public. This particular set of data is a list of White House visitor record requests from 2006 through 2010. Here's a snapshot of what the data looks like via the White House API.

Silvrback blog image

The dataset includes a lot of columns, and for most of the entries, the majority of these fields are blank:

Database Field      Field Description
NAMELAST            Last name of entity
NAMEFIRST           First name of entity
NAMEMID             Middle name of entity
UIN                 Unique Identification Number
BDGNBR              Badge Number
Type of Access      Access type to White House
TOA                 Time of arrival
POA                 Post on arrival
TOD                 Time of departure
POD                 Post on departure
APPT_MADE_DATE      When the appointment was made
APPT_START_DATE     When the appointment is scheduled to start
APPT_END_DATE       When the appointment is scheduled to end
APPT_CANCEL_DATE    When the appointment was canceled
Total_People        Total number of people scheduled to attend
LAST_UPDATEDBY      Who last updated this event
POST                Classified as 'WIN'
LastEntryDate       When this instance was last updated
TERMINAL_SUFFIX     ID for terminal used to process visitor
visitee_namelast    The visitee's last name
visitee_namefirst   The visitee's first name
MEETING_LOC         The location of the meeting
MEETING_ROOM        The room number of the meeting
CALLER_NAME_LAST    Last name of the person who authorized the visitor
CALLER_NAME_FIRST   First name of the person who authorized the visitor
CALLER_ROOM         Room of the person who authorized the visitor
Description         Description of the event or visit
RELEASE_DATE        The date this set of logs was released to the public

Loading the Data

Using the API, the White House Visitor Log Requests can be exported in a variety of formats, including .json, .csv, .xlsx, .pdf, .xml, and RSS. However, it's important to keep in mind that the dataset contains over 5 million rows. For this reason, we decided to use .csv and grabbed the data using requests:

import requests

def getData(url,fname):
    """
    Download the dataset from the webpage.
    """
    response = requests.get(url)
    # Write the raw bytes so the csv is saved exactly as served
    with open(fname, 'wb') as f:
        f.write(response.content)

ORIGFILE = "fixtures/whitehouse-visitors.csv"


Once downloaded, we can clean it up and load it into a database for more secure and stable storage.

Tailoring the Code

Next, we'll discuss what is needed to tailor a dedupe example to get the code to work for the White House visitors log dataset. The main challenge with this dataset is its sheer size. First, we'll need to import a few modules and connect to our database:

import csv
import psycopg2
from dateutil import parser
from datetime import datetime

conn = None

# Replace these placeholders with your own connection settings
DATABASE = 'your_db_name'
USER = 'your_user_name'
HOST = 'your_hostname'
PASSWORD = 'your_password'

try:
    conn = psycopg2.connect(database=DATABASE, user=USER, host=HOST, password=PASSWORD)
    print ("I've connected")
except psycopg2.Error:
    print ("I am unable to connect to the database")
    raise
cur = conn.cursor()

The other challenge with our dataset is the numerous missing values and datetime formatting irregularities. We wanted to be able to use the datetime strings to help with entity resolution, so we wanted to get the formatting to be as consistent as possible. The following script handles both the datetime parsing and the missing values by combining Python's dateutil module and PostgreSQL's fairly forgiving 'varchar' type.

This function takes the csv data in as input, parses the datetime fields we're interested in ('apptmade', 'apptstart', 'apptend'), and outputs a database table that retains the desired columns ('lastname', 'firstname', 'uin', 'apptmade', 'apptstart', 'apptend', 'meeting_loc'). Keep in mind this will take a while to run.

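# DATEFIELDS is not defined in the listing. These column positions are an
# assumption taken from the apptmade/apptstart/apptend indexes in the INSERT below.
DATEFIELDS = (10, 11, 12)

def dateParseSQL(nfile):
    cur.execute('''CREATE TABLE IF NOT EXISTS visitors_er
                  (visitor_id SERIAL PRIMARY KEY,
                  lastname    varchar,
                  firstname   varchar,
                  uin         varchar,
                  apptmade    varchar,
                  apptstart   varchar,
                  apptend     varchar,
                  meeting_loc varchar);''')
    conn.commit()
    with open(nfile, 'r', newline='') as infile:
        reader = csv.reader(infile, delimiter=',')
        next(reader, None)  # skip the header row
        for row in reader:
            for field in DATEFIELDS:
                if row[field] != '':
                    try:
                        dt = parser.parse(row[field])
                        row[field] = dt.toordinal()  # We also tried dt.isoformat()
                    except (ValueError, OverflowError):
                        continue  # leave unparseable dates as they are
            sql = "INSERT INTO visitors_er(lastname,firstname,uin,apptmade,apptstart,apptend,meeting_loc) \
                   VALUES (%s,%s,%s,%s,%s,%s,%s)"
            cur.execute(sql, (row[0],row[1],row[3],row[10],row[11],row[12],row[21],))
    conn.commit()  # persist the inserted rows
    print ("All done!")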
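
For readers who want to see the date normalization step in isolation, here is a small standard-library sketch. It swaps dateutil's flexible parser for an explicit list of formats; the formats listed are assumptions made for the example and may not match what the real export contains.

```python
from datetime import datetime

# Assumed formats, for illustration only
FORMATS = ("%m/%d/%Y %H:%M", "%m/%d/%Y", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")

def to_ordinal(value):
    """Convert a date string to its proleptic Gregorian ordinal.

    Blank values and strings that match none of the formats are returned
    unchanged, mirroring how the loader leaves problem fields alone.
    """
    if value == "":
        return value
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).toordinal()
        except ValueError:
            continue
    return value

for raw in ("1/2/2010 14:30", "2010-01-02", "", "not a date"):
    print(repr(raw), "->", repr(to_ordinal(raw)))
```

Two differently formatted strings for the same day map to the same integer, which is the consistency we were after.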


About 60 of our rows had non-ASCII characters, which we dropped using this SQL command:

delete from visitors where firstname ~ '[^[:ascii:]]' OR lastname ~ '[^[:ascii:]]';

For our deduplication script, we modified the PostgreSQL example as well as Dan Chudnov's adaptation of the script for the OSHA dataset.

import tempfile
import argparse
import csv
import os

import dedupe
import psycopg2
from psycopg2.extras import DictCursor

Initially, we wanted to try to use the datetime fields to deduplicate the entities, but dedupe was not a big fan of the datetime fields, whether in isoformat or ordinal, so we ended up nominating the following fields:

KEY_FIELD = 'visitor_id'
SOURCE_TABLE = 'visitors'

FIELDS = [{'field': 'firstname', 'variable name': 'firstname',
           'type': 'String','has missing': True},
          {'field': 'lastname', 'variable name': 'lastname',
           'type': 'String','has missing': True},
          {'field': 'uin', 'variable name': 'uin',
           'type': 'String','has missing': True},
          {'field': 'meeting_loc', 'variable name': 'meeting_loc',
           'type': 'String','has missing': True}]

We modified a function Dan wrote to generate the predicate blocks:

def candidates_gen(result_set):
    lset = set
    block_id = None
    records = []
    i = 0
    for row in result_set:
        if row['block_id'] != block_id:
            if records:
                yield records

            block_id = row['block_id']
            records = []
            i += 1

            if i % 10000 == 0:
                print ('{} blocks'.format(i))

        smaller_ids = row['smaller_ids']
        if smaller_ids:
            smaller_ids = lset(smaller_ids.split(','))
        else:
            smaller_ids = lset([])

        records.append((row[KEY_FIELD], row, smaller_ids))

    if records:
        yield records
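
The grouping logic is easier to see on toy data. The sketch below is an alternative formulation using `itertools.groupby`, not the function used in the post; like the original, it assumes the rows arrive sorted by `block_id`. The sample rows are invented.

```python
from itertools import groupby

KEY_FIELD = 'visitor_id'

def candidates_by_block(rows):
    """Yield one list of (record_id, row, smaller_ids) tuples per block."""
    for _, block in groupby(rows, key=lambda r: r['block_id']):
        records = []
        for row in block:
            ids = row['smaller_ids']
            # smaller_ids arrives as a comma-separated string, or empty/None
            smaller = set(ids.split(',')) if ids else set()
            records.append((row[KEY_FIELD], row, smaller))
        yield records

# Hypothetical rows as they might come back from the blocking query
rows = [
    {'block_id': 1, 'visitor_id': 10, 'smaller_ids': ''},
    {'block_id': 1, 'visitor_id': 11, 'smaller_ids': '3,4'},
    {'block_id': 2, 'visitor_id': 12, 'smaller_ids': None},
]

for records in candidates_by_block(rows):
    print([record_id for record_id, _, _ in records])
```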

And we adapted the method from the dedupe-examples repo to handle the active learning, supervised learning, and clustering steps:

def find_dupes(args):
    deduper = dedupe.Dedupe(FIELDS)

    with psycopg2.connect(database=args.dbname,
                          cursor_factory=DictCursor) as con:
        with con.cursor() as c:
            c.execute('SELECT COUNT(*) AS count FROM %s' % SOURCE_TABLE)
            row = c.fetchone()
            count = row['count']
            sample_size = int(count * args.sample)

            print ('Generating sample of {} records'.format(sample_size))
            with con.cursor('deduper') as c_deduper:
                c_deduper.execute('SELECT visitor_id,lastname,firstname,uin,meeting_loc FROM %s' % SOURCE_TABLE)
                temp_d = dict((i, row) for i, row in enumerate(c_deduper))
                deduper.sample(temp_d, sample_size)

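            # Training calls below assume the dedupe 1.x API.
            if os.path.exists(args.training):
                print ('Loading training file from {}'.format(args.training))
                with open(args.training) as tf:
                    deduper.readTraining(tf)

            print ('Starting active learning')
            dedupe.consoleLabel(deduper)

            print ('Starting training')
            deduper.train(ppc=0.001, uncovered_dupes=5)

            print ('Saving new training file to {}'.format(args.training))
            with open(args.training, 'w') as training_file:
                deduper.writeTraining(training_file)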


            print ('Creating blocking_map table')
            c.execute("""
                DROP TABLE IF EXISTS blocking_map
                """)
            c.execute("""
                CREATE TABLE blocking_map
                (block_key VARCHAR(200), %s INTEGER)
                """ % KEY_FIELD)

            for field in deduper.blocker.index_fields:
                print ('Selecting distinct values for "{}"'.format(field))
                c_index = con.cursor('index')
                c_index.execute("""
                    SELECT DISTINCT %s FROM %s
                    """ % (field, SOURCE_TABLE))
                field_data = (row[field] for row in c_index)
                deduper.blocker.index(field_data, field)
                c_index.close()

            print ('Generating blocking map')
            c_block = con.cursor('block')
            c_block.execute("""
                SELECT * FROM %s
                """ % SOURCE_TABLE)
            full_data = ((row[KEY_FIELD], row) for row in c_block)
            b_data = deduper.blocker(full_data)

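            print ('Inserting blocks into blocking_map')
            csv_file = tempfile.NamedTemporaryFile(prefix='blocks_', mode='w',
                                                   delete=False)
            csv_writer = csv.writer(csv_file)
            csv_writer.writerows(b_data)
            csv_file.close()

            with open(csv_file.name, 'r') as f:
                c.copy_expert("COPY blocking_map FROM STDIN CSV", f)
            os.remove(csv_file.name)
            con.commit()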



            print ('Indexing blocks')
            c.execute("""
                CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)
                """)
            c.execute("DROP TABLE IF EXISTS plural_key")
            c.execute("DROP TABLE IF EXISTS plural_block")
            c.execute("DROP TABLE IF EXISTS covered_blocks")
            c.execute("DROP TABLE IF EXISTS smaller_coverage")

            print ('Calculating plural_key')
            c.execute("""
                CREATE TABLE plural_key
                (block_key VARCHAR(200),
                block_id SERIAL PRIMARY KEY)
                """)
            c.execute("""
                INSERT INTO plural_key (block_key)
                SELECT block_key FROM blocking_map
                GROUP BY block_key HAVING COUNT(*) > 1
                """)

            print ('Indexing block_key')
            c.execute("""
                CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)
                """)

            print ('Calculating plural_block')
            c.execute("""
                CREATE TABLE plural_block
                AS (SELECT block_id, %s
                FROM blocking_map INNER JOIN plural_key
                USING (block_key))
                """ % KEY_FIELD)

            print ('Adding {} index'.format(KEY_FIELD))
            c.execute("""
                CREATE INDEX plural_block_%s_idx
                    ON plural_block (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            c.execute("""
                CREATE UNIQUE INDEX plural_block_block_id_%s_uniq
                ON plural_block (block_id, %s)
                """ % (KEY_FIELD, KEY_FIELD))

            print ('Creating covered_blocks')
            c.execute("""
                CREATE TABLE covered_blocks AS
                    (SELECT %s,
                            string_agg(CAST(block_id AS TEXT), ','
                            ORDER BY block_id) AS sorted_ids
                     FROM plural_block
                     GROUP BY %s)
                 """ % (KEY_FIELD, KEY_FIELD))

            print ('Indexing covered_blocks')
            c.execute("""
                CREATE UNIQUE INDEX covered_blocks_%s_idx
                    ON covered_blocks (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            print ('Committing')
            con.commit()

            print ('Creating smaller_coverage')
            c.execute("""
                CREATE TABLE smaller_coverage AS
                    (SELECT %s, block_id,
                        TRIM(',' FROM split_part(sorted_ids,
                        CAST(block_id AS TEXT), 1))
                        AS smaller_ids
                     FROM plural_block
                     INNER JOIN covered_blocks
                     USING (%s))
                """ % (KEY_FIELD, KEY_FIELD))

            print ('Clustering...')
            c_cluster = con.cursor('cluster')
            c_cluster.execute("""
                SELECT *
                FROM smaller_coverage
                INNER JOIN %s
                    USING (%s)
                ORDER BY (block_id)
                """ % (SOURCE_TABLE, KEY_FIELD))
            clustered_dupes = deduper.matchBlocks(
                    candidates_gen(c_cluster), threshold=0.5)

            print ('Creating entity_map table')
            c.execute("DROP TABLE IF EXISTS entity_map")
            c.execute("""
                CREATE TABLE entity_map (
                    %s INTEGER,
                    canon_id INTEGER,
                    cluster_score FLOAT,
                    PRIMARY KEY(%s)
                )""" % (KEY_FIELD, KEY_FIELD))

            print ('Inserting entities into entity_map')
            for cluster, scores in clustered_dupes:
                cluster_id = cluster[0]
                for key_field, score in zip(cluster, scores):
                    c.execute("""
                        INSERT INTO entity_map
                            (%s, canon_id, cluster_score)
                        VALUES (%s, %s, %s)
                        """ % (KEY_FIELD, key_field, cluster_id, score))

            print ('Indexing head_index')
            c.execute("CREATE INDEX head_index ON entity_map (canon_id)")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--dbname', dest='dbname', default='whitehouse', help='database name')
    parser.add_argument('-s', '--sample', default=0.10, type=float, help='sample size (percentage, default 0.10)')
    parser.add_argument('-t', '--training', default='training.json', help='name of training file')
    args = parser.parse_args()
    find_dupes(args)

Active Learning Observations

We ran multiple experiments:

  • Test 1: lastname, firstname, meeting_loc => 447 (15 minutes of training)
  • Test 2: lastname, firstname, uin, meeting_loc => 3385 (5 minutes of training) - one instance that had 168 duplicates

We observed a lot of uncertainty during the active learning phase, mostly because of how enormous the dataset is. This was particularly pronounced with names that seemed more common to us and that sounded more domestic, since those occur much more frequently in this dataset. For example, are two records containing the name Michael Grant the same entity?

Additionally, we noticed that there were a lot of variations in the way that middle names were captured. Sometimes they were concatenated with the first name, other times with the last name. We also observed many pairs that could have been either nicknames for one person or references to separate entities: KIM ASKEW vs. KIMBERLEY ASKEW and Kathy Edwards vs. Katherine Edwards (and yes, dedupe does preserve variations in case). On the other hand, since nicknames generally appear only in people's first names, when we did see a short version of a first name paired with an unusual or rare last name, we were more confident in labeling those as a match.

Other things that made the labeling easier were clearly gendered names (e.g. Brian Murphy vs. Briana Murphy), which helped us to identify separate entities in spite of very small differences in the strings. Some names appeared to be clear misspellings, which also made us more confident in our labeling two references as matches for a single entity (Davifd Culp vs. David Culp). There were also a few potential easter eggs in the dataset, which we suspect might actually be aliases (Jon Doe and Ben Jealous).
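
To illustrate why these judgment calls could not be left to string distance alone, here is a small sketch using Python's standard difflib. The similarity helper is a hypothetical illustration, not part of dedupe, and dedupe's own comparators work differently:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("Davifd Culp", "David Culp"),      # likely a misspelling of one person
    ("Brian Murphy", "Briana Murphy"),  # likely two different people
    ("KIM ASKEW", "KIMBERLEY ASKEW"),   # nickname or separate entity?
]
scores = {pair: similarity(*pair) for pair in pairs}

for (a, b), score in scores.items():
    print('{} vs. {}: {:.2f}'.format(a, b, score))
```

The first two pairs both score above 0.9 even though we labeled one a match and the other two distinct people, while the nickname pair scores lower than either.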

One of the things we discovered upon multiple runs of the active learning process is that the number of fields the user nominates to Dedupe has a great impact on the kinds of predicate blocks that are generated during the initial blocking phase, and thus on the comparisons that are presented to the trainer during the active learning phase. In one of our runs, we used only the last name, first name, and meeting location fields. Some of the comparisons were easy:

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished

Some were hard:

lastname : Desimone
firstname : Daniel
meeting_loc : OEOB

lastname : DeSimone
firstname : Daniel
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


What we realized from this is that there are two different kinds of duplicates that appear in our dataset. The first kind of duplicate is one that is generated via (likely mistaken) duplicate visitor request forms. We noticed that these duplicate entries tended to be proximal to each other in terms of visitor_id number and to have the same meeting location and the same uin (which, confusingly, is not a unique guest identifier but appears to be assigned to every visitor within a unique tour group). The second kind of duplicate is what we think of as the frequent flier: people who seem to spend a lot of time at the White House, like staffers and other political appointees.

During the dedupe process, we computed there were 332,606 potential duplicates within the data set of 1,048,576 entities. For this particular data, we would expect these kinds of figures, knowing that people visit for repeat business or social functions.

Within-Visit Duplicates

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671

Across-Visit Duplicates (Frequent Fliers)

lastname : TANGHERLINI
meeting_loc : OEOB
firstname : DANIEL
uin : U02692

lastname : TANGHERLINI
meeting_loc : NEOB
firstname : DANIEL
uin : U73085

lastname : ARCHULETA
meeting_loc : WH
firstname : KATHERINE
uin : U68121

lastname : ARCHULETA
meeting_loc : OEOB
firstname : KATHERINE
uin : U76331
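
The distinction between the two kinds of duplicates can be sketched as a simple rule. This is an illustration only: duplicate_kind is a hypothetical helper, it assumes the pair has already been judged to share a name, and it ignores the visitor_id proximity signal because the records shown above do not include that field:

```python
def duplicate_kind(a, b):
    """Classify a pair of records that share a name.

    Returns 'within-visit' when uin and meeting_loc both match,
    'across-visit' when the names match but the visit details differ,
    and None when the names differ.
    """
    same_name = (a['firstname'].lower() == b['firstname'].lower()
                 and a['lastname'].lower() == b['lastname'].lower())
    if not same_name:
        return None
    if a['uin'] == b['uin'] and a['meeting_loc'] == b['meeting_loc']:
        return 'within-visit'
    return 'across-visit'

# Records copied from the examples above.
ryan_1 = {'lastname': 'Ryan', 'firstname': 'Patrick',
          'meeting_loc': 'OEOB', 'uin': 'U62671'}
ryan_2 = dict(ryan_1)
tang_1 = {'lastname': 'TANGHERLINI', 'firstname': 'DANIEL',
          'meeting_loc': 'OEOB', 'uin': 'U02692'}
tang_2 = {'lastname': 'TANGHERLINI', 'firstname': 'DANIEL',
          'meeting_loc': 'NEOB', 'uin': 'U73085'}

print(duplicate_kind(ryan_1, ryan_2))
print(duplicate_kind(tang_1, tang_2))
```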



In this beginner's guide to Entity Resolution, we learned what it means to identify entities and their possible duplicates within and across records. To further examine this data beyond the scope of this blog post, we would like to determine which records are true duplicates. This would require additional information to canonicalize these entities, thus allowing for potential indexing of entities for future assessments. Ultimately we discovered the importance of entity resolution across a variety of domains, such as counter-terrorism, customer databases, and voter registration.

Please return to the District Data Labs blog for upcoming posts on entity resolution and discussion about a number of other important topics to the data science community. Upcoming post topics from our research group include string matching algorithms, data preparation, and entity identification!

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

Continue Reading…


Read More

R Packages worth a look

R Interface to the Yacas Computer Algebra System (Ryacas)
An interface to the yacas computer algebra system.

Deep Forest Model (gcForest)
R application programming interface (API) for Deep Forest, which is based on Zhou and Feng (2017). Deep Forest: Towards an Alternative to Deep Neural Netwo …

Strategy Estimation (stratEst)
Implements variants of the strategy frequency estimation method by Dal Bo & Frechette (2011) <doi:10.1257/aer.101.1.411>, including its adapt …

Continue Reading…


Read More

October 19, 2018

If you did not already know

YCML
A Machine Learning framework for Objective-C and Swift (OS X / iOS) …

Gaussian Process Autoregressive Regression Model (GPAR)
Multi-output regression models must exploit dependencies between outputs to maximise predictive performance. The application of Gaussian processes (GPs) to this setting typically yields models that are computationally demanding and have limited representational power. We present the Gaussian Process Autoregressive Regression (GPAR) model, a scalable multi-output GP model that is able to capture nonlinear, possibly input-varying, dependencies between outputs in a simple and tractable way: the product rule is used to decompose the joint distribution over the outputs into a set of conditionals, each of which is modelled by a standard GP. GPAR’s efficacy is demonstrated on a variety of synthetic and real-world problems, outperforming existing GP models and achieving state-of-the-art performance on the tasks with existing benchmarks. …

Randomized Weighted Majority Algorithm (RWMA)
The randomized weighted majority algorithm is an algorithm in machine learning theory. It improves the mistake bound of the weighted majority algorithm. Imagine that every morning before the stock market opens, we get a prediction from each of our ‘experts’ about whether the stock market will go up or down. Our goal is to somehow combine this set of predictions into a single prediction that we then use to make a buy or sell decision for the day. The RWMA gives us a way to do this combination such that our prediction record will be nearly as good as that of the single best expert in hindsight. “Weighted Majority Algorithm”
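
As a rough illustration of the idea, here is a minimal sketch of the standard textbook form of the algorithm: keep one weight per expert, follow a randomly chosen expert with probability proportional to its weight, then shrink the weight of every expert that was wrong. The function name rwma, the penalty factor beta, and the seed parameter are assumptions made for this example:

```python
import random

def rwma(expert_predictions, outcomes, beta=0.5, seed=0):
    """Run the randomized weighted majority scheme over a series of rounds.

    expert_predictions: one list per round, holding each expert's prediction.
    outcomes: the true outcome of each round.
    Returns (number of mistakes made, final expert weights).
    """
    rng = random.Random(seed)
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        # Follow one expert, chosen with probability proportional to weight.
        chosen = rng.choices(range(n_experts), weights=weights)[0]
        if preds[chosen] != outcome:
            mistakes += 1
        # Penalize every expert that predicted wrongly this round.
        weights = [w * beta if p != outcome else w
                   for w, p in zip(weights, preds)]
    return mistakes, weights

# Two experts over four mornings: the first is always right, the second never.
rounds = [["up", "down"]] * 4
mistakes, weights = rwma(rounds, ["up"] * 4)
print(mistakes, weights)
```

After a few rounds almost all of the weight sits on the reliable expert, so the combined prediction follows that expert most of the time.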

Continue Reading…


Read More

Book Memo: “Machine Learning Using R”

With Time Series and Industry-Based Use Cases in R
Examine the latest technological advancements in building a scalable machine-learning model with big data using R. This second edition shows you how to work with a machine-learning algorithm and use it to build a ML model from raw data. You will see how to use R programming with TensorFlow, thus avoiding the effort of learning Python if you are only comfortable with R. As in the first edition, the authors have kept the fine balance of theory and application of machine learning through various real-world use-cases which gives you a comprehensive collection of topics in machine learning. New chapters in this edition cover time series models and deep learning.

Continue Reading…


Read More

An Intuitive Guide to Financial Analysis with Data Transformations

With regard to the analysis of financial markets, there exist two major schools of thought: fundamental analysis and technical analysis. Fundamental analysis focuses on understanding the intrinsic value of a company based on information such as quarterly financial statements, cash flow, and other information about an industry in general. The goal is to discover and […]

Continue Reading…


Read More

Validating UTF-8 bytes using only 0.45 cycles per byte (AVX edition)

When receiving bytes from the network, we often assume that they are unicode strings, encoded using something called UTF-8. Sadly, not all streams of bytes are valid UTF-8. So we need to check the strings. It is probably a good idea to optimize this problem as much as possible.

In earlier work, we showed that you could validate a string using as little as 0.7 cycles per byte, using commonly available 128-bit SIMD registers (in C). SIMD stands for Single-Instruction-Multiple-Data; it is a way to parallelize the processing on a single core.

What if we use 256-bit registers instead?

Reference naive function       10 cycles per byte
fast SIMD version (128-bit)    0.7 cycles per byte
new SIMD version (256-bit)     0.45 cycles per byte

That’s good: roughly 1.5 times as fast as the 128-bit version.

A common scenario is that you receive ASCII characters as input. It is much faster to check that a string is made of ASCII characters than to check that it is made of valid UTF-8 characters. Indeed, to check that it is made of ASCII characters, you only have to check that one bit per byte is zero (since ASCII uses only 7 bits per byte).

It turns out that only about 0.05 cycles per byte are needed to check that a string is made of ASCII characters. Maybe up to 0.08 cycles per byte. That makes us look bad.

You could start checking the file for ASCII characters and then switch to our function when non-ASCII characters are found, but this has a problem: what if the string starts with a non-ASCII character followed by a long stream of ASCII characters?

A quick solution is to add an ASCII path. Each time we read a block of 32 bytes, we check whether it is made of 32 ASCII characters, and if so, we take a different (fast) path. Thus if it happens frequently that we have long streams of ASCII characters, we will be quite fast.
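
The control flow of that ASCII path can be sketched in Python. This is only an illustration of the logic, not the SIMD code: the names validate_utf8 and is_ascii_block are made up for this example, and the full check is delegated to Python's incremental UTF-8 decoder rather than to vector instructions.

```python
import codecs

BLOCK = 32  # bytes per block, matching a 256-bit register

def is_ascii_block(block):
    # ASCII uses only 7 bits, so the high bit of every byte must be zero.
    return all(b & 0x80 == 0 for b in block)

def validate_utf8(data):
    """Return True if data is valid UTF-8, taking a fast path on ASCII blocks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        for start in range(0, len(data), BLOCK):
            block = data[start:start + BLOCK]
            pending, _ = decoder.getstate()
            # Fast path: an all-ASCII block needs no further checking,
            # provided no multi-byte sequence is still waiting to be completed.
            if not pending and is_ascii_block(block):
                continue
            decoder.decode(block)  # slow path: full validation
        decoder.decode(b"", final=True)  # reject a truncated trailing sequence
    except UnicodeDecodeError:
        return False
    return True

print(validate_utf8(b"plain ascii " * 10))
print(validate_utf8("na\u00efve caf\u00e9 ".encode("utf-8") * 10))
print(validate_utf8(b"\xff" + b"a" * 40))
```

Because the decision is made per block, a string that starts with a non-ASCII character still takes the fast path for every later all-ASCII block.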

The new numbers are quite appealing when running benchmarks on ASCII characters:

new SIMD version (256-bit)                   0.45 cycles per byte
new SIMD version (256-bit), w. ASCII path    0.088 cycles per byte
ASCII check (SIMD + 256-bit)                 0.051 cycles per byte

My code is available.

Continue Reading…


Read More

Start your journey into data science today

Springboard’s Introduction to Data Science Course will help you build a strong foundation in R programming, communicate effectively by telling a story with data, clean and analyze large datasets, and more. Apply before Oct 22 and use code KDNUGGETSOCT500 for $500 Data Science Career Track.

Continue Reading…


Read More

Statistics Challenge Invites Students to Tackle Opioid Crisis Using Real-World Data

(This article was first published on, and kindly contributed to R-bloggers)

In 2016, 2.1 million Americans were found to have an opioid use disorder (according to SAMHSA), with drug overdose now the leading cause of injury and death in the United States. But some of the country’s top minds are working to fight this epidemic, and statisticians are helping to lead the charge. 

In This is Statistics’ second annual fall data challenge, high school and undergraduate students will use statistics to analyze data and develop recommendations to help address this important public health crisis. 

The contest invites teams of two to five students to put their statistical and data visualization skills to work using the Centers for Disease Control and Prevention (CDC)’s Multiple Cause of Death (Detailed Mortality) data set, and contribute to creating healthier communities. Given the size and complexity of the CDC dataset, programming languages such as R can be used to manipulate and conduct analysis effectively.

Each submission will consist of a short essay and presentation of recommendations. Winners will be awarded for best overall analysis, best visualization and best use of external data. Submissions are due November 12, 2018.

If you or a student you know is interested in participating, get full contest details here

Teachers, get resources about how to engage your students in the contest here.

To leave a comment for the author, please follow the link and comment on their blog: offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Continue Reading…


Read More

Will Models Rule the World? Data Science Salon Miami, Nov 6-7

This post is excerpted from the thoughts of Data Science Salon Miami speakers on the future of model-based decision-making.

Continue Reading…


Read More

Is your Precision Medicine AI ready for the Bio-Psycho-Socio-Cultural Patient?

The data necessary to account for every aspect of our human complexity poses significant challenges to health AI systems. There’s certainly no way around data science to get a hold of it – but don’t you count physicians out too soon! Precision Medicine bears the promise to bring highly individualized

The post Is your Precision Medicine AI ready for the Bio-Psycho-Socio-Cultural Patient? appeared first on Dataconomy.

Continue Reading…


Read More

Gold-Mining Week 7 (2018)

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 7 Gold Mining and Fantasy Football Projection Roundup now available. Go check out our cheat sheet for this week.

The post Gold-Mining Week 7 (2018) appeared first on Fantasy Football Analytics.


Continue Reading…


Read More

Holy Grail of AI for Enterprise — Explainable AI

Explainable AI (XAI) is an emerging branch of AI where AI systems are made to explain the reasoning behind every decision made by them. We investigate some of its key benefits and design principles.

Continue Reading…


Read More

New Course: Interactive Data Visualization with rbokeh

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Here is the course link.

Course Description

Data visualization is an integral part of the data analysis process. This course will get you introduced to rbokeh: a visualization library for interactive web-based plots. You will learn how to use rbokeh layers and options to create effective visualizations that carry your message and emphasize your ideas. We will focus on the two main pieces of data visualization: wrangling data in the appropriate format as well as employing the appropriate visualization tools, charts and options from rbokeh.

Chapter 1: rbokeh Introduction (Free)

In this chapter we get introduced to rbokeh layers. You will learn how to specify data and arguments to create the desired plot and how to combine multiple layers in one figure.

Chapter 2: rbokeh Aesthetic Attributes and Figure Options

In this chapter you will learn how to customize your rbokeh figures using aesthetic attributes and figure options. You will see how aesthetic attributes such as color, transparency and shape can serve a purpose and add more info to your visualizations. In addition, you will learn how to activate the tooltip and specify the hover info in your figures.

Chapter 3: Data Manipulation for Visualization and More rbokeh Layers

In this chapter, you will learn how to put your data in the right format to fit the desired figure and how to transform between the wide and long formats. You will also see how to combine normal layers with regression lines. In addition, you will learn how to customize the interaction tools that appear with each figure.

Chapter 4: Grid Plots and Maps

In this chapter you will learn how to combine multiple plots in one layout using grid plots. In addition, you will learn how to create interactive maps.



Continue Reading…


Read More

New Course: Visualization Best Practices in R

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Here is the course link.

Course Description

This course will help you take your data visualization skills beyond the basics and hone them into a powerful member of your data science toolkit. Over the lessons we will use two interesting open datasets to cover different types of data (proportions, point-data, single distributions, and multiple distributions) and discuss the pros and cons of the most common visualizations. In addition, we will cover some less common alternative visualizations for the data types and how to tweak default ggplot settings to most efficiently and effectively get your message across.

Chapter 1: Proportions of a whole (Free)

In this chapter, we focus on visualizing proportions of a whole; we see that pie charts really aren’t so bad, along with discussing the waffle chart and stacked bars for comparing multiple proportions.

Chapter 2: Point data

We shift our focus now to single-observation or point data and go over when bar charts are appropriate and when they are not, what to use when they are not, and general perception-based enhancements for your charts.

Chapter 3: Single distributions

We now move on to visualizing distributional data. We expose the fragility of histograms and discuss when it is better to shift to a kernel density plot and how to make both plots work best for your data.

Chapter 4: Comparing distributions

Finishing off, we take a look at comparing multiple distributions to each other. We see why the traditional box plots are very dangerous and how to easily improve them, along with investigating when you should use more advanced alternatives like the beeswarm plot and violin plots.



Continue Reading…


Read More

The Intuitions Behind Bayesian Optimization with Gaussian Processes

Bayesian Optimization adds a Bayesian methodology to the iterative optimizer paradigm by incorporating a prior model on the space of possible target functions. This article introduces the basic concepts and intuitions behind Bayesian Optimization with Gaussian Processes.

Continue Reading…


Read More

An actual quote from a paper published in a medical journal: “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure.”

Someone writes:

So the NYT yesterday has a story about this study I am directed to it and am immediately concerned about all the things that make this study somewhat dubious. Forking paths in the definition of the independent variable, sample selection in who wore the accelerometers, ignorance of the undoubtedly huge importance of interactions in the controls, etc, etc. blah blah blah. But I am astonished at the bald statement at the start of the study: “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure.” Why shouldn’t everyone, including the NYT, stop reading right there? How does a journal accept the article? The dataset itself is public and they didn’t create it! They’re just saying Fuck You.

I was, like, Really? So I followed the link. And, indeed, here it is:

The Journal of the American Heart Association published this? And the New York Times promoted it?

As a heart patient myself, I’m annoyed. I’d give it a subliminal frowny face, but I don’t want to go affecting your views on immigration.

P.S. My correspondent adds:

By the way, I started Who is Rich? this week and it’s great.

P.P.S. The above all happened half a year ago. Today my post appeared, and then I received a note from Joseph Hilgard informing me that this statement, “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure,” apparently is a technical requirement of TOP when the data are already publicly available—not a defiant statement from the authors. Hilgard also informed me that TOP is “Transparency and Openness Promotion guidelines. Journal-level standards for how firm a journal wants to be about requesting data sharing, code, etc.”

I remain baffled as to why, if the data are already publicly available, you couldn’t just say, “The data are already publicly available,” and also why you have to say that the analytic methods and study materials will not be made available. I can believe that this is a requirement of the journal. Various organizations have various screwy requirements, there are millions of forms to be filled out and hoops to be jumped through, etc. And the end result—no details on how the data were processed, no code, etc.—that’s not good in any case. It should be easy to have reproducible research when the data are public.

It’s amazing how fast standards have changed. Back when we published our Red State Blue State book, ten years ago, we didn’t even think of posting all our data and code. Partly because it was such a mess, with five coauthors doing different things at different times, but also because this was not something that was usually done. I felt we were ahead of the game by including careful descriptions of our methods in the notes section at the end of the book. But there’s a big gap between my written descriptions and all the details of what we did. When it comes to scientific communication, things have changed for the better.

Let’s just hope that the Center for Open Science and the Journal of the American Heart Association can fix this particular bug, which seems at least in this case to have encouraged researchers to not make their methods and study materials available.

The post An actual quote from a paper published in a medical journal: “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

McKinsey Datathon: The City Cup, 17 November, Amsterdam, Stockholm and Zurich. Apply Now

While solving the challenge, you will gain insights into the types of problems that McKinsey Data Scientists solve daily to help their clients. Top prize is 5K Euro + conference attendance of your choice.

Continue Reading…


Read More

Education disparities

There are many racial disparities in education. ProPublica shows estimates for the gaps:

Based on civil rights data released by the U.S. Department of Education, ProPublica has built an interactive database to examine racial disparities in educational opportunities and school discipline. Look up more than 96,000 individual public and charter schools and 17,000 districts to see how they compare with their counterparts.

Using white students as the baseline, compare opportunity, discipline, segregation, and achievement for black and Hispanic students.

Be sure to click through to a school district or state of interest to see more detailed breakdowns of the measures.


Continue Reading…


Read More

Four short links: 19 October 2018

PDF to Data Frame, Clever Story, Conceptual Art, and Automatic Patch Synthesis

  1. Camelot -- Python library that extracts tables of data from PDF documents, returning them as Pandas frames.
  2. STET -- short story told via footnotes, editorial markup, and more. Magnificent! (via Cory Doctorow)
  3. Solving Sol -- interpreting a conceptual artist's art as instructions, reframed as an AI problem. Clever!
  4. Human-Competitive Patches with Repairnator -- Repairnator is a bot. It constantly monitors software bugs discovered during continuous integration of open source software and tries to fix them automatically. If it succeeds to synthesize a valid patch, Repairnator proposes the patch to the human developers, disguised under a fake human identity. To date, Repairnator has been able to produce five patches that were accepted by the human developers and permanently merged in the code base.


Loops and Pizzas

(This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

An Introduction to Loops in R

Loops in R

First, if you are new to programming, you should know that loops are a
way to tell the computer that you want to repeat some operation a
number of times. This is a very common task that can be found in many
programming languages. For example, let’s say you invited five friends
for dinner at your home and the whole cost of four pizzas will be split
evenly. Assume now that you must give instructions to a computer on
calculating how much each one will pay at the end of dinner. For that,
you need to sum up the individual pizza costs and divide by the number
of people. Your instructions to the computer could be: start with a
value of x=zero, take each individual pizza cost and add it to x until
all costs are processed, then divide the result by the number of friends
at the end.

The great thing about a loop is that its length is set dynamically.
Using the previous example, if we had 500 friends (and a large
dinner table!), we could use the same instructions for calculating the
individual tabs. That means we can encapsulate a generic procedure for
processing any given number of friends at dinner. With it, you have at
your reach a tool for the execution of any sequential process. In other
words, you are the boss of your computer and, as long as you can write
it down clearly, you can set it to do any kind of repeated task for you.

Now, about the code, we could write the solution to the pizza problem
in R as:

pizza.costs <- c(50, 80, 30, 60) # each cost of pizza
n.friends <- 5 # number of friends

x <- 0 # set first cost to zero
for (i.cost in pizza.costs) {
  x <- x + i.cost # add this pizza's cost to the running total
}

x <- x/n.friends # divide the total by the number of friends
print(x)

## [1] 44

Don’t worry if you didn’t understand the code. We’ll get to the
structure of a loop soon.

Back to our case, each friend would pay 44 for the meal. We can check
the result against function sum:

x == sum(pizza.costs)/n.friends

## [1] TRUE

The output TRUE shows that the results are equal.
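
As a small sketch of the earlier point that the same instructions work for any number of costs and friends, the loop can be wrapped in a function. The name bill.per.person is hypothetical and introduced here only for illustration, as are the 400 pizzas at 40 each in the second call.

```r
# bill.per.person is a hypothetical helper, not part of the original post
bill.per.person <- function(costs, n.people) {
  total <- 0 # running total starts at zero
  for (i.cost in costs) {
    total <- total + i.cost # add each cost in turn
  }
  total/n.people # amount owed by each person
}

# the dinner from the example: four pizzas, five friends
small.dinner <- bill.per.person(c(50, 80, 30, 60), 5)

# a much larger (made-up) dinner: 400 pizzas at 40 each, 500 friends
large.dinner <- bill.per.person(rep(40, 400), 500)

print(small.dinner)
print(large.dinner)
```

The body of the loop is unchanged between the two calls; only the length of the input vector differs.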

The Structure of a Loop

Knowing how to use loops can be a powerful ally in a complex data
related problem. Let’s talk more about how loops are defined in R. The
structure of a loop in R follows:

for (i in i.vec){
  # commands to run in each iteration
}

In the previous code, command for indicates the beginning of a loop.
Object i in (i in i.vec) is the iterator of the loop. This
iterator will change its value in each iteration, taking each individual
value contained in i.vec. Note the loop is encapsulated by curly
braces ({}). These are important, as they define where the loop
starts and where it ends. Indentation (use of bigger margins) is
also helpful as a visual cue, but R does not require it. Consider the
following practical example:

# set seq
my.seq <- seq(-5,5)

# do loop
for (i in my.seq){
  cat(paste('\nThe value of i is', i))
}

## The value of i is -5
## The value of i is -4
## The value of i is -3
## The value of i is -2
## The value of i is -1
## The value of i is 0
## The value of i is 1
## The value of i is 2
## The value of i is 3
## The value of i is 4
## The value of i is 5

In the code, we created a sequence from -5 to 5 and printed a text for
each element with the cat function. Notice how we also started a new
line of output with '\n'. The loop starts with i=-5, executes command
cat(paste('\nThe value of i is', -5)), proceeds to the next iteration
by setting i=-4, reruns the cat command, and so on. At its final
iteration, the value of i is 5.

The iterated sequence in the loop is not exclusive to numerical
vectors. Any type of vector or list may be used. See next:

# set char vec
my.char.vec <- letters[1:5]

# loop it!
for (i.char in my.char.vec){
  cat(paste('\nThe value of i.char is', i.char))
}

## The value of i.char is a
## The value of i.char is b
## The value of i.char is c
## The value of i.char is d
## The value of i.char is e

The same goes for lists:

# set list
my.l <- list(x = 1:5, 
             y = c('abc','dfg'), 
             z = factor(c('A','B','C','D')))

# loop list
for (i.l in my.l){
  cat(paste0('\nThe class of i.l is ', class(i.l), '. '))
  cat(paste0('The number of elements is ', length(i.l), '.'))
}

## The class of i.l is integer. The number of elements is 5.
## The class of i.l is character. The number of elements is 2.
## The class of i.l is factor. The number of elements is 4.

In the definition of loops, the iterator does not have to be the only
object incremented in each iteration. We can create other objects and
increment them using a simple sum operation. See next:

# set vec and iterators
my.vec <- seq(1, 5)
my.x <- 5
my.z <- 10

for (i in my.vec){
  # iterate "manually"
  my.x <- my.x + 1
  my.z <- my.z + 2
  cat('\nValue of i = ', i, 
      ' | Value of my.x = ', my.x, 
      ' | Value of my.z = ', my.z)
}

## Value of i =  1  | Value of my.x =  6  | Value of my.z =  12
## Value of i =  2  | Value of my.x =  7  | Value of my.z =  14
## Value of i =  3  | Value of my.x =  8  | Value of my.z =  16
## Value of i =  4  | Value of my.x =  9  | Value of my.z =  18
## Value of i =  5  | Value of my.x =  10  | Value of my.z =  20

Nested loops, that is, a loop inside another loop, are also
possible. See the following example, where we print all the elements
of a matrix:

# set matrix
my.mat <- matrix(1:9, nrow = 3)

# loop all values of matrix
for (i in seq_len(nrow(my.mat))){
  for (j in seq_len(ncol(my.mat))){
    cat(paste0('\nElement [', i, ', ', j, '] = ', my.mat[i,j]))
  }
}

## Element [1, 1] = 1
## Element [1, 2] = 4
## Element [1, 3] = 7
## Element [2, 1] = 2
## Element [2, 2] = 5
## Element [2, 3] = 8
## Element [3, 1] = 3
## Element [3, 2] = 6
## Element [3, 3] = 9

A Real World Example

Now, the computational needs of the real world are far more complex than
dividing a dinner expense. A practical example of using loops is
processing data according to groups. Using an example from Finance, if
we have a return dataset for several stocks and we want to calculate the
average return of each stock, we can use a loop for that. In this
example, we will use Yahoo Finance data from three stocks: FB, GE and
AA. The first step is downloading it with package BatchGetSymbols.

library(BatchGetSymbols)


## Loading required package: rvest

## Loading required package: xml2

## Loading required package: dplyr

## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##     filter, lag

## The following objects are masked from 'package:base':
##     intersect, setdiff, setequal, union


my.tickers <-  c('FB', 'GE', 'AA')

df.stocks <- BatchGetSymbols(tickers = my.tickers, 
                             first.date = '2012-01-01', 
                             freq.data = 'yearly')[[2]]

## Running BatchGetSymbols for:
##    tickers = FB, GE, AA
##    Downloading data for benchmark ticker | Found cache file
## FB | yahoo (1|3) | Found cache file - Good job!
## GE | yahoo (2|3) | Found cache file - Nice!
## AA | yahoo (3|3) | Found cache file - You got it!

It worked fine. Let’s check the contents of the dataframe:

dplyr::glimpse(df.stocks)


## Observations: 21
## Variables: 10
## $ ticker               "AA", "AA", "AA", "AA", "AA", "AA", "AA", ...
## $ ref.date             2012-01-03, 2013-01-02, 2014-01-02, 2015-...
## $ volume               2217410500, 2149575500, 2146821400, 268355...
## $ price.open           21.48282, 21.33864, 25.30359, 38.13561, 22...
## $ price.high           25.85628, 25.68807, 42.29280, 41.01921, 32...
## $ price.low            19.27206, 18.50310, 24.27030, 18.79146, 16...
## $ price.close          22.17969, 21.60297, 25.30359, 38.15964, 23...
## $ price.adjusted       20.89342, 20.62187, 24.48568, 37.24207, 23...
## $ ret.adjusted.prices  NA, -0.01299715, 0.18736494, 0.52097326, -...
## $ ret.closing.prices   NA, -0.02600212, 0.17130149, 0.50807215, -...

All financial data is there. Notice that the return series is available
at column ret.adjusted.prices.

Now we will use a loop to build a table with the mean return of each
stock:

# find unique tickers in column ticker
unique.tickers <- unique(df.stocks$ticker)

# create empty df
tab.out <- data.frame()

# loop tickers
for (i.ticker in unique.tickers){
  # create temp df with ticker i.ticker
  temp <- df.stocks[df.stocks$ticker==i.ticker, ]
  # row bind i.ticker and mean.ret
  tab.out <- rbind(tab.out, 
                   data.frame(ticker = i.ticker,
                              mean.ret = mean(temp$ret.adjusted.prices, na.rm = TRUE)))
}

# print result
print(tab.out)

##   ticker   mean.ret
## 1     AA 0.24663684
## 2     FB 0.35315566
## 3     GE 0.06784693

In the code, we used function unique to find the names of all the
tickers in the dataset. Next, we created an empty dataframe to
save the results and a loop to filter the data of each stock
sequentially and average its returns. At the end of each iteration, we
use function rbind to append the result for that stock to the main
table. As you can see, a loop lets us perform calculations by group.

By now, I must be upfront in saying that the previous loop is by no
means the best way of performing the data operation. What we just did
with loops is called a split-apply-combine procedure. There are base
functions in R, such as tapply, split and lapply/sapply, that can
do the same job with a more intuitive and functional approach. Going
further, functions from package tidyverse can do the same procedure
with an even more intuitive approach. In a future post I shall discuss
these possibilities further.
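
As a rough sketch of the base R alternatives mentioned above, the same group mean can be computed without an explicit loop. The data frame below is made up for illustration (it is not the Yahoo Finance data), and all object names are hypothetical.

```r
# Made-up yearly returns for three tickers, with one missing value
df <- data.frame(
  ticker = rep(c('AA', 'FB', 'GE'), each = 3),
  ret    = c(0.10, -0.05, 0.20, 0.30, 0.15, NA, 0.02, 0.04, -0.01)
)

# tapply: apply mean() to ret within each ticker group
mean.by.tapply <- tapply(df$ret, df$ticker, mean, na.rm = TRUE)

# split + sapply: split ret into a list by ticker, then average each piece
mean.by.sapply <- sapply(split(df$ret, df$ticker), mean, na.rm = TRUE)

# aggregate: returns a data frame; the formula interface drops NA rows by default
tab.mean <- aggregate(ret ~ ticker, data = df, FUN = mean)

print(mean.by.tapply)
print(mean.by.sapply)
print(tab.mean)
```

All three calls split the returns by ticker, apply mean to each piece and combine the results, which is what the loop did by hand.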

I hope you guys liked the post. Got a question? Just drop it at the
comment section.


Document worth reading: “Review of Deep Learning”

In recent years, China, the United States and other countries, Google and other high-tech companies have increased investment in artificial intelligence. Deep learning is one of the current artificial intelligence research’s key areas. This paper analyzes and summarizes the latest progress and future research directions of deep learning. Firstly, three basic models of deep learning are outlined, including multilayer perceptrons, convolutional neural networks, and recurrent neural networks. On this basis, we further analyze the emerging new models of convolution neural networks and recurrent neural networks. This paper then summarizes deep learning’s applications in many areas of artificial intelligence, including voice, computer vision, natural language processing and so on. Finally, this paper discusses the existing problems of deep learning and gives the corresponding possible solutions. Review of Deep Learning


If you did not already know

Factorized Adversarial Network (FAN)
In this paper, we propose Factorized Adversarial Networks (FAN) to solve unsupervised domain adaptation problems for image classification tasks. Our networks map the data distribution into a latent feature space, which is factorized into a domain-specific subspace that contains domain-specific characteristics and a task-specific subspace that retains category information, for both source and target domains, respectively. Unsupervised domain adaptation is achieved by adversarial training to minimize the discrepancy between the distributions of two task-specific subspaces from source and target domains. We demonstrate that the proposed approach outperforms state-of-the-art methods on multiple benchmark datasets used in the literature for unsupervised domain adaptation. Furthermore, we collect two real-world tagging datasets that are much larger than existing benchmark datasets, and get significant improvement upon baselines, proving the practical value of our approach. …

Reinforced Continual Learning
Most artificial intelligence models have limiting ability to solve new tasks faster, without forgetting previously acquired knowledge. The recently emerging paradigm of continual learning aims to solve this issue, in which the model learns various tasks in a sequential fashion. In this work, a novel approach for continual learning is proposed, which searches for the best neural architecture for each coming task via sophisticatedly designed reinforcement learning strategies. We name it as Reinforced Continual Learning. Our method not only has good performance on preventing catastrophic forgetting but also fits new tasks well. The experiments on sequential classification tasks for variants of MNIST and CIFAR-100 datasets demonstrate that the proposed approach outperforms existing continual learning alternatives for deep networks. …

Reliability Modelling
Reliability modeling is the process of predicting or understanding the reliability of a component or system prior to its implementation. Two types of analysis that are often used to model a complete system’s availability behavior (including effects from logistics issues like spare part provisioning, transport and manpower) are Fault Tree Analysis and reliability block diagrams. At a component level, the same types of analyses can be used together with others. The input for the models can come from many sources including testing; prior operational experience; field data; as well as data handbooks from similar or related industries. Regardless of source, all model input data must be used with great caution, as predictions are only valid in cases where the same product was used in the same context. As such, predictions are often only used to help compare alternatives. …


Revised and Extended Remarks at "The Rise of Intelligent Economies and the Work of the IMF"

Attention conservation notice: 2700+ words elaborating a presentation from a non-technical conference about AI, where the conversation devolved to "blockchain" within an hour; includes unexplained econometric jargon. Life is short, and you should have more self-respect.

I got asked to be a panelist at a November 2017 symposium at the IMF on machine learning, AI and what they can do to/for the work of the Fund and its sister organizations, specifically the work of its economists. What follows is an amplification and rationalization of my actual remarks. It is also a reconstruction, since my notes were on an only-partially-backed-up laptop stolen in the next month. (Roman thieves are perhaps the most dedicated artisans in Italy, plying their trade with gusto on Christmas Eve.) Posted now because reasons.

On the one hand, I don't have any products to sell, or even much of a consulting business to promote, so I feel a little bit out of place. But against that, there aren't many other people who work on machine learning who read macro and development economics for fun, or have actually estimated a DSGE model from data, so I don't feel totally fraudulent up here.

We've been asked to talk about AI and machine learning, and how they might impact the work of the Fund and related multi-lateral organizations. I've never worked for the Fund or the World Bank, but I do understand a bit about how you economists work, and it seems to me that there are three important points to make: a point about data, a point about models, and a point about intelligence. The first of these is mostly an opportunity, the second is an opportunity and a clarification, and the third is a clarification and a criticism --- so you can tell I'm an academic by taking the privilege of ending on a note of skepticism and critique, rather than being inspirational.

I said my first point is about data --- in fact, it's about what, a few turns of the hype cycle ago, we'd have called "big data". Economists at the Fund typically rely for data on the output of official statistical agencies from various countries. This is traditional, this sort of reliance on the part of economists actually pre-dates the Bretton Woods organizations, and there are good reasons for it. With a few notable exceptions, those official statistics are prepared very carefully, with a lot of effort going into making them both precise and accurate, as well as comparable over time and, increasingly, across countries.

But even these official statistics have their issues, for the purposes of the Fund: they are slow, they are noisy, and they don't quite measure what you want them to.

The issue of speed is familiar: they come out annually, maybe quarterly or monthly. This rate is pretty deeply tied to the way the statistics are compiled, which in turn is tied to their accuracy --- at least for the foreseeable future. It would be nice to be faster.

The issue of noise is also very real. Back in 1950, the great economist Oskar Morgenstern, the one who developed game theory with John von Neumann, wrote a classic book called On the Accuracy of Economic Observations, where he found a lot of ingenious ways of checking the accuracy of official statistics, e.g., looking at how badly they violated accounting identities. To summarize very crudely, he concluded that lots of those statistics couldn't possibly be accurate to better than 10%, maybe 5% --- and this was for developed countries with experienced statistical agencies. I'm sure that things are better now --- I'm not aware of anyone exactly repeating his efforts, but it'd be a worthwhile exercise --- maybe the error is down to 1%, but that's still a lot, especially to base policy decisions on.

The issue of measurement is the subtlest one. I'm not just talking about measurement noise now. Instead, it's that the official statistics are often tracking variables which aren't quite what you want[1]. Your macroeconomic model might, for example, need to know about the quantity of labor available for a certain industry in a certain country. But the theory in that model defines "quantity of labor" in a very particular way. The official statistical agencies, on the other hand, will have their own measurements of "quantity of labor", and none of those need to have exactly the same definitions. So even if we could magically eliminate measurement errors, just plugging the official value for "labor" into your model isn't right; that's just an approximate, correlated quantity.

So: official statistics, which is what you're used to using, are the highest-quality statistics, but they're also slow, noisy, and imperfectly aligned with your models. There hasn't been much to be done about that for most of the life of the Fund, though, because what was your alternative?

What "big data" can offer is the possibility of a huge number of noisy, imperfect measures. Computer engineers --- the people in hardware and systems and databases, not in machine learning or artificial intelligence --- have been making it very, very cheap and easy to record, store, search and summarize all the little discrete facts about our economic lives, to track individual transactions and aggregate them into new statistics. (Moving so much of our economic lives, along with all the rest of our lives, on to the Internet only makes it easier.) This could, potentially, give you a great many aggregate statistics which tell you, in a lot of detail and at high frequency, about consumption, investment, employment, interest rates, finance, and so on and so forth. There would be lots of noise, but having a great many noisy measurements could give you a lot more information. It's true that basically none of them would be well-aligned with the theoretical variables in macro models, but there are well-established statistical techniques for using lots of imperfect proxies to track a latent, theoretical variable, coming out of factor-analysis and state-space modeling. There have been some efforts already to incorporate multiple imperfect proxies into things like DSGE models.
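
To make the idea of many imperfect proxies tracking one latent variable concrete, here is a minimal simulated sketch. Everything in it is an assumption for illustration: the data are synthetic, the sizes and noise levels are arbitrary, and the first principal component is used as a crude stand-in for a proper factor-analysis or state-space estimate.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
n <- 500                         # number of time periods (assumed)
p <- 40                          # number of noisy proxies (assumed)

z <- rnorm(n)                    # the latent "theoretical" variable
loadings <- runif(p, 0.5, 1.5)   # how strongly each proxy tracks z
noise <- matrix(rnorm(n * p, sd = 2), n, p)
proxies <- outer(z, loadings) + noise   # each column is one noisy proxy

# how well does each individual proxy track the latent variable?
single.cor <- abs(cor(proxies, z))

# first principal component of the standardized proxies,
# used here as a crude estimate of the common factor
pc1 <- prcomp(proxies, scale. = TRUE)$x[, 1]
factor.cor <- abs(cor(pc1, z))

cat('Best single proxy, |correlation| with z:', round(max(single.cor), 2), '\n')
cat('First principal component, |correlation| with z:', round(factor.cor, 2), '\n')
```

Each proxy on its own is mostly noise, but the combination should track the latent variable much more closely than the best single proxy does.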

I don't want to get carried away here. The sort of ubiquitous recording I'm talking about is obviously more advanced in richer countries than in poorer ones --- it will work better in, say, South Korea, or even Indonesia, than in Afghanistan. It's also unevenly distributed within national economies. Getting hold of the data, even in summary forms, would require a lot of social engineering on the part of the Fund. The official statistics, slow and imperfect as they are, will always be more reliable and better aligned to your models. But, wearing my statistician hat, my advice to economists here is to get more information, and this is one of the biggest ways you can expand your information set.

The second point is about models --- it's a machine learning point. The dirty secret of the field, and of the current hype, is that 90% of machine learning is a rebranding of nonparametric regression. (I've got appointments in both ML and statistics so I can say these things without hurting my students.) I realize that there are reasons why the overwhelming majority of the time you work with linear regression, but those reasons aren't really about your best economic models and theories. Those reasons are about what has, in the past, been statistically and computationally feasible to estimate and work with. (So they're "economic" reasons in a sense, but about your own economies as researchers, not about economics-as-a-science.) The data will never completely speak for itself, you will always need to bring some assumptions to draw inferences. But it's now possible to make those assumptions vastly weaker, and to let the data say a lot more. Maybe everything will turn out to be nice and linear, but even if that's so, wouldn't it be nice to know that, rather than to just hope?

There is of course a limitation to using more flexible models, which impose fewer assumptions, which is that it makes it easier to "over-fit" the data, to create a really complicated model which basically memorizes every little accident and even error in what it was trained on. It may not, when you examine it, look like it's just memorizing, it may seem to give an "explanation" for every little wiggle. It will, in effect, say things like "oh, sure, normally the central bank raising interest rates would do X, but in this episode it was also liberalizing the capital account, so Y". But the way to guard against this, and to make sure your model, or the person selling you their model, isn't just BS-ing is to check that it can actually predict out-of-sample, on data it didn't get to see during fitting. This sort of cross-validation has become second nature for (honest and competent) machine learning practitioners.

This is also where lots of ML projects die. I think I can mention an effort at a Very Big Data Indeed Company to predict employee satisfaction and turn-over based on e-mail activity, which seemed to work great on the training data, but turned out to be totally useless on the next year's data, so its creators never deployed it. Cross-validation should become second nature for economists, and you should be very suspicious of anyone offering you models who can't tell you about their out-of-sample performance. (If a model can't even predict well under a constant policy, why on Earth would you trust it to predict responses to policy changes?)
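
A minimal sketch of what checking out-of-sample performance looks like, using simulated data and a single held-out set (the simplest form of the idea, not full cross-validation). The data-generating process, sample sizes and polynomial degree are all arbitrary assumptions chosen for illustration.

```r
set.seed(42)                       # arbitrary seed, for reproducibility

# simulated data where the truth really is linear (an assumption of this sketch)
make.data <- function(n) {
  x <- runif(n, -1, 1)
  data.frame(x = x, y = 1 + 2 * x + rnorm(n, sd = 0.5))
}
train <- make.data(20)             # small training set
test  <- make.data(1000)           # data the models never see during fitting

fit.simple   <- lm(y ~ x, data = train)
fit.flexible <- lm(y ~ poly(x, 12), data = train)  # 13 coefficients for 20 points

# mean squared prediction error of a fitted model on a data set
mse <- function(fit, df) mean((df$y - predict(fit, newdata = df))^2)

results <- data.frame(
  model         = c('linear', 'degree-12 polynomial'),
  in.sample     = c(mse(fit.simple, train), mse(fit.flexible, train)),
  out.of.sample = c(mse(fit.simple, test), mse(fit.flexible, test))
)
print(results)
```

The flexible model cannot do worse than the linear one on the training data, since it contains the linear model as a special case; the held-out column is where memorizing the noise shows up.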

Concretely, going forward, organizations like the Fund can begin to use much more flexible modeling forms, rather than just linear models. The technology to estimate them and predict from them quickly now exists. It's true that if you fit a linear regression and a non-parametric regression to the same data set, the linear regression will always have tighter confidence sets, but (as Jeffrey Racine says) that's rapid convergence to a systematically wrong answer. Expanding the range and volume of data used in your economic modeling, what I just called the "big data" point, will help deal with this, and there's a tremendous amount of on-going progress in quickly estimating flexible models on truly enormous data sets. You might need to hire some people with Ph.D.s in statistics or machine learning who also know some economics --- and by coincidence I just so happen to help train such people! --- but it's the right direction to go, to help your policy decisions be dictated by the data and by good economics, and not by what kinds of models were computationally feasible twenty or even sixty years ago.
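
The point about tight confidence sets around a systematically wrong answer can be sketched with simulated data. The sine-shaped truth, the noise level and the use of loess as the nonparametric smoother are assumptions made purely for illustration.

```r
set.seed(7)                        # arbitrary seed, for reproducibility

# simulated data with a nonlinear truth: y = sin(x) + noise
x.train <- seq(0, 2 * pi, length.out = 200)
train <- data.frame(x = x.train, y = sin(x.train) + rnorm(200, sd = 0.3))
x.test <- runif(200, 0, 2 * pi)
test <- data.frame(x = x.test, y = sin(x.test) + rnorm(200, sd = 0.3))

fit.lm <- lm(y ~ x, data = train)      # linear regression
fit.np <- loess(y ~ x, data = train)   # local regression, default settings

# 95% confidence interval for the linear fit at x = pi/2, where the truth is 1
ci <- predict(fit.lm, newdata = data.frame(x = pi / 2), interval = 'confidence')
print(ci)

# held-out mean squared error of each model
mse.lm <- mean((test$y - predict(fit.lm, newdata = test))^2)
mse.np <- mean((test$y - predict(fit.np, newdata = test))^2)
cat('Held-out MSE, linear:', round(mse.lm, 3), '\n')
cat('Held-out MSE, loess: ', round(mse.np, 3), '\n')
```

The linear model reports a narrow interval at x = pi/2 that should sit well below the true value of 1, while the smoother, with no such interval to brag about, predicts the held-out data better.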

The third point, the most purely cautionary one, is the artificial intelligence point. This is that almost everything people are calling "AI" these days is just machine learning, which is to say, nonparametric regression. Where we have seen breakthroughs is in the results of applying huge quantities of data to flexible models to do very particular tasks in very particular environments. The systems we get from this are really good at that, but really fragile, in ways that don't mesh well with our intuition about human beings or even other animals. One of the great illustrations of this is what are called "adversarial examples", where you can take an image that a state-of-the-art classifier thinks is, say, a dog, and by tweaking it in tiny ways which are imperceptible to humans, you can make the classifier convinced it's, say, a car. On the other hand, you can distort that picture of a dog into something unrecognizable by any person while the classifier is still sure it's a dog.

If we have to talk about our learning machines psychologically, try not to describe them as automating thought or (conscious) intelligence, but rather as automating unconscious perception or reflex action. What's now called "deep learning" used to be called "perceptrons", and it was very much about trying to do the same sort of thing that low-level perception in animals does, extracting features from the environment which work in that environment to make a behaviorally-relevant classification[2] or prediction or immediate action. This is the sort of thing we're almost never conscious of in ourselves, but is in fact what a huge amount of our brains are doing. (We know this because we can study how it breaks down in cases of brain damage.) This work is basically inaccessible to consciousness --- though we can get hints of it from visual illusions, and from the occasions where it fails, like the shock of surprise you feel when you put your foot on a step that isn't there. This sort of perception is fast, automatic, and tuned to very, very particular features of the environment.

Our current systems are like this, but even more finely tuned to narrow goals and contexts. This is why they have such alien failure-modes, and why they really don't have the sort of flexibility we're used to from humans or other animals. They generalize to more data from their training environment, but not to new environments. If you take a person who's learned to play chess and give them a 9-by-9 board with an extra rook on each side, they'll struggle but they won't go back to square one; AlphaZero will need to relearn the game from scratch. Similarly for the video-game learners, and just about everything else you'll see written up in the news, or pointed out as a milestone in a conference like this. Rodney Brooks, one of the Revered Elders of artificial intelligence, put it nicely recently, saying that the performances of these systems give us a very misleading idea of their competences[3].

One reason these genuinely-impressive and often-useful performances don't indicate human competences is that these systems work in very alien ways. So far as we can tell[4], there's little or nothing in them that corresponds to the kind of explicit, articulate understanding human intelligence achieves through language and conscious thought. There's even very little in them of the un-conscious, in-articulate but abstract, compositional, combinatorial understanding we (and other animals) show in manipulating our environment, in planning, in social interaction, and in the structure of language.

Now, there are traditions of AI research which do take inspiration from human (and animal) psychology (as opposed to a very old caricature of neurology), and try to actually model things like the structure of language, or planning, or having a body which can be moved in particular ways to interact with physical objects. And while these do make progress, it's a hell of a lot slower than the progress in systems which are just doing reflex action. That might change! There could be a great wave of incredible breakthroughs in AI (not ML) just around the corner, to the point where it will make sense to think about robots actually driving shipping trucks coast to coast, and so forth. Right now, not only is really autonomous AI beyond our grasp, we don't even have a good idea of what we're missing.

In the meanwhile, though, lots of people will sell their learning machines as though they were real AI, with human-style competences, and this will lead to a lot of mischief and (perhaps unintentional) fraud, as the machines get deployed in circumstances where their performance just won't be anything like what's intended. I half suspect that the biggest economic consequence of "AI" for the foreseeable future is that companies will be busy re-engineering human systems --- warehouses and factories, but also hospitals, schools and streets --- so as to better accommodate their machines.

So, to sum up:

  • The "big data" point is that there's a huge opportunity for the Fund, the Bank, and their kin to really expand the data on which they base their analyses and decisions, even if you keep using the same sorts of models.
  • The "machine learning" point is that there's a tremendous opportunity to use more flexible models, which do a better job of capturing economic, or political-economic, reality.
  • The "AI" point is that artificial intelligence is the technology of the future, and always will be.

The Dismal Science; Enigmas of Chance

  1. Had there been infinite time, I like to think I'd have remembered that Haavelmo saw this gap very clearly, back in the day. Fortunately, J. W. Mason has a great post on this.

  2. The classic paper on this, by, inter alia, one of the inventors of neural networks, was called "What the frog's eye tells the frog's brain". This showed how, already in the retina, the frog's nervous system picked out small-dark-dots-moving-erratically. In the natural environment, these would usually be flies or other frog-edible insects.

  3. Distinguishing between "competence" and "performance" in this way goes back, in cognitive science, at least to Noam Chomsky; I don't know whether Uncle Noam originated the distinction.

  4. The fact that I need a caveat-phrase like this is an indication of just how little we understand why some of our systems work as well as they do, which in turn should be an indication that nobody has any business making predictions about how quickly they'll advance.


Data over Space and Time, Lectures 9--13: Filtering, Fourier Analysis, African Population and Slavery, Linear Generative Models

I have fallen behind on posting announcements for the lectures, and I don't feel like writing five of these at once (*). So I'll just list them:

  1. Separating Signal and Noise with Linear Methods (a.k.a. the Wiener filter and seasonal adjustment; .Rmd)
  2. Fourier Methods I (a.k.a. a child's primer of spectral analysis; .Rmd)
  3. Midterm review
  4. Guest lecture by Prof. Patrick Manning: "African Population and Migration: Statistical Estimates, 1650--1900" [PDF handout]
  5. Linear Generative Models for Time Series (a.k.a. the eigendecomposition of the evolution operator is the source of all knowledge; .Rmd)
  6. Linear Generative Models for Spatial and Spatio-Temporal Data (a.k.a. conditional and simultaneous autoregressions; .Rmd)

*: Yes, this is a sign that I need to change my workflow. Several readers have recommended Blogdown, which looks good, but which I haven't had a chance to try out yet.

Corrupting the Young; Enigmas of Chance

Continue Reading…


Read More

Young Investigator Special Competition for Time-Sharing Experiment for the Social Sciences

Sociologists Jamie Druckman and Jeremy Freese write:

Time-Sharing Experiments for the Social Sciences is Having A Special Competition for Young Investigators

Time-sharing Experiments for the Social Sciences (TESS) is an NSF-funded initiative. Investigators propose survey experiments to be fielded using a nationally representative Internet platform via NORC’s AmeriSpeak Panel (see http:/ for more information). While anyone can submit a proposal to TESS at any time through our regular mechanism, we are having a Special Competition for Young Investigators. Graduate students and individuals who received their PhD in 2016 or after are eligible.

To give some examples of experiments we’ve done: one TESS experiment showed that individuals are more likely to support a business refusing service to a gay couple versus an interracial couple, but were no more supportive of religious reasons for doing so versus nonreligious reasons. Another experiment found that participants were more likely to attribute illnesses of obese patients as due to poor lifestyle choices and of non-obese patients to biological factors, which, in turn, resulted in participants being less sympathetic to overweight patients—especially when patients are female. TESS has also fielded an experiment about whether the opinions of economists influence public opinion on different issues, and the study found that they do on relatively technical issues but not so much otherwise.

The proposals that win our Special Competition will be able to be fielded at up to twice the size of a regular TESS study. We will begin accepting proposals for the Special Competition on January 1, 2019, and the deadline is March 1, 2019. Full details about the competition are available at

The post Young Investigator Special Competition for Time-Sharing Experiment for the Social Sciences appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

Magister Dixit

“The value is not in software, the value is in data, and this is really important for every single company, that they understand what data they’ve got.” John Straw

Continue Reading…


Read More

What's new on arXiv

DeFind: A Protege Plugin for Computing Concept Definitions in EL Ontologies

We introduce an extension to the Protege ontology editor, which allows for discovering concept definitions, which are not explicitly present in axioms, but are logically implied by an ontology. The plugin supports ontologies formulated in the Description Logic EL, which underpins the OWL 2 EL profile of the Web Ontology Language and despite its limited expressiveness captures most of the biomedical ontologies published on the Web. The developed tool allows to verify whether a concept can be defined using a vocabulary of interest specified by a user. In particular, it allows to decide whether some vocabulary items can be omitted in a formulation of a complex concept. The corresponding definitions are presented to the user and are provided with explanations generated by an ontology reasoner.

Parametrized Deep Q-Networks Learning: Reinforcement Learning with Discrete-Continuous Hybrid Action Space

Most existing deep reinforcement learning (DRL) frameworks consider either discrete action space or continuous action space solely. Motivated by applications in computer games, we consider the scenario with discrete-continuous hybrid action space. To handle hybrid action space, previous works either approximate the hybrid space by discretization, or relax it into a continuous set. In this paper, we propose a parametrized deep Q-network (P-DQN) framework for the hybrid action space without approximation or relaxation. Our algorithm combines the spirits of both DQN (dealing with discrete action space) and DDPG (dealing with continuous action space) by seamlessly integrating them. Empirical results on a simulation example, scoring a goal in simulated RoboCup soccer and the solo mode in game King of Glory (KOG) validate the efficiency and effectiveness of our method.

Temporal Convolutional Memory Networks for Remaining Useful Life Estimation of Industrial Machinery

Accurately estimating the remaining useful life (RUL) of industrial machinery is beneficial in many real-world applications. Estimation techniques have mainly utilized linear models or neural network based approaches with a focus on short term time dependencies. This paper introduces a system model that incorporates temporal convolutions with both long term and short term time dependencies. The proposed network learns salient features and complex temporal variations in sensor values, and predicts the RUL. A data augmentation method is used for increased accuracy. The proposed method is compared with several state-of-the-art algorithms on publicly available datasets. It demonstrates promising results, with superior results for datasets obtained from complex environments.

Building Dynamic Knowledge Graphs from Text using Machine Reading Comprehension

We propose a neural machine-reading model that constructs dynamic knowledge graphs from procedural text. It builds these graphs recurrently for each step of the described procedure, and uses them to track the evolving states of participant entities. We harness and extend a recently proposed machine reading comprehension (MRC) model to query for entity states, since these states are generally communicated in spans of text and MRC models perform well in extracting entity-centric spans. The explicit, structured, and evolving knowledge graph representations that our model constructs can be used in downstream question answering tasks to improve machine comprehension of text, as we demonstrate empirically. On two comprehension tasks from the recently proposed PROPARA dataset (Dalvi et al., 2018), our model achieves state-of-the-art results. We further show that our model is competitive on the RECIPES dataset (Kiddon et al., 2015), suggesting it may be generally applicable. We present some evidence that the model’s knowledge graphs help it to impose commonsense constraints on its predictions.

Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms

Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm PAM, partitioning around medoids, also known as k-medoids. In Euclidean geometry the mean–as used in k-means–is a good estimator for the cluster center, but this does not hold for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains such as biology that require the use of Jaccard, Gower, or even more complex distances. A key issue with PAM is, however, its high run time cost. In this paper, we propose modifications to the PAM algorithm where at the cost of storing O(k) additional values, we can achieve an O(k)-fold speedup in the second (‘SWAP’) phase of the algorithm, but will still find the same results as the original PAM algorithm. If we slightly relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by performing up to k swaps in each iteration. We also show how the CLARA and CLARANS algorithms benefit from this modification. In experiments on real data with k=100, we observed a 200 fold speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets, and in particular to higher k.
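
To make the medoid idea concrete, here is a minimal Python sketch of just the definition quoted above: the object with the smallest total dissimilarity to the others in its cluster. It does not implement the paper's accelerated SWAP phase. The toy item sets, the `medoid` helper and the choice of Jaccard distance are illustrative assumptions, not taken from the paper.

```python
def jaccard_distance(a, b):
    """Jaccard dissimilarity between two sets (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def medoid(members, dissim):
    """Return the member with the smallest total dissimilarity to all members."""
    return min(members, key=lambda i: sum(dissim(i, j) for j in members))


# Hypothetical toy data: each object is a set of tags.
items = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "c", "d"},
    {"x", "y"},
]
cluster = [0, 1, 2, 3]

# Works with any dissimilarity function, which is the point of using a medoid.
m = medoid(cluster, lambda i, j: jaccard_distance(items[i], items[j]))
print("medoid index:", m)
```

Because `medoid` only ever calls the dissimilarity function, the same helper works unchanged for Gower or any other distance, which is why the notion carries over to non-Euclidean data where a mean is not defined.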

Can evolution paths be explained by chance alone?

We propose a purely probabilistic model to explain the evolution path of a population maximum fitness. We show that after n births in the population there are about \ln n upwards jumps. This is true for any mutation probability and any fitness distribution and therefore suggests a general law for the number of upwards jumps. Simulations of our model show that a typical evolution path has first a steep rise followed by long plateaux. Moreover, independent runs show parallel paths. This is consistent with what was observed by Lenski and Travisano (1994) in their bacteria experiments.
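
The logarithmic count of upward jumps can be checked with a small simulation. The sketch below makes a simplifying assumption that is not the paper's full model: every birth draws a fresh, independent uniform fitness (that is, mutation probability 1). Under that assumption the jumps are the records of an i.i.d. sequence, whose expected number after n draws is the harmonic number H_n, which grows like ln n. The function names are hypothetical.

```python
import math
import random


def count_upward_jumps(n, rng):
    """Count how often the running maximum fitness increases over n births."""
    best = float("-inf")
    jumps = 0
    for _ in range(n):
        fitness = rng.random()  # assumption: i.i.d. uniform fitness per birth
        if fitness > best:
            best = fitness
            jumps += 1
    return jumps


def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n, the expected number of records in n i.i.d. draws."""
    return sum(1.0 / k for k in range(1, n + 1))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
runs = 200
mean_jumps = sum(count_upward_jumps(n, rng) for _ in range(runs)) / runs

print("simulated mean jumps:", mean_jumps)
print("H_n:", harmonic(n), " ln n:", math.log(n))
```

Most jumps happen early, since the k-th birth sets a new maximum with probability 1/k; that matches the steep initial rise followed by long plateaux described in the abstract.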

One-Shot PIR: Refinement and Lifting

We study a class of private information retrieval (PIR) methods that we call one-shot schemes. The intuition behind one-shot schemes is the following. The user’s query is regarded as a dot product of a query vector and the message vector (database) stored at multiple servers. Privacy, in an information theoretic sense, is then achieved by encrypting the query vector using a secure linear code, such as secret sharing. Several PIR schemes in the literature, in addition to novel ones constructed here, fall into this class. One-shot schemes provide an insightful link between PIR and data security against eavesdropping. However, their download rate is not optimal, i.e., they do not achieve the PIR capacity. Our main contribution is two transformations of one-shot schemes, which we call refining and lifting. We show that refining and lifting one-shot schemes gives capacity-achieving schemes for the cases when the PIR capacity is known. In the other cases, when the PIR capacity is still unknown, refining and lifting one-shot schemes gives the best download rate so far.
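
The "query as a dot product, hidden by secret sharing" intuition can be illustrated with the classic two-server scheme over GF(2). This is a textbook construction, not the refined or lifted schemes of the paper, and it assumes the two servers do not collude. The helper names and the toy database are hypothetical.

```python
import secrets


def dot_gf2(query, database):
    """Dot product over GF(2): XOR of the database bits selected by the query."""
    acc = 0
    for q, bit in zip(query, database):
        acc ^= q & bit
    return acc


def make_queries(index, size):
    """Split the unit vector e_index into two additive shares over GF(2).

    Each share on its own is a uniformly random bit vector, so a single
    server learns nothing about which index is wanted.
    """
    share1 = [secrets.randbelow(2) for _ in range(size)]
    share2 = list(share1)
    share2[index] ^= 1
    return share1, share2


def retrieve(index, database):
    """Recover database[index] from the two servers' one-bit answers."""
    q1, q2 = make_queries(index, len(database))
    answer1 = dot_gf2(q1, database)  # computed by server 1
    answer2 = dot_gf2(q2, database)  # computed by server 2
    return answer1 ^ answer2         # <q1 + q2, db> = <e_index, db>


database = [1, 0, 1, 1, 0, 0, 1, 0]  # toy replicated database of bits
print([retrieve(i, database) for i in range(len(database))])
```

Correctness follows from linearity: the two answers XOR to the dot product of the database with `q1 XOR q2`, which is the unit vector selecting the wanted bit.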

Estimating Information Flow in Neural Networks

We study the flow of information and the evolution of internal representations during deep neural network (DNN) training, aiming to demystify the compression aspect of the information bottleneck theory. The theory suggests that DNN training comprises a rapid fitting phase followed by a slower compression phase, in which the mutual information I(X;T) between the input X and internal representations T decreases. Several papers observe compression of estimated mutual information on different DNN models, but the true I(X;T) over these networks is provably either constant (discrete X) or infinite (continuous X). This work explains the discrepancy between theory and experiments, and clarifies what was actually measured by these past works. To this end, we introduce an auxiliary (noisy) DNN framework for which I(X;T) is a meaningful quantity that depends on the network’s parameters. This noisy framework is shown to be a good proxy for the original (deterministic) DNN both in terms of performance and the learned representations. We then develop a rigorous estimator for I(X;T) in noisy DNNs and observe compression in various models. By relating I(X;T) in the noisy DNN to an information-theoretic communication problem, we show that compression is driven by the progressive clustering of hidden representations of inputs from the same class. Several methods to directly monitor clustering of hidden representations, both in noisy and deterministic DNNs, are used to show that meaningful clusters form in the T space. Finally, we return to the estimator of I(X;T) employed in past works, and demonstrate that while it fails to capture the true (vacuous) mutual information, it does serve as a measure for clustering. This clarifies the past observations of compression and isolates the geometric clustering of hidden representations as the true phenomenon of interest.

Unsupervised Neural Multi-document Abstractive Summarization

Abstractive summarization has been studied using neural sequence transduction methods with datasets of large, paired document-summary examples. However, such datasets are rare and the models trained from them do not generalize to other domains. Recently, some progress has been made in learning sequence-to-sequence mappings with only unpaired examples. In our work, we consider the setting where there are only documents and no summaries provided and propose an end-to-end, neural model architecture to perform unsupervised abstractive summarization. Our proposed model consists of an auto-encoder trained so that the mean of the representations of the input documents decodes to a reasonable summary. We consider variants of the proposed architecture and perform an ablation study to show the importance of specific components. We apply our model to the summarization of business and product reviews and show that the generated summaries are fluent, show relevancy in terms of word-overlap, representative of the average sentiment of the input documents, and are highly abstractive compared to baselines.

Explaining Black Boxes on Sequential Data using Weighted Automata

Understanding how a learned black box works is of crucial interest for the future of Machine Learning. In this paper, we pioneer the question of the global interpretability of learned black box models that assign numerical values to symbolic sequential data. To tackle that task, we propose a spectral algorithm for the extraction of weighted automata (WA) from such black boxes. This algorithm does not require the access to a dataset or to the inner representation of the black box: the inferred model can be obtained solely by querying the black box, feeding it with inputs and analyzing its outputs. Experiments using Recurrent Neural Networks (RNN) trained on a wide collection of 48 synthetic datasets and 2 real datasets show that the obtained approximation is of great quality.

Graph HyperNetworks for Neural Architecture Search

Neural architecture search (NAS) automatically finds the best task-specific neural network topology, outperforming many manual architecture designs. However, it can be prohibitively expensive as the search requires training thousands of different networks, while each can last for hours. In this work, we propose the Graph HyperNetwork (GHN) to amortize the search cost: given an architecture, it directly generates the weights by running inference on a graph neural network. GHNs model the topology of an architecture and therefore can predict network performance more accurately than regular hypernetworks and premature early stopping. To perform NAS, we randomly sample architectures and use the validation accuracy of networks with GHN generated weights as the surrogate search signal. GHNs are fast — they can search nearly 10 times faster than other random search methods on CIFAR-10 and ImageNet. GHNs can be further extended to the anytime prediction setting, where they have found networks with better speed-accuracy tradeoff than the state-of-the-art manual designs.

Measuring Swampiness: Quantifying Chaos in Large Heterogeneous Data Repositories

As scientific data repositories and filesystems grow in size and complexity, they become increasingly disorganized. The coupling of massive quantities of data with poor organization makes it challenging for scientists to locate and utilize relevant data, thus slowing the process of analyzing data of interest. To address these issues, we explore an automated clustering approach for quantifying the organization of data repositories. Our parallel pipeline processes heterogeneous filetypes (e.g., text and tabular data), automatically clusters files based on content and metadata similarities, and computes a novel ‘cleanliness’ score from the resulting clustering. We demonstrate the generation and accuracy of our cleanliness measure using both synthetic and real datasets, and conclude that it is more consistent than other potential cleanliness measures.

Mixture of Expert/Imitator Networks: Scalable Semi-supervised Learning Framework

The current success of deep neural networks (DNNs) in an increasingly broad range of tasks for the artificial intelligence strongly depends on the quality and quantity of labeled training data. In general, the scarcity of labeled data, which is often observed in many natural language processing tasks, is one of the most important issues to be addressed. Semi-supervised learning (SSL) is a promising approach to overcome this issue by incorporating a large amount of unlabeled data. In this paper, we propose a novel scalable method of SSL for text classification tasks. The unique property of our method, Mixture of Expert/Imitator Networks, is that imitator networks learn to ‘imitate’ the estimated label distribution of the expert network over the unlabeled data, which potentially contributes as a set of features for the classification. Our experiments demonstrate that the proposed method consistently improves the performance of several types of baseline DNNs. We also demonstrate that our method has the more data, better performance property with promising scalability to the unlabeled data.

Categorical Aspects of Parameter Learning

Parameter learning is the technique for obtaining the probabilistic parameters in conditional probability tables in Bayesian networks from tables with (observed) data, where it is assumed that the underlying graphical structure is known. There are basically two ways of doing so, referred to as maximal likelihood estimation (MLE) and as Bayesian learning. This paper provides a categorical analysis of these two techniques and describes them in terms of basic properties of the multiset monad M, the distribution monad D and the Giry monad G. In essence, learning is about the relationships between multisets (used for counting) on the one hand and probability distributions on the other. These relationships will be described as suitable natural transformations.

A Geometric Analysis of Time Series Leading to Information Encoding and a New Entropy Measure

A time series is uniquely represented by its geometric shape, which also carries information. A time series can be modelled as the trajectory of a particle moving in a force field with one degree of freedom. The force acting on the particle shapes the trajectory of its motion, which is made up of elementary shapes of infinitesimal neighborhoods of points in the trajectory. It has been proved that an infinitesimal neighborhood of a point in a continuous time series can have at least 29 different shapes or configurations. So information can be encoded in it in at least 29 different ways. A 3-point neighborhood (the smallest) in a discrete time series can have precisely 13 different shapes or configurations. In other words, a discrete time series can be expressed as a string of 13 symbols. Across diverse real as well as simulated data sets it has been observed that 6 of them occur more frequently and the remaining 7 occur less frequently. Based on frequency distribution of 13 configurations or 13 different ways of information encoding a novel entropy measure, called semantic entropy (E), has been defined. Following notion of power in Newtonian mechanics of the moving particle whose trajectory is the time series, a notion of information power (P) has been introduced for time series. E/P turned out to be an important indicator of synchronous behaviour of time series as observed in epileptic EEG signals.
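
One plausible way to arrive at exactly 13 shapes for a 3-point neighbourhood is to label it by the signs of its two slopes and, when both slopes point the same way, by the sign of the curvature. The Python sketch below uses that labelling, which is an assumption: the abstract does not spell out the paper's own taxonomy. Plain Shannon entropy over the symbol frequencies is likewise a stand-in, since the definition of semantic entropy is not given here.

```python
import math
from collections import Counter
from itertools import product


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def shape(a, b, c):
    """Label a 3-point neighbourhood (a, b, c).

    The label is (sign of first slope, sign of second slope, curvature sign),
    where curvature is only distinguished when both slopes share a nonzero sign.
    """
    s1, s2 = sign(b - a), sign(c - b)
    if s1 == s2 and s1 != 0:
        return (s1, s2, sign((c - b) - (b - a)))
    return (s1, s2, 0)


def encode(series):
    """Turn a discrete time series into a string of shape symbols."""
    return [shape(*series[i:i + 3]) for i in range(len(series) - 2)]


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Enumerate every label reachable with small integer slopes.
all_shapes = {shape(0, d1, d1 + d2) for d1, d2 in product(range(-2, 3), repeat=2)}
print("distinct shapes:", len(all_shapes))

toy_series = [0, 1, 0, 1, 0]  # hypothetical data
print("entropy of toy series:", shannon_entropy(encode(toy_series)))
```

Seven of the labels come from slope pairs that differ in sign or include a flat step; the remaining six come from rising or falling runs that are convex, straight or concave.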

Ineffectiveness of Dictionary Coding to Infer Predictability Limits of Human Mobility
Neural Network based classification of bone metastasis by primary cacinoma
DeepWeeds: A Multiclass Weed Species Image Dataset for Deep Learning
Automatic Segmentation of Thoracic Aorta Segments in Low-Dose Chest CT
UOLO – automatic object detection and segmentation in biomedical images
Rate Distortion For Model Compression: From Theory To Practice
Is PGD-Adversarial Training Necessary? Alternative Training via a Soft-Quantization Network with Noisy-Natural Samples Only
Unpaired High-Resolution and Scalable Style Transfer Using Generative Adversarial Networks
CRH: A Simple Benchmark Approach to Continuous Hashing
Image Super-Resolution Using VDSR-ResNeXt and SRCGAN
Computational ghost imaging using a field-programmable gate array
A Novel Domain Adaptation Framework for Medical Image Segmentation
A Resource Allocation based Approach for Corporate Mobility as a Service
A Data-Driven Framework for Assessing Cold Load Pick-up Demand in Service Restoration
Learning Optimal Deep Projection of $^{18}$F-FDG PET Imaging for Early Differential Diagnosis of Parkinsonian Syndromes
InfiNet: Fully Convolutional Networks for Infant Brain MRI Segmentation
Bottom-up Attention, Models of
Inventory Balancing with Online Learning
Dirichlet conditions in Poincaré-Sobolev inequalities: the sub-homogeneous case
The highD Dataset: A Drone Dataset of Naturalistic Vehicle Trajectories on German Highways for Validation of Highly Automated Driving Systems
Mean field voter model on networks and multi-variate beta distribution
On the sensitivity analysis of energy quanto options
Subordinators which are infinitely divisible w.r.t. time: Construction, properties, and simulation of max-stable sequences and infinitely divisible laws
Wind Power Persistence is Governed by Superstatistic
Finite sample performance of linear least squares estimation
Thresholds quantifying proportionality criteria for election methods
Smart Load Node for Non-Smart Load under Smart Grid Paradigm
Regression Based Approach for Measurement of Current in Single-Phase Smart Energy Meter
SmartPM: Automatic Adaptation of Dynamic Processes at Run-Time
On mixture representations for the generalized Linnik distribution and their applications in limit theorems
BSDEs driven by $|z|^2/y$ and applications
Performance Analysis of Large Intelligence Surfaces (LISs): Asymptotic Data Rate and Channel Hardening Effects
Linear response and moderate deviations: hierarchical approach. IV
Spherical Regression under Mismatch Corruption with Application to Automated Knowledge Translation
Long-Duration Autonomy for Small Rotorcraft UAS including Recharging
Non vanishing of theta functions and sets of small multiplicative energy
Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience
Linear Program Reconstruction in Practice
Almost Complete Graphs and the Kruskal Katona Theorem
Improving Generalization of Sequence Encoder-Decoder Networks for Inverse Imaging of Cardiac Transmembrane Potential
Does Haze Removal Help CNN-based Image Classification?
Topology of Z_3 equivariant Hilbert schemes
Policy Transfer with Strategy Optimization
Global Convergence of EM Algorithm for Mixtures of Two Component Linear Regression
Relative compression of trajectories
A Word-Complexity Lexicon and A Neural Readability Ranking Model for Lexical Simplification
Shell Tableaux: A set partition analogue of vacillating tableaux
Topological Inference of Manifolds with Boundary
A geometrically converging dual method for distributed optimization over time-varying graphs
Estimating Robot Strengths with Application to Selection of Alliance Members in FIRST Robotics Competitions
A Model for Auto-Programming for General Purposes
Hierarchical Game-Theoretic Planning for Autonomous Vehicles
$C_{2k}$-saturated graphs with no short odd cycles
CPNet: A Context Preserver Convolutional Neural Network for Detecting Shadows in Single RGB Images
Pose Estimation for Objects with Rotational Symmetry
Stabilization and manipulation of multi-spin states in quantum dot time crystals with Heisenberg interactions
Cloud Detection Algorithm for Remote Sensing Images Using Fully Convolutional Neural Networks
Learning to Globally Edit Images with Textual Description
Point Cloud GAN
Core Influence Mechanism on Vertex-Cover Problem through Leaf-Removal-Core Breaking
Deep learning based cloud detection for remote sensing images by the fusion of multi-scale convolutional features
On the null structure of bipartite graphs without cycles of length a multiple of 4
Towards Provably Safe Mixed Transportation Systems with Human-driven and Automated Vehicles
Extremes of branching Ornstein-Uhlenbeck processes
Efficient Multi-level Correlating for Visual Tracking
On the Rate of Convergence for a Characteristic of Multidimensional Birth-Death Process
Ultrafast cryptography with indefinitely switchable optical nanoantennas
Contagions in Social Networks: Effects of Monophilic Contagion, Friendship Paradox and Reactive Networks
Quantum simulation of clustered photosynthetic light harvesting in a superconducting quantum circuit
Diffusive spin-orbit torque at a surface of topological insulator
Approximating Pairwise Correlations in the Ising Model
Characterising epithelial tissues using persistent entropy
Time Synchronization in Wireless Sensor Networks based on Newtons Adaptive Algorithm
Delay Regulated Explosive Synchronization in Multiplex Networks
Error estimation at the information reconciliation stage of quantum key distribution
Optimal Temperature Spacing for Regionally Weight-preserving Tempering
Nesterov Acceleration of Alternating Least Squares for Canonical Tensor Decomposition
Overview of CAIL2018: Legal Judgment Prediction Competition
Exploiting Semantics in Adversarial Training for Image-Level Domain Adaptation
Using generalized estimating equations to estimate nonlinear models with spatial data
On Greedy and Strategic Evaders in Sequential Interdiction Settings with Incomplete Information
Equivalent Constraints for Two-View Geometry: Pose Solution/Pure Rotation Identification and 3D Reconstruction
Attention Driven Person Re-identification
Understanding Crosslingual Transfer Mechanisms in Probabilistic Topic Modeling
Hybrid Building/Floor Classification and Location Coordinates Regression Using A Single-Input and Multi-Output Deep Neural Network for Large-Scale Indoor Localization Based on Wi-Fi Fingerprinting
Generalized tensor equations with leading structured tensors
Linearizable Replicated State Machines with Lattice Agreement
Further study on tensor absolute value equations
Optimal Control of DERs in ADN under Spatial and Temporal Correlated Uncertainties
Embedded deep learning in ophthalmology: Making ophthalmic imaging smarter
A space-time pseudospectral discretization method for solving diffusion optimal control problems with two-sided fractional derivatives
Optimal Time Scheduling Scheme for Wireless Powered Ambient Backscatter Communication in IoT Network
A New [Combinatorial] Proof of the Commutativity of Matching Polynomials for Cycles
Resource Allocation in IoT networks using Wireless Power Transfer
Deep Learning-Based Channel Estimation
Power Flow as Intersection of Circles: A new Fixed Point Method
Towards Formal Definitions of Blameworthiness, Intention, and Moral Responsibility
Computing the partition function of the Sherrington-Kirkpatrick model is hard on average
Optimal Evidence Accumulation on Social Networks
Group Inverse of the Laplacian of Connections of Networks
Evacuation simulation considering action of the guard in an artificial attack
No-reference Image Denoising Quality Assessment
Two Can Play That Game: An Adversarial Evaluation of a Cyber-alert Inspection System
Porosity Amount Estimation in Stones Based on Combination of One Dimensional Local Binary Patterns and Image Normalization Technique
A Transformation-Proximal Bundle Algorithm for Solving Large-Scale Multistage Adaptive Robust Optimization Problems
Massively Parallel Hyperparameter Tuning
Uniform Convergence Rate of the Kernel Density Estimator Adaptive to Intrinsic Dimension
End-to-End Service Level Agreement Specification for IoT Applications
False Data Injection Cyber-Attack Detection
Enhanced Energy Management System with Corrective Transmission Switching Strategy – Part I: Methodology
Enhanced Energy Management System with Corrective Transmission Switching Strategy – Part II: Results and Discussion
Varifocal-Net: A Chromosome Classification Approach using Deep Convolutional Networks
Social Media Brand Engagement as a Proxy for E-commerce Activities: A Case Study of Sina Weibo and JD
Robust Model Predictive Control of Irrigation Systems with Active Uncertainty Learning and Data Analytics
Delay-Constrained Covert Communications with A Full-Duplex Receiver
The relationship between graphs and Nichols braided Lie algebras
Comparison Detector: A novel object detection method for small dataset
Approximating optimal transport with linear programs
Incorporating Diversity into Influential Node Mining
Rainbow triangles in arc-colored digraphs
Empirical determination of the optimum attack for fragmentation of modular networks
Perceptual Image Quality Assessment through Spectral Analysis of Error Representations
Efficient Reconstructions of Common Era Climate via Integrated Nested Laplace Approximations
DDSL: Efficient Subgraph Listing on Distributed and Dynamic Graphs
Sequential Change-point Detection for High-dimensional and non-Euclidean Data
Learning to Sketch with Deep Q Networks and Demonstrated Strokes
Finding Similar Medical Questions from Question Answering Websites
Kasteleyn operators from mirror symmetry
Theoretical Guarantees of Transfer Learning
Lung Structures Enhancement in Chest Radiographs via CT based FCNN Training
Convex Hull Approximation of Nearly Optimal Lasso Solutions
Modeling Multimodal Dynamic Spatiotemporal Graphs
BLEU is Not Suitable for the Evaluation of Text Simplification

Continue Reading…


Read More

survHE new release

(This article was first published on R on Gianluca Baio, and kindly contributed to R-bloggers)

I have just submitted a revised version of survHE on CRAN — it should be up very shortly. This will be version 1.0.64 and its main feature is a major restructuring in the way the rstan/HMC stuff works.

Basically, this is due to a change in the default C++ compiler. I don’t think much will change in terms of how survHE works when running full Bayesian models using HMC, but R now compiles it without problems. Following the advice of Ben Goodrich, I have also modified the package so that it compiles the Stan programs serially as opposed to gluing them all together, which should optimise memory use.


Continue Reading…


Read More

October 18, 2018

Maryland's Bridge Safety, reported using R

A front-page story in the Baltimore Sun reported last week on the state of the bridges in Maryland. Among the report's findings:

  • 5.4% of bridges are classified in "poor" or "structurally deficient" condition
  • 13% of bridges in the city of Baltimore are in "poor" condition

Those findings were the result of analysis of Federal infrastructure data by reporter Christine Zhang. The analysis was performed using R and documented in a Jupyter Notebook published on GitHub. The raw data included almost 50 variables, including type, location, ownership, and inspection dates and ratings, and required a fair bit of processing with tidyverse functions to extract the Maryland-specific statistics above. The analysis also turned up an unusual owner for one of the bridges: this one, an access road to the Goddard Space Flight Center, is owned by NASA.


You can read the story in the Baltimore Sun, or check out the R analysis in the GitHub repo linked below.

GitHub (Baltimore Sun): Maryland bridges analysis (via Sharon Machlis)

Continue Reading…


Read More

Book Memo: “Data Mining for Systems Biology”

Methods and Protocols
This fully updated book collects numerous data mining techniques, reflecting the acceleration and diversity of the development of data-driven approaches to the life sciences. The first half of the volume examines genomics, particularly metagenomics and epigenomics, which promise to deepen our knowledge of genes and genomes, while the second half of the book emphasizes metabolism and the metabolome as well as relevant medicine-oriented subjects. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that is useful for getting optimal results. Authoritative and practical, Data Mining for Systems Biology: Methods and Protocols, Second Edition serves as an ideal resource for researchers of biology and relevant fields, such as medical, pharmaceutical, and agricultural sciences, as well as for the scientists and engineers who are working on developing data-driven techniques, such as databases, data sciences, data mining, visualization systems, and machine learning or artificial intelligence that now are central to the paradigm-altering discoveries being made with a higher frequency.

Continue Reading…


Read More

R Packages worth a look

Creates an Interactive Tree Structure of a Directory (directotree)
Represents the content of a directory as an interactive collapsible tree. Offers the possibility to assign a text (e.g., a ‘Readme.txt’) to each folder …

Approximate the Variance of the Horvitz-Thompson Total Estimator (UPSvarApprox)
Variance approximations for the Horvitz-Thompson total estimator in Unequal Probability Sampling using only first-order inclusion probabilities. See Ma …

Presentations in the REPL (REPLesentR)
Create presentations and display them inside the R ‘REPL’ (Read-Eval-Print loop), aka the R console. Presentations can be written in ‘RMarkdown’ or any …

Continue Reading…


Read More

Ethics in statistical practice and communication: Five recommendations.

I recently published an article summarizing some of my ideas on ethics in statistics, going over these recommendations:

1. Open data and open methods,

2. Be clear about the information that goes into statistical procedures,

3. Create a culture of respect for data,

4. Publication of criticisms,

5. Respect the limitations of statistics.

The full article is here.

The post Ethics in statistical practice and communication: Five recommendations. appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

How to Solve the ModelOps Challenge

A recent study shows that while 85% believe data science will allow their companies to obtain or sustain a competitive advantage, only 5% are using data science extensively. Join this webinar, Nov 14, to find out why.

Continue Reading…


Read More

Distilled News

Keras Transfer Learning For Beginners

This blog consists of 3 parts:
1. What is transfer learning?
2. Why does transfer learning work so well?
3. Coding your first image recognizer using transfer learning.

What’s New in Deep Learning Research: Facebook Meta-Embeddings Allow NLP Models to Choose Their Own Architecture

Word embeddings have revolutionized the world of natural language processing (NLP). Conceptually, word embeddings are language modeling methods that map words or phrases in a sentence to vectors of numbers. One of the first steps in any NLP application is to determine which type of word embedding algorithm to use. Typically, NLP models resort to pretrained word embedding algorithms such as Word2Vec, GloVe or FastText. While that approach is relatively simple, it is also highly inefficient, as it is nearly impossible to determine which word embedding will perform better as the NLP model evolves. What if the NLP model itself could select the best word embedding for a given context? In a recent paper, researchers from Facebook’s Artificial Intelligence Research Lab (FAIR) proposed a method that allows NLP models to dynamically select the word embedding algorithm that performs best in a given environment. Dynamic Meta-Embeddings is a technique that combines different word embedding models in an ensemble and allows an NLP algorithm to choose which embedding to use based on performance. Facebook’s technique essentially delays the selection of an embedding algorithm from design time to runtime, based on the specific behavior of the ensemble.

Building a Convolutional Neural Network (CNN) in Keras

Deep Learning is becoming a very popular subset of machine learning due to its high level of performance across many types of data. A great way to use deep learning to classify images is to build a convolutional neural network (CNN). The Keras library in Python makes it pretty simple to build a CNN. Computers see images using pixels. Pixels in images are usually related. For example, a certain group of pixels may signify an edge in an image or some other pattern. Convolutions use this to help identify images.

Building a Logistic Regression in Python

Suppose you are given the scores of two exams for various applicants and the objective is to classify the applicants into two categories based on their scores, i.e., into Class-1 if the applicant can be admitted to the university or into Class-0 if the candidate can’t be given admission. Can this problem be solved using Linear Regression? Let’s check.
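As a rough illustration of the two-class setup described above, the sketch below passes a weighted sum of the two exam scores through a sigmoid, so the output can be read as a probability and thresholded at 0.5. The weights are hypothetical placeholders rather than fitted values, and the function names are made up for this example.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_admission(exam1, exam2, weights=(-25.0, 0.2, 0.2)):
    """Return (probability, class) for one applicant.

    `weights` is a hypothetical (bias, w1, w2) triple chosen for illustration,
    not the result of training on real data.
    """
    bias, w1, w2 = weights
    probability = sigmoid(bias + w1 * exam1 + w2 * exam2)
    # Class-1 (admit) when the probability reaches 0.5, otherwise Class-0
    return probability, int(probability >= 0.5)

print(predict_admission(80, 70))
print(predict_admission(40, 50))
```

A plain linear model would return unbounded values for the same inputs, which is one reason a sigmoid is usually placed on top of it for classification.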

Automated Hyper-parameter Optimization in SageMaker

So you’ve built your model and are getting sensible results, and are now ready to squeeze out as much performance as possible. One possibility is doing Grid Search, where you try every possible combination of hyper-parameters and choose the best one. That works well if your number of choices is relatively small, but what if you have a large number of hyper-parameters, and some are continuous values that might span several orders of magnitude? Random Search works pretty well to explore the parameter space without committing to exploring all of it, but is randomly groping in the dark the best we can do? Of course not. Bayesian Optimization is a technique for optimizing a function when making sequential decisions. In this case, we’re trying to maximize performance by choosing hyper-parameter values. This sequential decision framework means that the hyper-parameters you choose for the next step will be influenced by the performance of all the previous attempts. Bayesian Optimization makes principled decisions about how to balance exploring new regions of the parameter space vs exploiting regions that are known to perform well. This is all to say that it’s generally much more efficient to use Bayesian Optimization than alternatives like Grid Search and Random Search.
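To make the contrast between the two baseline strategies concrete, here is a minimal sketch of Grid Search and Random Search over two hyper-parameters. It does not implement Bayesian Optimization or use SageMaker; `validation_score` is a hypothetical stand-in for training and evaluating a real model, and the parameter names and ranges are assumptions.

```python
import itertools
import random

def validation_score(learning_rate, depth):
    """Hypothetical objective (higher is better); stands in for a real model run."""
    return -(learning_rate - 0.1) ** 2 - 0.01 * (depth - 6) ** 2

def grid_search(learning_rates, depths):
    """Evaluate every combination; cost grows as the product of the grid sizes."""
    combos = list(itertools.product(learning_rates, depths))
    scored = [(validation_score(lr, d), (lr, d)) for lr, d in combos]
    best_score, best_params = max(scored)
    return best_score, best_params, len(combos)

def random_search(n_trials, seed=0):
    """Sample a fixed budget of points, log-uniformly for the learning rate."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-4, 0)  # spans four orders of magnitude
        depth = rng.randint(2, 12)     # inclusive on both ends
        candidate = (validation_score(lr, depth), (lr, depth))
        if best is None or candidate > best:
            best = candidate
    return best[0], best[1], n_trials

print(grid_search([0.001, 0.01, 0.1, 1.0], [2, 4, 6, 8]))
print(random_search(20))
```

Neither strategy uses the scores of earlier trials to decide where to look next, which is the gap a sequential method such as Bayesian Optimization is meant to fill.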

Stacked Neural Networks for Prediction

Machine learning and deep learning have found their place in financial institutions for their power in predicting time series data with high degrees of accuracy. There is a lot of research going on to improve models so that they can predict data with a higher degree of accuracy. This post is a write-up about my project AlphaAI, which is a stacked neural network architecture that predicts the stock prices of various companies. This project is also one of the finalists at iNTUtion 2018, a hackathon for undergraduates here in Singapore.

Who am I connected to?

A problem that arises a lot when you play with data is figuring out how things are connected. It could be, for example, to determine from all your friends, your friends’ connections, and your friends’ friends’ connections, … to whom you are directly or indirectly connected, or how many degrees of separation you have with such and such connection. Luckily there are some tools at your disposal to perform such analysis. Those tools come under the umbrella of Network Theory and I will cover some basic tricks in this post.
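One basic way to answer the "who am I connected to, and how far away are they" question is a breadth-first search over the friendship graph. The sketch below is a minimal version using only the standard library; the graph and the names in it are made up for illustration, and it is not taken from the linked post.

```python
from collections import deque

def degrees_of_separation(friends, start):
    """Breadth-first search: map each reachable person to their distance from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in friends.get(person, ()):
            if friend not in distance:  # first visit is always the shortest path
                distance[friend] = distance[person] + 1
                queue.append(friend)
    return distance

# Hypothetical undirected friendship graph stored as adjacency lists
friends = {
    "ana": ["ben", "cho"],
    "ben": ["ana", "dev"],
    "cho": ["ana"],
    "dev": ["ben", "eli"],
    "eli": ["dev"],
    "fay": ["gus"],
    "gus": ["fay"],
}

print(degrees_of_separation(friends, "ana"))
# {'ana': 0, 'ben': 1, 'cho': 1, 'dev': 2, 'eli': 3}
```

People who never appear in the result (here "fay" and "gus") are not connected to the starting person at all, directly or indirectly.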

Curiosity-Driven Learning made easy Part I

In recent years, we’ve seen a lot of innovations in Deep Reinforcement Learning. From DeepMind and the Deep Q learning architecture in 2014 to OpenAI playing Dota 2 with OpenAI Five in 2018, we live in an exciting and promising moment. And today we’ll learn about Curiosity-Driven Learning, one of the most exciting and promising strategies in Deep Reinforcement Learning.

AI SERIES: Looking for a ‘Cognitive Operating System’

AI is a field of study that seeks to understand, develop and implement intelligent behavior in hardware and software systems to mimic and expand human-like abilities. To deliver on its promise, AI implements various techniques in the field of Machine Learning (ML), which is a subset of studies that focus on developing software systems with the ability to learn new skills from experience, by trial and error or by applying known rules. Deep Learning (DL) is, so far, the technique in Machine Learning that has, by a wide margin, delivered the most exciting results and practical use cases in domains such as speech and image recognition and language translation, and it plays a role in a wide range of current AI applications.

The Evolution of Analytics with Data

We have made tremendous progress in the field of Information & Technology in recent times. Some of the revolutionary feats achieved in the tech ecosystem are truly commendable. Data and Analytics have been the most commonly used words in the last decade or two. As such, it’s important to know why they are inter-related, what roles in the market are currently evolving and how they are reshaping businesses.

Optimization: Loss Function Under the Hood (Part III)

Continuing this journey, I have discussed the loss function and optimization process of linear regression in Part I and logistic regression in Part II, and this time we are heading to Support Vector Machines.

Loss Function (Part II): Logistic Regression

This series aims to explain the loss functions of a few widely used supervised learning models, and some options for optimization algorithms. In Part I, I walked through the optimization process of Linear Regression in detail, using Gradient Descent and Least Squared Error as the loss function. In this part, I will move on to Logistic Regression.
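For readers who have not seen Part I, the combination it refers to (Gradient Descent on a squared-error loss for Linear Regression) can be sketched in a few lines. This is a minimal illustration on made-up data, not the code from the series; the learning rate and step count are assumptions chosen so that this toy problem converges.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(error^2) with respect to w and b
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

Logistic Regression keeps the same update loop but swaps in a different loss and prediction function, which is the step Part II covers.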

Optimization: Loss Function Under the Hood (Part I)

When building a machine learning model, questions like these usually come to mind: How is a model optimized? Why does Model A outperform Model B? To answer them, I think one entry point is understanding the loss functions of different models and, furthermore, being able to choose an appropriate loss function or define a custom one based on the goal of the project and the tolerance for each error type. I will post a series of blogs discussing loss functions and optimization algorithms of a few common supervised learning models. I will try to explain them in a way that is friendly to readers who don’t have a strong mathematical background. Let’s start with Part I, Linear Regression.

Big Data, the why, how and what – A thought Process and Architecture

One of the most common questions I get when talking with customers is how they can set up a good big data architecture that will allow them to process all their existing data, with the ultimate goal of performing advanced analytics and AI on top of it to extract insights that will allow them to stay relevant in today’s ever faster evolving world. To tackle this issue, I always start by asking them what their understanding of ‘Big Data’ is, because no two customers are alike. One might think that Big Data is just the way they are able to process all their different Excel files, while another might think that it is the holy grail for all their projects and intelligence. In this article, I want to explain what Big Data means to me and provide you with a thought process that can help you define the Big Data strategy for your organization.

Understanding and visualizing DenseNets

Counter-intuitively, by connecting this way DenseNets require fewer parameters than an equivalent traditional CNN, as there is no need to learn redundant feature maps. Furthermore, some variations of ResNets have proven that many layers barely contribute and can be dropped. In fact, the number of parameters in ResNets is large because every layer has its own weights to learn. Instead, DenseNet layers are very narrow (e.g. 12 filters), and they just add a small set of new feature maps. Another problem with very deep networks was the difficulty of training them, because of the mentioned flow of information and gradients. DenseNets solve this issue since each layer has direct access to the gradients from the loss function and the original input image.

ResNet on CIFAR10

The ImageNet dataset consists of a set of images (the authors used 1.28 million training images, 50k validation images and 100k test images) of size (224×224) belonging to 1000 different classes. However, CIFAR10 consists of a different set of images (45k training images, 5k validation images and 10k testing images) distributed into just 10 different classes. Because the sizes of the input volumes (images) are completely different, it is easy to see that the same structure will not be suitable for training on this dataset. We cannot perform the same reductions on the dataset without having dimensionality mismatches. We are going to follow the solution the authors give for training ResNets on CIFAR10, which is as tricky to follow as the one for the ImageNet dataset.

Understanding and visualizing ResNets

Researchers observed that it makes sense to affirm that ‘the deeper the better’ when it comes to convolutional neural networks. This makes sense, since the models should be more capable (their flexibility to adapt to any space increases because they have a bigger parameter space to explore). However, it has been noticed that after some depth, the performance degrades. This was one of the bottlenecks of VGG. They couldn’t go as deep as they wanted, because they started to lose generalization capability.

Text Analytics APIs, Part 2: The Smaller Players

It seems like there’s yet another cloud-based text analytics Application Programming Interface (API) on the market every few weeks. If you’re interested in building an application using these kinds of services, how do you decide which API to go for? In the previous post in this series, we looked at the text analytics APIs from the behemoths in the cloud software world: Amazon, Google, IBM and Microsoft. In this post, we survey sixteen APIs offered by smaller players in the market.

Text Analytics APIs, Part 1: The Bigger Players

If you’re in the market for an off-the-shelf text analytics API, you have a lot of options. You can choose to go with a major player in the software world, for whom each AI-related service is just another entry in their vast catalogues of tools, or you can go for a smaller provider that focusses on text analytics as their core business. In this first of two related posts, we look at what the most prominent software giants have to offer today.

Continue Reading…


Read More

Distilled News

Resource-efficient Machine Learning in 2 KB RAM for the Internet of Things

This paper develops a novel tree-based algorithm, called Bonsai, for efficient prediction on IoT devices – such as those based on the Arduino Uno board having an 8 bit ATmega328P microcontroller operating at 16 MHz with no native floating point support, 2 KB RAM and 32 KB read-only flash. Bonsai maintains prediction accuracy while minimizing model size and prediction costs by: (a) developing a tree model which learns a single, shallow, sparse tree with powerful nodes; (b) sparsely projecting all data into a low-dimensional space in which the tree is learnt; and (c) jointly learning all tree and projection parameters. Experimental results on multiple benchmark datasets demonstrate that Bonsai can make predictions in milliseconds even on slow microcontrollers, can fit in KB of memory, has lower battery consumption than all other algorithms while achieving prediction accuracies that can be as much as 30% higher than state-of-the-art methods for resource-efficient machine learning. Bonsai is also shown to generalize to other resource constrained settings beyond IoT by generating significantly better search results as compared to Bing’s L3 ranker when the model size is restricted to 300 bytes. Bonsai’s code can be downloaded from (BonsaiCode).

The 50 Best Public Datasets for Machine Learning

What are some open datasets for machine learning? After scraping the web for hours on end, we have created a great cheat sheet of high-quality and diverse machine learning datasets.

Forecasting at Uber: An Introduction

This article is the first in a series dedicated to explaining how Uber leverages forecasting to build better products and services. In recent years, machine learning, deep learning, and probabilistic programming have shown great promise in generating accurate forecasts. In addition to standard statistical algorithms, Uber builds forecasting solutions using these three techniques. Below, we discuss the critical components of forecasting we use, popular methodologies, backtesting, and prediction intervals.

Low-Code Development

As organizations work to automate more knowledge-driven business processes, business and IT need to speak the same language. With complex task management passing through multiple stages and decision points, it takes a lot of effort for subject matter experts to explain processes to someone else, particularly a software developer who may or may not have domain knowledge. When requirements are misunderstood, we usually blame the developer for not listening. But in truth, it’s just as common for the business owner to have struggled to properly explain every concept. Learn more about OpenText AppWorks and low-code application development. The low-code application development platform in OpenText AppWorks gives business owners and subject matter experts a direct role in developing digital business automation. It’s an information-centered approach: Business users can think about the policy or workflow from their perspective, defining the information that needs action rather than working around a rigid process.

Deep Learning in the Trenches: Understanding Inception Network from Scratch

This article focuses on the paper ‘Going deeper with convolutions’ from which the hallmark idea of inception network came out. Inception network was once considered a state-of-the-art deep learning architecture (or model) for solving image recognition and detection problems. It put forward a breakthrough performance on the ImageNet Visual Recognition Challenge (in 2014), which is a reputed platform for benchmarking image recognition and detection algorithms. Along with this, it set off a ton of research in the creation of new deep learning architectures with innovative and impactful ideas. We will go through the main ideas and suggestions propounded in the aforementioned paper and try to grasp the techniques within. In the words of the author: ‘In this paper, we will focus on an efficient deep neural network architecture for computer vision, code named Inception, which derives its name from (…) the famous ‘we need to go deeper’ internet meme.’

Must Read Books for Beginners on Machine Learning and Artificial Intelligence

• Machine Learning Yearning
• Programming Collective Intelligence
• Machine Learning for Hackers
• Machine Learning by Tom M Mitchell
• The Elements of Statistical Learning
• Learning from Data
• Pattern Recognition and Machine Learning
• Natural Language Processing with Python
• Artificial Intelligence: A Modern Approach
• Artificial Intelligence for Humans
• Paradigms of Artificial Intelligence Programming
• Artificial Intelligence: A New Synthesis
• The Singularity is Near
• Life 3.0 – Being Human in the Age of Artificial Intelligence
• The Master Algorithm

Time Series Analysis using R

Learn time series analysis with R, including how to use an R forecasting package to fit the optimal model to a real time series.

Introduction To GUI With Tkinter In Python

In this tutorial, you are going to learn how to create GUI apps in Python. You’ll also learn about all the elements needed to create GUI apps in Python.

The Main Approaches to Natural Language Processing Tasks

Let’s have a look at the main approaches to NLP tasks that we have at our disposal. We will then have a look at the concrete NLP tasks we can tackle with said approaches.

Accelerating Your Algorithms in Production With Python and Intel MKL

Numerical algorithms are computationally demanding, which makes performance an important consideration when using Python for machine learning, especially as you move from desktop to production.
In this webinar, we look at:
• Role of productivity and performance for numerical computing and machine learning
• Python algorithm choice and efficient package usage
• Requirements for efficient use of hardware
• NumPy and SciPy performance with the Intel MKL (math kernel library)
• How Intel and ActivePython help you accelerate and scale Python performance

Adversarial Examples, Explained

Deep neural networks – the kind of machine learning models that have recently led to dramatic performance improvements in a wide range of applications – are vulnerable to tiny perturbations of their inputs. We investigate how to deal with these vulnerabilities.

Flight rules for Git

The hard-earned body of knowledge recorded in manuals that list, step-by-step, what to do if X occurs and why. Essentially, they are extremely detailed, scenario-specific standard operating procedures. What to do after you shoot yourself in the foot in interesting ways with Git.

RStudio 1.2 Preview: Stan

We previously discussed improved support in RStudio v1.2 for SQL, D3, Python, and C/C++. Today, we’re excited to announce improved support for the Stan programming language. The Stan programming language makes it possible for researchers and analysts to write high-performance and scalable statistical models.

Building a Personal Computer for AI and Data Science

This is it. Your build guide for constructing your personal home computer for AI and Data Science. There are already a lot of build guides out there for deep learning. Many for Data Science too. And some here and there for reinforcement learning. But very, very few for all of them. That’s why I wrote this guide, for people who were, or may at some point be, interested in all of these fields!

Developing Good Twitter Data Visualizations using Matplotlib

In this article, we will learn how to collect Twitter data and create interesting visualizations in Python. We will briefly explore how to collect tweets using Tweepy, and mostly explore various data visualization techniques for Twitter data using Matplotlib. Before that, data visualization and the overall statistical process that enables it will be explained.

Review: Artificial Intelligence in 2018

Artificial Intelligence is not a buzzword anymore. As of 2018, it is a well-developed branch of Big Data analytics with multiple applications and active projects. Here is a brief review of the topic.
AI is the umbrella term for various approaches to big data analysis, like machine learning models and deep learning networks. We have recently demystified the terms AI, ML and DL and the differences between them, so feel free to check that out. In short, AI algorithms are various data science mathematical models that help improve the outcome of a certain process or automate some routine task.
However, the technology has now matured enough to move these data science advancements from the pilot projects phase to the stage of production-ready deployment at scale. Below is the overview of various aspects of AI technology adoption across the IT industry in 2018.
We take a look at the parameters like:
• the most widely used types of AI algorithms,
• the way the companies apply the AI,
• the industries where AI implementation will have the most impact
• the most popular languages, libraries, and APIs used for AI development
That said, the numbers used in this review come from a variety of open sources like Statista, Forbes, BigDataScience, DZone and others.

Will Models Rule the World?

We reached out to the speakers to ask them about the importance of model-based decision making, how models combine with creativity, and the future of models for the industry.

Build your first chatbot using Python NLTK

A chatbot (also known as a talkbot, chatterbot, Bot, IM bot, interactive agent, or Artificial Conversational Entity) is a computer program or an artificial intelligence which conducts a conversation via auditory or textual methods. Such programs are often designed to convincingly simulate how a human would behave as a conversational partner, thereby passing the Turing test. Chatbots are typically used in dialog systems for various practical purposes including customer service or information acquisition. Some chatterbots use sophisticated natural language processing systems, but many simpler systems scan for keywords within the input, then pull a reply with the most matching keywords, or the most similar wording pattern, from a database.
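The "simpler systems" mentioned above can be sketched in a few lines: split the input into words, count how many keywords each stored reply shares with it, and return the reply with the most hits. This sketch uses only the Python standard library rather than NLTK, and the reply database, keywords, and fallback text are all made up for illustration.

```python
import re

# Hypothetical reply database: each reply is tagged with its trigger keywords
REPLIES = [
    ({"hello", "hi", "hey"}, "Hello! How can I help?"),
    ({"refund", "money", "return"}, "You can request a refund from your orders page."),
    ({"hours", "open", "closing"}, "We are open 9am to 5pm on weekdays."),
]
FALLBACK = "Sorry, I did not understand that."

def reply_to(message):
    """Return the stored reply whose keywords overlap most with the message."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    best_reply, best_hits = FALLBACK, 0
    for keywords, reply in REPLIES:
        hits = len(words & keywords)
        if hits > best_hits:  # ties keep the earlier entry
            best_reply, best_hits = reply, hits
    return best_reply

print(reply_to("Hi there"))
print(reply_to("Can I return this and get my money back?"))
print(reply_to("What is the weather like?"))
```

A keyword counter like this has no notion of word order or meaning, which is where the more sophisticated natural language processing systems come in.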

Scala for Data Science Engineering – Part 1

Data Science is an interesting field to work in, a combination of statistics and real-world programming. There are a number of programming languages used by Data Science Engineers, each of which has unique features. The most famous among them are Scala, Python and R. Since I am working with Scala at work, I would like to share some of the most important concepts that I have come across and that are worthwhile for beginners in Data Science Engineering.

Scala for Data Science Engineering – Part 2

Hi everyone, let’s continue the discussion on Scala for Data Science Engineering. Find the first part here. In this part I will discuss Partial Functions, Pattern Matching & Case Classes, Collections, Currying and Implicits.

Continue Reading…


Read More

BI to AI: Getting Intelligent Insights to Everyone

Tellius is a Search and AI-Powered Analytics Platform that makes it easy for users to ask questions of their business data and discover insights using machine learning, with just a single click.

Continue Reading…


Read More

Apache Spark Introduction for Beginners

An extensive introduction to Apache Spark, including a look at the evolution of the product, use cases, architecture, ecosystem components, core concepts and more.

Continue Reading…


Read More

New Jobs Sure to Emerge Alongside Artificial Intelligence

There’s a lot of doomsaying about AI pushing humans out of jobs and destroying entire industries. Is it as bad as all that? Maybe not!

Continue Reading…


Read More

Adam: “It would have been much harder without Dataquest”


Adam Zabrodski didn’t originally plan to become a data scientist. In college, he studied rocks. But over a sequence of jobs that included Ops Management at Uber and a “brutal slog” in investment banking, he found that he enjoyed analytics and working with data.

The problem? Adam had no background in math or computer science. He didn’t have the tools to take his interest in data to the next level. “It was obvious that my scope was being limited,” he said. “There’s only so much you can do in Excel.”

He experimented a little with learning Python at sites like Coursera and Codecademy, but it didn’t stick. “Python seemed tedious and horrible,” he said.

Then he moved to an agency that built mobile and ecommerce websites. He saw his teammates using Python, and realized how powerful it could be. He decided to give Python another shot, and settled on going through Dataquest’s Python path during his off time, with the ultimate aim of becoming a data scientist.

Then, suddenly, he found himself without a job: he and all of his coworkers were laid off.

That’s the kind of surprise blow that might have derailed someone less dedicated, but with his days unexpectedly free, Adam decided to take those lemons and make lemonade. He doubled down on the Dataquest lessons, spending five hours a day moving through the content.

Five months later, he’d finished everything, and soon after that he landed his first job as a data scientist.

“I think I was surprised at how difficult it was until it wasn’t anymore,” he said. “I once spent two hours because of an uppercase K instead of lowercase.” But his advice to his fellow students is to push ahead anyway: “If you keep going, it gets better. Take breaks when you need to, but come back to it.”

Adam now works for world-famous yoga brand Lululemon. “My current title is Senior Analyst Guest Insights,” he said, “but really I'm a data scientist. I perform clustering models on our customers, random forests to predict churn, and am now working on deploying some lifetime value models to help with digital marketing.”

Getting that job “would have been much harder without Dataquest,” he said. “It’s a great product. I still recommend it to anyone who asks me about how to get started.”

Feeling inspired? Dive in and start (or continue) your own data science journey.

Continue Reading…


Read More

Graphs Are The Next Frontier In Data Science

GraphConnect 2018, Neo4j’s bi-annual conference, was held in New York City in mid-September. Read about what happened, and why graphs are the next big thing in data science.

Continue Reading…


Read More

Predicting spread of flu

Aleks points us to this page on flu prediction. I haven’t looked into it but it seems like an important project.

The post Predicting spread of flu appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continue Reading…


Read More

A closer look at the U.S.-Mexico border

The Washington Post provides a flyover view of the barriers at the U.S.-Mexico border. It’s a combination of satellite imagery, path overlays, and information panels as you scroll. It gives an inkling of an idea of the challenges involved when people try to cross the border.


Continue Reading…


Read More

✚ Data Graphics Workflow

As I worked on a wide range of charts recently, I got to thinking about workflow. How does one get from dataset to finished data graphic? This is my process.

Continue Reading…


Read More

Four short links: 18 October 2018

Git Playbook, Lessons Learned, Neural NLP, and Landscape Generation

  1. Flight Rules for Git -- the hard-earned body of knowledge recorded in manuals that list, step-by-step, what to do if X occurs and why. Essentially, they are extremely detailed, scenario-specific standard operating procedures. What to do after you shoot yourself in the foot in interesting ways with Git.
  2. Lessons Learned from Creating a Rich-Text Editor with Real-Time Collaboration -- This article describes how we approached the problem and what challenges we had to overcome in order to provide real-time collaborative editing capable of handling rich text. Check it out if you are interested in: learning what problems you may face when implementing real-time collaborative editing, building a rich-text editor with support for real-time collaboration, and how we approached collaborative editing in CKEditor 5.
  3. A Review of the Recent History of Natural Language Processing -- This post will discuss major recent advances in NLP focusing on neural network-based methods.
  4. Landscape -- software that builds the Cloud-Native Computing Foundation's landscape of products.


Continue Reading…


Read More

Examining Inter-Rater Reliability in a Reality Baking Show

(This article was first published on Mattan S. Ben-Shachar, and kindly contributed to R-bloggers)

Game of Chefs is an Israeli reality cooking competition show, where chefs compete for the title of “Israel’s most talented chef”. The show has four stages: blind auditions, training camp, kitchen battles, and finals. Here I will examine the 4th season’s (Game of Chefs: Confectioner) inter-rater reliability in the blind auditions, using some simple R code.

The Format

In the blind auditions, candidates have 90 minutes to prepare a dessert, confection or other baked good, which is then sent to the show’s four judges for a blind taste test. The judges don’t see the candidate, and know nothing about him or her, until after they deliver their decision – A “pass” decision is signified by the judge granting the candidate a kitchen knife; a “fail” decision is signified by the judge not granting a knife. A candidate who receives a knife from at least three of the judges continues on to the next stage, the training camp.

The 4 judges and host, from left to right: Yossi Shitrit, Erez Komarovsky, Miri Bohadana (host), Assaf Granit, and Moshik Roth

Inter-Rater Reliability

I’ve watched all 4 episodes of the blind auditions (for purely academic reasons!), for a total of 31 candidates. For each dish, I recorded each of the four judges’ verdict (fail / pass).

Let’s load the data.

df <- read.csv('')
head(df)

##   Moshik Assaf Erez Yossi
## 1      1     1    1     1
## 2      1     0    0     0
## 3      1     1    1     0
## 4      1     0    0     0
## 5      0     1    1     1
## 6      1     1    1     1

We will need the following packages:

library(dplyr) # for manipulating the data
library(psych) # for computing Cohen's Kappa
library(corrr) # for manipulating matrices

We can now use the psych package to compute Cohen’s Kappa coefficient for inter-rater agreement for categorical items, such as the fail/pass categories we have here. $\kappa$ ranges from -1 (full disagreement) through 0 (agreement no better than chance) to +1 (full agreement). $\kappa$ is defined for two raters; when there are more than two raters, the mean $\kappa$ across all pairs of raters is used.

cohen.kappa(df)

## Cohen Kappa (below the diagonal) and Weighted Kappa (above the diagonal)
## For confidence intervals and detail print with all=TRUE
##        Moshik Assaf  Erez Yossi
## Moshik   1.00  0.34  0.17  0.29
## Assaf    0.34  1.00  0.17  0.68
## Erez     0.17  0.17  1.00 -0.16
## Yossi    0.29  0.68 -0.16  1.00
## Average Cohen kappa for all raters 0.25
## Average weighted kappa for all raters 0.25

We can see that overall $\kappa = 0.25$, which is surprisingly low for what might be expected from a group of pristine, accomplished, professional chefs and bakers in a blind taste test.

When examining the pair-wise coefficients, we can also see that Erez seems to be in lowest agreement with each of the other judges (and even in a slight disagreement with Yossi!). This might be because Erez is new on the show (this is his first season as judge), but it might also be because of the four judges, he is the only one who is actually a confectioner (the other 3 are restaurant chefs).
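
To make the definition of $\kappa$ concrete, here is a minimal base-R sketch that computes it by hand as (observed agreement - chance agreement) / (1 - chance agreement). The helper `kappa_by_hand()` and the `toy` data frame are illustrative names, not part of the original analysis, and `toy` contains only the six rows printed above, so its values will not match the coefficients computed from all 31 candidates.

```r
# Hypothetical helper (not from psych): Cohen's kappa for two raters.
kappa_by_hand <- function(a, b) {
  lv  <- sort(unique(c(a, b)))
  tab <- table(factor(a, levels = lv), factor(b, levels = lv))
  p   <- tab / sum(tab)                   # joint proportions
  p_obs <- sum(diag(p))                   # observed agreement
  p_exp <- sum(rowSums(p) * colSums(p))   # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

# Assumption: toy data, just the first six candidates shown by head(df).
toy <- data.frame(
  Moshik = c(1, 1, 1, 1, 0, 1),
  Assaf  = c(1, 0, 1, 0, 1, 1),
  Erez   = c(1, 0, 1, 0, 1, 1),
  Yossi  = c(1, 0, 0, 0, 1, 1)
)

# Kappa for one pair of judges (about -0.29 on these six rows only).
kappa_by_hand(toy$Moshik, toy$Assaf)

# Mean kappa across all pairs of judges.
pairs <- combn(names(toy), 2)
pair_kappas <- apply(pairs, 2, function(p) kappa_by_hand(toy[[p[1]]], toy[[p[2]]]))
mean(pair_kappas)
```

With only six rows the estimates are very noisy, which is one reason to rely on `cohen.kappa()` applied to the full data rather than on this sketch.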

For curiosity’s sake, let’s also look at the $\kappa$ coefficient between each judge’s rating and the total fail/pass decision, based on whether a dish got a “pass” from at least 3 judges.

df <- mutate(df,
             PASS = as.numeric(rowSums(df) >= 3))
head(df)

##   Moshik Assaf Erez Yossi PASS
## 1      1     1    1     1    1
## 2      1     0    0     0    0
## 3      1     1    1     0    1
## 4      1     0    0     0    0
## 5      0     1    1     1    1
## 6      1     1    1     1    1

We can now use the wonderful new corrr package, which is intended for exploring correlations, but can also be used more generally to manipulate any symmetric matrix in a tidy fashion.

cohen.kappa(df)$cohen.kappa %>%
  as_cordf() %>%
  focus(PASS)

## # A tibble: 4 x 2
##   rowname  PASS
## 1 Moshik  0.619
## 2 Assaf   0.746
## 3 Erez    0.159
## 4 Yossi   0.676

Perhaps unsurprisingly (to people familiar with previous seasons of the show), it seems that Assaf’s judgment of a dish is a good indicator of whether or not a candidate will continue on to the next stage. Also, once again we see that Erez is barely aligned with the other judges’ total decision.
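
For readers without corrr installed, the idea of focusing on one column of a symmetric matrix can be sketched in base R. The helper `focus_base()` is a hypothetical stand-in, not corrr's implementation. The matrix holds the pairwise values printed in the `cohen.kappa()` output above, without the PASS column, because the full data are not reproduced here.

```r
# Pairwise kappa values copied from the cohen.kappa() output shown earlier.
judges <- c("Moshik", "Assaf", "Erez", "Yossi")
k <- matrix(
  c(1.00, 0.34,  0.17,  0.29,
    0.34, 1.00,  0.17,  0.68,
    0.17, 0.17,  1.00, -0.16,
    0.29, 0.68, -0.16,  1.00),
  nrow = 4, byrow = TRUE, dimnames = list(judges, judges)
)

# Hypothetical helper: keep one variable as the column and drop its own row,
# which is the idea behind focusing on a single variable.
focus_base <- function(m, var) {
  keep <- rownames(m) != var
  data.frame(rowname = rownames(m)[keep], value = unname(m[keep, var]))
}

erez <- focus_base(k, "Erez")
erez               # Erez's kappa with each of the other three judges
mean(erez$value)   # his average pairwise agreement
```

Dropping the focused variable's own row removes the uninformative diagonal entry (a judge always agrees perfectly with himself).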


Every man to his taste…

Even among the experts there is little agreement on what is fine cuisine and what is not worth a doggy bag. Having said that, if you still have your heart set on competing in Game of Chefs, it seems that you should at least appeal to Assaf’s palate.

Bon Appétit!



What operations professionals need to know to fuel career advancement

O’Reilly’s new survey reveals the latest operations salary trends, and the skill sets that will keep your operations career on track.

O’Reilly recently conducted a survey[1] of operations professionals, and the results offer useful information and insights to inform your career planning. As you’d expect, the survey revealed that respondents put emphasis on their salaries when evaluating their careers, but they also pay close attention to company and team attributes, job activities, role responsibilities, and evolving skill set requirements.

How operations salaries add up

Survey results show that in 2018, the median annual salary for operations professionals clocks in at $90,000. Salary increases with age and experience: someone with more than 20 years of experience can earn a median income of around $123,000.

Figure 1. Operations salaries by years of experience. Image credit: O'Reilly.

The company, team, and industry all make a difference

The larger the company, the more you should expect to earn. For example, the median salary for companies employing 2 to 100 people is slightly more than $78,000. Jump to companies with more than 10,000 employees and the average income rises to $114,000. Interestingly, the age of a company is not a huge factor in determining compensation.

Team size, however, does make a difference among survey respondents. The general trend is that the larger the team size, the higher the median salary. Keep in mind that joining a bigger team does not necessarily equate to a pay increase. Larger teams usually mean more senior team members, team leads, and an established hierarchy. Increased responsibility generates increased compensation.

The industry where you work does affect compensation. About a third of survey respondents work in the software industry, and they report a median salary of $95,000. Operations professionals working for high-paying health care and medical companies see a median salary of $113,000.

Where time spent impacts dollars earned

It seems the more coding you do as part of your job, the less you earn. For survey respondents who code one to three hours per week, the median salary is around $94,000. Spend 20 hours or more per week on coding tasks and the median salary drops to $82,000. You can attribute this to several factors. One, as you become more senior in your organization, increased responsibilities leave less time for coding. And two, if you are part of an organization with many coders, both entry-level staff and interns bring down the median salary.

For those not fond of attending meetings, here’s a survey result you might not want to see: the more time you spend in meetings, the higher the median salary. Those who spend more than 20 hours per week in meetings have a median salary of $140,000. Of course, meetings can be a proxy for responsibility, so booking yourself into every optional meeting will not increase salary automatically.

Speaking the same programming language

Scripting languages are the most popular programming languages among respondents, with Bash being the most used (66% of respondents), followed by Python (63%), and JavaScript (42%).

Go is used by 20% of respondents, and those who use Go tend to have one of the higher median salaries at $102,000, similar to LISP and Swift. This could be related to the types of companies that are pushing these programming languages. Google and Apple, for example, are very large companies and, as noted, salary and company size are related.

And what about the operating system in which respondents work? Linux tops the charts at 87% usage. Windows is also used frequently (63%), often as a mix between workstations and servers, and in some cases as a front end for Linux/Unix servers.

Education pays

Computer science, mathematics, statistics, and physics are the top fields of study for operations professionals. Advanced degrees do have a positive impact on salary. Respondents with a master’s degree report a median salary of $82,000, whereas those with a doctorate report a median salary of $98,000.

Planning your next operations career move

One third of survey respondents agree that the next best step to career advancement is to learn a new skill or technology. This makes sense, as the technology landscape is evolving quickly and you need to acquire new skills to keep up.

Wanting to work on more interesting or important projects is a motivator for career change among some respondents (25%), as is the desire to move into leadership roles (15%). Only 12% of respondents want to switch companies.

Other things respondents keep top of mind when pondering their operations career paths include non-monetary compensation such as job flexibility, work-life balance, location, and company culture.

Looking for more data to guide your career development? Download the 2018 Annual IT/Ops Salary Survey for free.


R Packages worth a look

log Normal Linear Regression (logNormReg)
Functions to fit simple linear regression models with log normal errors and identity link (taking the responses on the original scale). See Muggeo (20 …

Create Interactive Trelliscope Displays (trelliscopejs)
Trelliscope is a scalable, flexible, interactive approach to visualizing data (Hafen, 2013 <doi:10.1109/LDAV.2013.6675164>). This package provide …

Biclustering via Latent Block Model Adapted to Overdispersed Count Data (cobiclust)
Implementation of a probabilistic method for biclustering adapted to overdispersed count data. It is a Gamma-Poisson Latent Block Model. It also implem …


Thanks for reading!