Linear Digressions Podcast

So long, and thanks for all the fish from 2020-07-26T23:32:44

All good things must come to an end, including this podcast. This is the last episode we plan to release, and it doesn’t cover data science—it’s mostly reminiscing, thanking our wonderful audience ...

Listen

A Reality Check on AI-Driven Medical Assistants from 2020-07-19T23:51:31

The data science and artificial intelligence community has made amazing strides in the past few years to algorithmically automate portions of the healthcare process. This episode looks at two compu...

Listen

A Data Science Take on Open Policing Data from 2020-07-13T02:02:39

A few weeks ago, we put out a call for data scientists interested in issues of race and racism, or people studying how those topics can be studied with data science methods, to get in touch to ...

Listen

The Data Science Open Source Ecosystem from 2020-06-29T02:34:48

Open source software is ubiquitous throughout data science, and enables the work of nearly every data scientist in some way or another. Open source projects, however, are disproportionately maintai...

Listen

Criminology and Data Science from 2020-06-15T01:26:26

This episode features Zach Drake, a working data scientist and PhD candidate in the Criminology, Law and Society program at George Mason University. Zach specializes in bringing data science method...

Listen

Racism, the criminal justice system, and data science from 2020-06-07T23:33:53

As protests sweep across the United States in the wake of the killing of George Floyd by a Minneapolis police officer, we take a moment to dig into one of the ways that data science perpetuates and...

Listen

An interstitial word from Ben from 2020-06-05T01:38:43

A message from Ben around algorithmic bias, and how our models are sometimes reflections of ourselves.

Listen

Convolutional Neural Networks from 2020-05-31T21:46:31

This is a re-release of an episode that originally aired on April 1, 2018. If you've done image recognition or computer vision tasks with a neural network, you've probably used a convolutional neur...

Listen

Protecting Individual-Level Census Data with Differential Privacy from 2020-05-18T01:49:22

The power of finely-grained, individual-level data comes with a drawback: it compromises the privacy of potentially anyone and everyone in the dataset. Even for de-identified datasets, there can be...

Listen

Causal Trees from 2020-05-11T01:34:33

What do you get when you combine the causal inference needs of econometrics with the data-driven methodology of machine learning? Usually these two don’t go well together (deriving causal conclusio...

Listen

The Grammar Of Graphics from 2020-05-04T01:12:53

You may not realize it consciously, but beautiful visualizations have rules. The rules are often implicit and manifest themselves as expectations about how the data is summarized, presented, and ann...

Listen

Gaussian Processes from 2020-04-27T01:33:43

It’s pretty common to fit a function to a dataset when you’re a data scientist. But in many cases, it’s not clear what kind of function might be most appropriate—linear? quadratic? sinusoidal? some...

Listen

Keeping ourselves honest when we work with observational healthcare data from 2020-04-20T02:43:37

The abundance of data in healthcare, and the value we could capture from structuring and analyzing that data, is a huge opportunity. It also presents huge challenges. One of the biggest challenges ...

Listen

Changing our formulation of AI to avoid runaway risks: Interview with Prof. Stuart Russell from 2020-04-13T01:55:01

AI is evolving incredibly quickly, and thinking now about where it might go next (and how we as a species and a society should be prepared) is critical. Professor Stuart Russell, an AI expert at UC...

Listen

Putting machine learning into a database from 2020-04-06T01:51:56

Most data scientists bounce back and forth regularly between doing analysis in databases using SQL and building and deploying machine learning pipelines in R or Python. But if we think ahead a few ...

Listen

The work-from-home episode from 2020-03-29T22:23:42

Many of us have the privilege of working from home right now, in an effort to keep ourselves and our family safe and slow the transmission of covid-19. But working from home is an adjustment for ma...

Listen

Understanding Covid-19 transmission: what the data suggests about how the disease spreads from 2020-03-23T01:03:34

Covid-19 is turning the world upside down right now. One thing that’s extremely important to understand, in order to fight it as effectively as possible, is how the virus spreads and especially how...

Listen

Network effects re-release: when the power of a public health measure lies in widespread adoption from 2020-03-15T22:43:38

This week’s episode is a re-release of a recent episode, which we don’t usually do but it seems important for understanding what we can all do to slow the spread of covid-19. In brief, public healt...

Listen

Causal inference when you can't experiment: difference-in-differences and synthetic controls from 2020-03-09T01:39:19

When you need to untangle cause and effect, but you can’t run an experiment, it’s time to get creative. This episode covers difference in differences and synthetic controls, two observational causa...

Listen

Better know a distribution: the Poisson distribution from 2020-03-02T02:55:28

This is a re-release of an episode that originally ran on October 21, 2018. The Poisson distribution is a probability distribution function used for events that happen in time or space. It’s su...

Listen

The Lottery Ticket Hypothesis from 2020-02-23T23:03:25

Recent research into neural networks reveals that sometimes, not all parts of the neural net are equally responsible for the performance of the network overall. Instead, it seems like (in some neu...

Listen

Interesting technical issues prompted by GDPR and data privacy concerns from 2020-02-17T01:50:20

Data privacy is a huge issue right now, after years of consumers and users gaining awareness of just how much of their personal data is out there and how companies are using it. Policies like GDPR ...

Listen

Thinking of data science initiatives as innovation initiatives from 2020-02-10T01:10:21

Put yourself in the shoes of an executive at a big legacy company for a moment, operating in virtually any market vertical: you’re constantly hearing that data science is revolutionizing the world ...

Listen

Building a curriculum for educating data scientists: Interview with Prof. Xiao-Li Meng from 2020-02-02T23:36:23

As demand for data scientists grows, and it remains as relevant as ever that practicing data scientists have a solid methodological and technical foundation for their work, higher education institu...

Listen

Running experiments when there are network effects from 2020-01-27T00:13:52

Traditional A/B tests assume that whether or not one person got a treatment has no effect on the experiment outcome for another person. But that’s not a safe assumption, especially when there are n...

Listen

Zeroing in on what makes adversarial examples possible from 2020-01-20T02:41:20

Adversarial examples are really, really weird: pictures of penguins that get classified with high certainty by machine learning algorithms as drumsets, or random noise labeled as pandas, or any one...

Listen

Unsupervised Dimensionality Reduction: UMAP vs t-SNE from 2020-01-13T00:53:19

Dimensionality reduction redux: this episode covers UMAP, an unsupervised algorithm designed to make high-dimensional data easier to visualize, cluster, etc. It’s similar to t-SNE but has some adva...

Listen

Data scientists: beware of simple metrics from 2020-01-05T22:54:57

Picking a metric for a problem means defining how you’ll measure success in solving that problem. Which sounds important, because it is, but oftentimes new data scientists only get experience with ...

Listen

Communicating data science, from academia to industry from 2019-12-30T01:53:14

For something as multifaceted and ill-defined as data science, communication and sharing best practices across the field can be extremely valuable but also extremely, well, multifaceted and ill-def...

Listen

Optimizing for the short-term vs. the long-term from 2019-12-23T02:50:53

When data scientists run experiments, like A/B tests, it’s really easy to plan on a period of a few days to a few weeks for collecting data. The thing is, the change that’s being evaluated might ha...

Listen

Interview with Prof. Andrew Lo, on using data science to inform complex business decisions from 2019-12-16T03:15:09

This episode features Prof. Andrew Lo, the author of a paper that we discussed recently on Linear Digressions, in which Prof. Lo uses data to predict whether a medicine in the development pipeline ...

Listen

Using machine learning to predict drug approvals from 2019-12-08T22:56:05

One of the hottest areas in data science and machine learning right now is healthcare: the size of the healthcare industry, the amount of data it generates, and the myriad improvements possible in ...

Listen

Facial recognition, society, and the law from 2019-12-02T03:14:14

Facial recognition being used in everyday life seemed far-off not too long ago. Increasingly, it’s being used and advanced widely and with increasing speed, which means that our technical capabilit...

Listen

Lessons learned from doing data science, at scale, in industry from 2019-11-25T00:45:42

If you’ve taken a machine learning class, or read up on A/B tests, you likely have a decent grounding in the theoretical pillars of data science. But if you’re in a position to have actually built ...

Listen

Varsity A/B Testing from 2019-11-18T02:09:46

When you want to understand if doing something causes something else to happen, like if a change to a website causes a dip or rise in downstream conversions, the gold standard analysis method is ...

Listen

The Care and Feeding of Data Scientists: Growing Careers from 2019-11-11T03:44:18

This is the third and final installment of a conversation with Michelangelo D’Agostino, VP of Data Science and Engineering at Shoprunner, about growing and mentoring data scientists on your team. Some o...

Listen

The Care and Feeding of Data Scientists: Recruiting and Hiring Data Scientists from 2019-11-04T00:21:56

This week’s episode is the second in a three-part interview series with Michelangelo D’Agostino, VP of Data Science at Shoprunner. This discussion centers on building a team, which means recruiting...

Listen

The Care and Feeding of Data Scientists: Becoming a Data Science Manager from 2019-10-28T01:27:58

Data science management isn’t easy, and many data scientists are finding themselves learning on the job how to manage data science teams as they get promoted into more formal leadership roles. O’Re...

Listen

Procella: YouTube's super-system for analytics data storage from 2019-10-21T01:27:45

If you’re trying to manage a project that serves up analytics data for a few very distinct uses, you’d be wise to consider having custom solutions for each use case that are optimized for the needs...

Listen

What's really so hard about feature engineering? from 2019-10-06T22:37:49

Feature engineering is ubiquitous but gets surprisingly difficult surprisingly fast. What could be so complicated about just keeping track of what data you have, and how you made it? A lot, as it t...

Listen

Data storage for analytics: stars and snowflakes from 2019-09-30T11:22:15

If you’re a data scientist or data engineer thinking about how to store data for analytics uses, one of the early choices you’ll have to make (or live with, if someone else made it) is how to lay o...

Listen

Data storage: transactions vs. analytics from 2019-09-23T01:49:59

Data scientists and software engineers both work with databases, but they use them for different purposes. So if you’re a data scientist thinking about the best way to store and access data for you...

Listen

GROVER: an algorithm for making, and detecting, fake news from 2019-09-16T03:21:34

There are a few things that seem to be very popular in discussions of machine learning algorithms these days. First is the role that algorithms play now, or might play in the future, when it comes ...

Listen

Data science teams as innovation initiatives from 2019-09-09T02:24:55

When a big, established company is thinking about their data science strategy, chances are good that whatever they come up with, it’ll be somewhat at odds with the company’s current structure and p...

Listen

Organizational Models for Data Scientists from 2019-08-25T23:06:52

When data science is hard, sometimes it’s because the algorithms aren’t converging or the data is messy, and sometimes it’s because of organizational or business issues: the data scientists aren’t ...

Listen

Data Shapley from 2019-08-19T02:38:16

We talk often about which features in a dataset are most important, but recently a new paper has started making the rounds that turns the idea of importance on its head: Data Shapley is an algorith...

Listen

Putting the "science" in data science: the scientific method, the null hypothesis, and p-hacking from 2019-07-29T01:30:54

The modern scientific method is one of the greatest (perhaps the greatest?) systems we have for discovering knowledge about the world. It’s no surprise then that many data scientists have found thei...

Listen

Interleaving from 2019-07-22T12:20:58

If you’re Google or Netflix, and you have a recommendation or search system as part of your bread and butter, what’s the best way to test improvements to your algorithm? A/B testing is the canonica...

Listen

Deepfakes from 2019-07-01T01:25:07

Generative adversarial networks (GANs) are producing some of the most realistic artificial videos we’ve ever seen. These videos are usually called “deepfakes”. Even to an experienced eye, it can be...

Listen

Revisiting Biased Word Embeddings from 2019-06-24T00:26:07

The topic of bias in word embeddings gets yet another pass this week. It all started a few years ago, when an analogy task performed on Word2Vec embeddings showed some indications of gender bias ar...

Listen

Attention in Neural Nets from 2019-06-17T00:28:35

There’s been a lot of interest lately in the attention mechanism in neural nets—it’s got a colloquial name (who’s not familiar with the idea of “attention”?) but it’s more like a technical trick th...

Listen

Interview with Joel Grus from 2019-06-10T02:05:47

This week’s episode is a special one, as we’re welcoming a guest: Joel Grus is a data scientist with a strong software engineering streak, and he does an impressive amount of speaking, writing, and...

Listen

Re - Release: Factorization Machines from 2019-06-03T01:32:39

What do you get when you cross a support vector machine with matrix factorization? You get a factorization machine, and a darn fine algorithm for recommendation engines.

Listen

Re-release: Auto-generating websites with deep learning from 2019-05-27T02:01:11

We've already talked about neural nets in some detail (links below), and in particular we've been blown away by the way that image recognition from convolutional neural nets can be fed into recurre...

Listen

Advice to those trying to get a first job in data science from 2019-05-19T21:50:13

We often hear from folks wondering what advice we can give them as they search for their first job in data science. What does a hiring manager look for? Should someone focus on taking classes onlin...

Listen

Re - Release: Machine Learning Technical Debt from 2019-05-12T23:07:14

This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you've worked in software before, you're probably familiar with the i...

Listen

Estimating Software Projects, and Why It's Hard from 2019-05-05T22:27:24

If you’re like most software engineers and, especially, data scientists, you find it really hard to make accurate estimates of how long a project will take to complete. Don’t feel bad: statistics i...

Listen

The Black Hole Algorithm from 2019-04-29T00:55:57

53.5 million light-years away, there’s a gigantic galaxy called M87 with something interesting going on inside it. Between Einstein’s theory of relativity and the motion of a group of stars in the ...

Listen

Structure in AI from 2019-04-21T22:29:02

As artificial intelligence algorithms get applied to more and more domains, a question that often arises is whether to somehow build structure into the algorithm itself to mimic the structure of th...

Listen

The Great Data Science Specialist vs. Generalist Debate from 2019-04-15T00:55:41

It’s not news that data scientists are expected to be capable in many different areas (writing software, designing experiments, analyzing data, talking to non-technical stakeholders). One thing tha...

Listen

Google X, and Taking Risks the Smart Way from 2019-04-08T01:10:57

If you work in data science, you’re well aware of the sheer volume of high-risk, high-reward projects that are hypothetically possible. The fact that they’re high-reward means they’re exciting to t...

Listen

Statistical Significance in Hypothesis Testing from 2019-04-01T01:34:53

When you are running an AB test, one of the most important questions is how much data to collect. Collect too little, and you can end up drawing the wrong conclusion from your experiment. But in a ...

Listen

The Language Model Too Dangerous to Release from 2019-03-25T01:39:45

OpenAI recently created a cutting-edge new natural language processing model, but unlike all their other projects so far, they have not released it to the public. Why? It seems to be a little too g...

Listen

The cathedral and the bazaar from 2019-03-17T22:47:01

Imagine you have two choices of how to build something: top-down and controlled, with a few people playing a master designer role, or bottom-up and free-for-all, with nobody playing an explicit arc...

Listen

AlphaStar from 2019-03-11T01:18:26

It’s time for our latest installment in the series on artificial intelligence agents beating humans at games that we thought were safe from the robots. In this case, the game is StarCraft, and the...

Listen

Are machine learning engineers the new data scientists? from 2019-03-04T02:57:19

For many data scientists, maintaining models and workflows in production is both a huge part of their job and not something they necessarily trained for if their background is more in statistics or...

Listen

Interview with Alex Radovic, particle physicist turned machine learning researcher from 2019-02-25T01:59:03

You’d be hard-pressed to find a field with bigger, richer, and more scientifically valuable data than particle physics. Years before “data scientist” was even a term, particle physicists were inven...

Listen

K Nearest Neighbors from 2019-02-17T23:57:23

K Nearest Neighbors is an algorithm with secrets. On one hand, the algorithm itself is as straightforward as possible: find the labeled points nearest the point that you need to predict, and make a...

Listen

Not every deep learning paper is great. Is that a problem? from 2019-02-11T00:06:33

Deep learning is a field that’s growing quickly. That’s good! There are lots of new deep learning papers put out every day. That’s good too… right? What if not every paper out there is particularly...

Listen

The Assumptions of Ordinary Least Squares from 2019-02-03T23:24:15

Ordinary least squares (OLS) is often used synonymously with linear regression. If you’re a data scientist, machine learner, or statistician, you bump into it daily. If you haven’t had the opportun...

Listen

Quantile Regression from 2019-01-28T01:27:40

Linear regression is a great tool if you want to make predictions about the mean value that an outcome will have given certain values for the inputs. But what if you want to predict the median? Or ...

Listen

Heterogeneous Treatment Effects from 2019-01-20T23:57:56

When data scientists use a linear regression to look for causal relationships between a treatment and an outcome, what they’re usually finding is the so-called average treatment effect. In other wo...

Listen

Pre-training language models for natural language processing problems from 2019-01-14T00:42:31

When you build a model for natural language processing (NLP), such as a recurrent neural network, it helps a ton if you’re not starting from zero. In other words, if you can draw upon other dataset...

Listen

Re-release: Word2Vec from 2018-12-31T01:56:03

Bringing you another old classic this week, as we gear up for 2019! See you next week with new content. Word2Vec is probably the go-to algorithm for vectorizing text data these days. Which makes ...

Listen

Re - Release: The Cold Start Problem from 2018-12-23T20:23:33

We’re taking a break for the holidays, chilling with the dog and an eggnog (Katie) and the cat and some spiced cider (Ben). Here’s an episode from a while back for you to enjoy. See you again in 20...

Listen

Convex (and non-convex) Optimization from 2018-12-17T03:06:42

Convex optimization is one of the keys to data science, both because some problems straight-up call for optimization solutions and because popular algorithms like a gradient descent solution to ord...

Listen

The Normal Distribution and the Central Limit Theorem from 2018-12-09T18:58:28

When you think about it, it’s pretty amazing that we can draw conclusions about huge populations, even the whole world, based on datasets that are comparatively very small (a few thousand, or a few...

Listen

Software 2.0 from 2018-12-02T23:23:05

Neural nets are a way you can model a system, sure, but if you take a step back, squint, and tilt your head, they can also be called… software? Not in the sense that they’re written in code, but in...

Listen

Limitations of Deep Nets for Computer Vision from 2018-11-18T19:01:28

Deep neural nets have a deserved reputation as the best-in-breed solution for computer vision problems. But there are many aspects of human vision that we take for granted but where neural nets str...

Listen

Building Data Science Teams from 2018-11-12T03:16:46

At many places, data scientists don’t work solo anymore—it’s a team sport. But data science teams aren’t simply teams of data scientists working together. Instead, they’re usually cross-functional ...

Listen

Optimized Optimized Web Crawling from 2018-11-04T21:38:32

Last week’s episode, about methods for optimized web crawling logic, left off on a bit of a cliffhanger: the data scientists had found a solution to the problem, but it wasn’t something that the en...

Listen

Optimized Web Crawling from 2018-10-28T23:56:36

Got a fun optimization problem for you this week! It’s a two-for-one: how do you optimize the web crawling logic of an operation like Google search so that the results are, on average, as up-to-dat...

Listen

Searching for Datasets with Google from 2018-10-15T01:11:58

If you wanted to find a dataset of jokes, how would you do it? What about a dataset of podcast episodes? If your answer was “I’d try Google,” you might have been disappointed—Google is a great sear...

Listen

It's our fourth birthday from 2018-10-08T02:33:55

We started Linear Digressions 4 years ago… this isn’t a technical episode, just two buddies shooting the breeze about something we’ve somehow built together.

Listen

Gigantic Searches in Particle Physics from 2018-09-30T18:52:04

This week, we’re dusting off the ol’ particle physics PhD to bring you an episode about ambitious new model-agnostic searches for new particles happening at CERN. Traditionally, new particles have ...

Listen

Data Engineering from 2018-09-24T01:10:13

If you’re a data scientist, you know how important it is to keep your data orderly, clean, moving smoothly between different systems, well-documented… there’s a ton of work that goes into building ...

Listen

Text Analysis for Guessing the NYTimes Op-Ed Author from 2018-09-16T18:13:09

A very intriguing op-ed was published in the NY Times recently, in which the author (a senior official in the Trump White House) claimed to be a minor saboteur of sorts, acting with his or her coll...

Listen

The Three Types of Data Scientists, and What They Actually Do from 2018-09-09T19:00:09

If you've been in data science for more than a year or two, chances are you've noticed changes in the field as it's grown and matured. And if you're newer to the field, you may feel like there's a ...

Listen

Agile Development for Data Scientists, Part 2: Where Modifications Help from 2018-08-26T19:59:12

There's just too much interesting stuff at the intersection of agile software development and data science for us to be able to cover it all in one episode, so this week we're picking up where we l...

Listen

Agile Development for Data Scientists, Part 1: The Good from 2018-08-19T18:06:19

If you're a data scientist at a firm that does a lot of software building, chances are good that you've seen or heard engineers sometimes talking about "agile software development." If you don't wo...

Listen

Re - Release: How To Lose At Kaggle from 2018-08-13T02:31:51

We've got a classic for you this week as we take a week off for the dog days of summer. See you again next week! Competing in a machine learning competition on Kaggle is a kind of rite of passage ...

Listen

Troubling Trends In Machine Learning Scholarship from 2018-08-06T01:31:03

There's a lot of great machine learning papers coming out every day--and, if we're being honest, some papers that are not as great as we'd wish. In some ways this is symptomatic of a field that's g...

Listen

Can Fancy Running Shoes Cause You To Run Faster? from 2018-07-29T19:12:09

The stars aligned for me (Katie) this past weekend: I raced my first half-marathon in a long time and got to read a great article from the NY Times about a new running shoe that Nike claims can mak...

Listen

Compliance Bias from 2018-07-22T16:07:54

When you're using an AB test to understand the effect of a treatment, there are a lot of assumptions about how the treatment (and control, for that matter) get applied. For example, it's easy to th...

Listen

AI Winter from 2018-07-15T20:11:52

Artificial Intelligence has been widely lauded as a solution to almost any problem. But as we juxtapose the hype in the field against the real-world benefits we see, it raises the question: Are we ...

Listen

Rerelease: How to Find New Things to Learn from 2018-07-08T22:28:29

We like learning on vacation. And we're on vacation, so we thought we'd re-air this episode about how to learn. Original Episode: https://lineardigressions.com/episodes/2017/5/14/how-to-find-new-t...

Listen

Rerelease: Space Codes from 2018-07-02T04:36:56

We're on vacation on Mars, so we won't be communicating with you all directly this week. Though, if we wanted to, we could probably use this episode to help get started. Original Episode: http://l...

Listen

Rerelease: Anscombe's Quartet from 2018-06-25T01:20:25

We're on vacation, so we hope you enjoy this episode while we each sip cocktails on the beach. Original Episode: http://lineardigressions.com/episodes/2017/6/18/anscombes-quartet Original Summary: ...

Listen

Rerelease: Hurricanes Produced from 2018-06-18T17:00:14

Now that hurricane season is upon us again (and we are on vacation), we thought a look back on our hurricane forecasting episode was prudent. Stay safe out there.

Listen

GDPR from 2018-06-11T02:24:45

By now, you have probably heard of GDPR, the EU's new data privacy law. It's the reason you've been getting so many emails about everyone's updated privacy policy. In this episode, we talk about s...

Listen

Git for Data Scientists from 2018-06-03T17:52:23

If you're a data scientist, chances are good that you've heard of git, which is a system for version controlling code. Chances are also good that you're not quite as up on git as you want to be--gi...

Listen

Analytics Maturity from 2018-05-20T15:09:39

Data science and analytics are hot topics in business these days, but for a lot of folks looking to bring data into their organization, it can be hard to know where to start and what it looks like ...

Listen

SHAP: Shapley Values in Machine Learning from 2018-05-13T14:24:38

Shapley values in machine learning are an interesting and useful enough innovation that we figured hey, why not do a two-parter? Our last episode focused on explaining what Shapley values are: they...

Listen

Game Theory for Model Interpretability: Shapley Values from 2018-05-07T02:17:19

As machine learning models get into the hands of more and more users, there's an increasing expectation that black box isn't good enough: users want to understand why the model made a given predict...

Listen

AutoML from 2018-04-30T02:50:23

If you were a machine learning researcher or data scientist ten years ago, you might have spent a lot of time implementing individual algorithms like decision trees and neural networks by hand. If ...

Listen

CPUs, GPUs, TPUs: Hardware for Deep Learning from 2018-04-23T02:52:42

A huge part of the ascent of deep learning in the last few years is related to advances in computer hardware that make it possible to do the computational heavy lifting required to build models wi...

Listen

A Technical Introduction to Capsule Networks from 2018-04-16T01:12:25

Last episode we talked conceptually about capsule networks, the latest and greatest computer vision innovation to come out of Geoff Hinton's lab. This week we're getting a little more into the tech...

Listen

A Conceptual Introduction to Capsule Networks from 2018-04-09T01:59:54

Convolutional nets are great for image classification... if this were 2016. But it's 2018 and Canada's greatest neural networker Geoff Hinton has some new ideas, namely capsule networks. Capsule ne...

Listen

Convolutional Neural Nets from 2018-04-02T01:40:08

If you've done image recognition or computer vision tasks with a neural network, you've probably used a convolutional neural net. This episode is all about the architecture and implementation detai...

Listen

Google Flu Trends from 2018-03-26T01:20:41

It's been a nasty flu season this year. So we were remembering a story from a few years back (but not covered yet on this podcast) about when Google tried to predict flu outbreaks faster than the C...

Listen

How to pick projects for a professional data science team from 2018-03-19T03:07:33

This week's episode is for data scientists, sure, but also for data science managers and executives at companies with data science teams. These folks all think very differently about the same ques...

Listen

Autoencoders from 2018-03-12T01:47:10

Autoencoders are neural nets that are optimized for creating outputs that... look like the inputs to the network. Turns out this is a not-too-shabby way to do unsupervised machine learning with neu...

Listen

When Private Data Isn't Private Anymore from 2018-03-05T03:35:26

After all the back-patting around making data science datasets and code more openly available, we figured it was time to also dump a bucket of cold water on everyone's heads and talk about the thin...

Listen

What makes a machine learning algorithm "superhuman"? from 2018-02-26T04:52:57

A few weeks ago, we podcasted about a neural network that was being touted as "better than doctors" in diagnosing pneumonia from chest x-rays, and how the underlying dataset used to train the algor...

Listen

Open Data and Open Science from 2018-02-19T01:39:16

One interesting trend we've noted recently is the proliferation of papers, articles and blog posts about data science that don't just tell the result--they include data and code that allow anyone t...

Listen

Defining the quality of a machine learning production system from 2018-02-12T02:00:45

Building a machine learning system and maintaining it in production are two very different things. Some folks over at Google wrote a paper that shares their thoughts around all the items you might ...

Listen

Auto-generating websites with deep learning from 2018-02-04T23:02:11

We've already talked about neural nets in some detail (links below), and in particular we've been blown away by the way that image recognition from convolutional neural nets can be fed into recurre...

Listen

The Case for Learned Index Structures, Part 2: Hash Maps and Bloom Filters from 2018-01-29T02:15:43

Last week we started the story of how you could use a machine learning model in place of a data structure, and this week we wrap up with an exploration of Bloom Filters and Hash Maps. Just like las...

Listen

The Case for Learned Index Structures, Part 1: B-Trees from 2018-01-22T02:32:28

Jeff Dean and his collaborators at Google are turning the machine learning world upside down (again) with a recent paper about how machine learning models can be used as surprisingly effective subs...

Listen

Challenges with Using Machine Learning to Classify Chest X-Rays from 2018-01-15T01:57:21

Another installment in our "machine learning might not be a silver bullet for solving medical problems" series. This week, we have a high-profile blog post that has been making the rounds for the l...

Listen

The Fourier Transform from 2018-01-08T02:07:46

The Fourier transform is one of the handiest tools in signal processing for dealing with periodic time series data. Using a Fourier transform, you can break apart a complex periodic function into a...

Listen
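
As a rough illustration of the idea in the Fourier transform entry above, here is a minimal sketch of a naive discrete Fourier transform that recovers the component frequencies of a periodic signal. The test signal (a 3-cycle sine plus a weaker 5-cycle sine) and the function name `dft` are assumptions made for this example and do not come from the episode.

```python
import cmath
import math


def dft(samples):
    """Naive O(n^2) discrete Fourier transform of a list of real samples."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# Hypothetical signal: 3 cycles per window at amplitude 1.0,
# plus 5 cycles per window at amplitude 0.5, sampled at 32 points.
n = 32
signal = [
    math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 5 * t / n)
    for t in range(n)
]

spectrum = dft(signal)

# For a real sine of amplitude A, the matching DFT bin has magnitude A * n / 2,
# so rescaling by 2 / n recovers the amplitude of each frequency component.
amplitudes = [2 * abs(c) / n for c in spectrum[: n // 2]]

# The two strongest bins are the two frequencies we mixed together.
dominant = sorted(range(n // 2), key=lambda k: amplitudes[k], reverse=True)[:2]
print(dominant, [round(amplitudes[k], 3) for k in dominant])
```

In practice a fast Fourier transform would be used instead of this quadratic loop; the naive version is shown only because it mirrors the definition directly.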

Statistics of Beer from 2018-01-02T01:57:54

What better way to kick off a new year than with an episode on the statistics of brewing beer?

Listen

Re-Release: Random Kanye from 2017-12-24T19:07:48

We have a throwback episode for you today as we take the week off to enjoy the holidays. This week: what happens when you have a markov chain that generates mashup Kanye West lyrics with Bible vers...

Listen

Debiasing Word Embeddings from 2017-12-18T02:31:01

When we covered the Word2Vec algorithm for embedding words, we mentioned parenthetically that the word embeddings it produces can sometimes be a little bit less than ideal--in particular, gender bi...

Listen

The Kernel Trick and Support Vector Machines from 2017-12-11T01:58:41

Picking up after last week's episode about maximal margin classifiers, this week we'll go into the kernel trick and how that (combined with maximal margin algorithms) gives us the much-vaunted supp...

Listen

Maximal Margin Classifiers from 2017-12-04T04:03:02

Maximal margin classifiers are a way of thinking about supervised learning entirely in terms of the decision boundary between two classes, and defining that boundary in a way that maximizes the dis...

Listen

Re-Release: The Cocktail Party Problem from 2017-11-27T02:11:06

Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!

Listen

Clustering with DBSCAN from 2017-11-20T03:08:14

DBSCAN is a density-based clustering algorithm for doing unsupervised learning. It's pretty nifty: with just two parameters, you can specify "dense" regions in your data, and grow those regions ou...

Listen

The Kaggle Survey on Data Science from 2017-11-13T02:49:44

Want to know what's going on in data science these days?  There's no better way than to analyze a survey with over 16,000 responses that was recently released by Kaggle.  Kaggle asked practicing and as...

Listen

Machine Learning: The High Interest Credit Card of Technical Debt from 2017-11-06T04:35:17

This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you've worked in software before, you're probably familiar with the i...

Listen

Improving Upon a First-Draft Data Science Analysis from 2017-10-30T01:38:28

There are a lot of good resources out there for getting started with data science and machine learning, where you can walk through starting with a dataset and ending up with a model and set of pred...

Listen

Survey Raking from 2017-10-23T02:51:49

It's quite common for survey respondents not to be representative of the larger population from which they are drawn. But if you're a researcher, you need to study the larger population using data...

Listen

Happy Hacktoberfest from 2017-10-16T01:46:19

It's the middle of October, so you've already made two pull requests to open source repos, right? If you have no idea what we're talking about, spend the next 20 minutes or so with us talking about...

Listen

Re-Release: Kalman Runners from 2017-10-09T02:28:24

In honor of the Chicago marathon this weekend (and due in large part to Katie recovering from running in it...) we have a re-release of an episode about Kalman filters, which is part algorithm part...

Listen

Neural Net Dropout from 2017-10-02T03:32:56

Neural networks are complex models with many parameters and can be prone to overfitting. There's a surprisingly simple way to guard against this: randomly destroy connections between hidden units,...

Listen
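
The dropout idea in the entry above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the episode's own code: it uses the common "inverted dropout" formulation, which zeroes hidden-unit activations (rather than individual connections) and rescales the survivors so the expected activation is unchanged. The function name and parameters are hypothetical.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each activation with probability drop_prob.

    Surviving activations are divided by the keep probability so that the
    expected value of each unit matches what the network sees at test time.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At evaluation time every unit is kept and nothing is rescaled.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


# Hypothetical hidden layer with 1000 units, all emitting 1.0.
rng = random.Random(0)
hidden = [1.0] * 1000
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
print(sum(1 for a in dropped if a == 0.0), "units silenced out of", len(hidden))
```

Because a different random subset of units is silenced on every training step, no single unit can be relied on too heavily, which is the intuition behind dropout as a guard against overfitting.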

Disciplined Data Science from 2017-09-25T01:49:41

As data science matures as a field, it's becoming clearer what attributes a data science team needs to have to elevate their work to the next level. Most of our episodes are about the cool work be...

Listen

Hurricane Forecasting from 2017-09-18T01:37:15

It's been a busy hurricane season in the Southeastern United States, with millions of people making life-or-death decisions based on the forecasts around where the hurricanes will hit and with what...

Listen

Finding Spy Planes with Machine Learning from 2017-09-11T02:11:22

There are law enforcement surveillance aircraft circling over the United States every day, and in this episode, we'll talk about how some folks at BuzzFeed used public data and machine learning to ...

Listen

Data Provenance from 2017-09-04T01:35

Software engineers are familiar with the idea of versioning code, so you can go back later and revive a past state of the system. For data scientists who might want to reconstruct past models, tho...

Listen

Adversarial Examples from 2017-08-28T02:25:14

Even as we rely more and more on machine learning algorithms to help with everyday decision-making, we're learning more and more about how they're frighteningly easy to fool sometimes. Today we ha...

Listen

Jupyter Notebooks from 2017-08-21T01:09:32

This week's episode is just in time for JupyterCon in NYC, August 22-25... Jupyter notebooks are probably familiar to a lot of data nerds out there as a great open-source tool for exploring data, ...

Listen

Curing Cancer with Machine Learning is Super Hard from 2017-08-14T01:49:52

Today, a dispatch on what can go wrong when machine learning hype outpaces reality: a high-profile partnership between IBM Watson and MD Anderson Cancer Center has recently hit the rocks as it turn...

Listen

KL Divergence from 2017-08-07T03:07:15

Kullback Leibler divergence, or KL divergence, is a measure of information loss when you try to approximate one distribution with another distribution. It comes to us originally from information t...

Listen
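
To make the KL divergence entry above concrete, here is a small sketch that computes the divergence between two discrete distributions. The example distributions are invented for illustration; the formula is the standard definition, D(p || q) = sum over i of p_i * log2(p_i / q_i), measured in bits.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits for two discrete distributions."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with p_i = 0 contribute nothing by convention
        if q_i == 0.0:
            return math.inf  # q rules out an outcome that p says can happen
        total += p_i * math.log2(p_i / q_i)
    return total


# Hypothetical "true" distribution and a uniform approximation of it.
true_dist = [0.5, 0.25, 0.25]
approx = [1 / 3, 1 / 3, 1 / 3]

forward = kl_divergence(true_dist, approx)
reverse = kl_divergence(approx, true_dist)
print(round(forward, 4), round(reverse, 4))
```

The two printed numbers differ, which shows that KL divergence is not symmetric: the information lost when approximating p with q is generally not the same as when approximating q with p.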

Sabermetrics from 2017-07-31T01:15:37

It's moneyball time! SABR (the Society for American Baseball Research) is the world's largest organization of statistics-minded baseball enthusiasts, who are constantly applying the craft of scien...

Listen

What Data Scientists Can Learn from Software Engineers from 2017-07-24T01:52:26

We're back again with friend of the pod Walt, former software engineer extraordinaire and current data scientist extraordinaire, to talk about some best practices from software engineering that are...

Listen

Software Engineering to Data Science from 2017-07-17T02:36

Data scientists and software engineers often work side by side, building out and scaling technical products and services that are data-heavy but also require a lot of software engineering to build ...

Listen

Re-Release: Fighting Cholera with Data, 1854 from 2017-07-10T00:19:56

This episode was first released in November 2014. In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a co...

Listen

Re-Release: Data Mining Enron from 2017-07-02T17:53:42

This episode was first released in February 2015. In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other ma...

Listen

Factorization Machines from 2017-06-26T02:23:14

What do you get when you cross a support vector machine with matrix factorization? You get a factorization machine, and a darn fine algorithm for recommendation engines.

Listen

Anscombe's Quartet from 2017-06-19T02:19:56

Anscombe's Quartet is a set of four datasets that have the same mean, variance and correlation but look very different. It's easy to think that having a good set of summary statistics (like mean, ...

Listen

Page Rank from 2017-06-05T01:46:35

The year: 1998. The size of the web: 150 million pages. The problem: information retrieval. How do you find the "best" web pages to return in response to a query? A graduate student named Larry...

Listen

Fractional Dimensions from 2017-05-29T02:54:46

We chat about fractional dimensions, and what the actual heck those are.

Listen

Things You Learn When Building Models for Big Data from 2017-05-22T01:44:13

As more and more data gets collected seemingly every day, and data scientists use that data for modeling, the technical limits associated with machine learning on big datasets keep getting pushed b...

Listen

How to Find New Things to Learn from 2017-05-15T01:49:26

If you're anything like us, you a) always are curious to learn more about data science and machine learning and stuff, and b) are usually overwhelmed by how much content is out there (not all of it...

Listen

Federated Learning from 2017-05-08T01:50:40

As machine learning makes its way into more and more mobile devices, an interesting question presents itself: how can we have an algorithm learn from training data that's being supplied as users in...

Listen

Word2Vec from 2017-05-01T02:17:36

Word2Vec is probably the go-to algorithm for vectorizing text data these days. Which makes sense, because it is wicked cool. Word2Vec has it all: neural networks, skip-grams and bag-of-words impl...

Listen

Feature Processing for Text Analytics from 2017-04-24T02:17:24

It seems like every day there are more and more machine learning problems that involve learning on text data, but text itself makes for fairly lousy inputs to machine learning algorithms.  That's why...

Listen

Education Analytics from 2017-04-17T02:09:26

This week we'll hop into the rapidly developing industry around predictive analytics for education. For many of the students who eventually drop out, data science is showing that there might be ea...

Listen

A Technical Deep Dive on Stanley, the First Self-Driving Car from 2017-04-10T01:50:01

In our follow-up episode to last week's introduction to the first self-driving car, we will be doing a technical deep dive this week and talking about the most important systems for getting a car t...

Listen

An Introduction to Stanley, the First Self-Driving Car from 2017-04-03T01:34:17

In October 2005, 23 cars lined up in the desert for a 140 mile race. Not one of those cars had a driver. This was the DARPA grand challenge to see if anyone could build an autonomous vehicle capa...

Listen

Feature Importance from 2017-03-27T01:53:25

Figuring out which features actually matter in a model is harder than you might first guess.  When a human makes a decision, you can just ask them--why did you do that?  But with machi...

Listen

Space Codes! from 2017-03-20T02:50:57

It's hard to get information to and from Mars. Mars is very far away, and expensive to get to, and the bandwidth for passing messages with Earth is not huge. The messages you do pass have to trav...

Listen

Finding (and Studying) Wikipedia Trolls from 2017-03-13T01:44:55

You may be shocked to hear this, but sometimes, people on the internet can be mean. For some of us this is just a minor annoyance, but if you're a maintainer or contributor of a large project like...

Listen

A Sprint Through What's New in Neural Networks from 2017-03-06T03:27:12

Advances in neural networks are moving fast enough that, even though it seems like we talk about them all the time around here, it also always seems like we're barely keeping up. So this week we h...

Listen

Stein's Paradox from 2017-02-27T02:51:41

When you're estimating something about some object that's a member of a larger group of similar objects (say, the batting average of a baseball player, who belongs to a baseball team), how should y...

Listen

Empirical Bayes from 2017-02-20T03:30:06

Say you're looking to use some Bayesian methods to estimate parameters of a system. You've got the normalization figured out, and the likelihood, but the prior... what should you use for a prior? ...

Listen

Endogenous Variables and Measuring Protest Effectiveness from 2017-02-13T03:31

Have you been out protesting lately, or watching the protests, and wondered how much effect they might have on lawmakers? It's a tricky question to answer, since usually we need randomly distribut...

Listen

Calibrated Models from 2017-02-06T01:56:12

Remember last week, when we were talking about how great the ROC curve is for evaluating models? How things change... This week, we're exploring calibrated risk models, because that's a kind of m...

Listen

Rock the ROC Curve from 2017-01-30T03:38:46

This week: everybody's favorite WWII-era classifier metric! But it's not just for winning wars, it's a fantastic go-to metric for all your classifier quality needs.

Listen

Ensemble Algorithms from 2017-01-23T02:31:26

If one machine learning model is good, are two models better? In a lot of cases, the answer is yes. If you build many ok models, and then bring them all together and use them in combination to ma...

Listen

How to evaluate a translation: BLEU scores from 2017-01-16T01:59:01

As anyone who's encountered a badly translated text could tell you, not all translations are created equal. Some translations are smooth, fluent and sound like a poet wrote them; some are jerky, n...

Listen

Zero Shot Translation from 2017-01-09T03:20:57

Take Google-size data, the flexibility of a neural net, and all (well, most) of the languages of the world, and what you end up with is a pile of surprises. This episode is about some interesting ...

Listen

Google Neural Machine Translation from 2017-01-02T01:44:23

Recently, Google swapped out the backend for Google Translate, moving from a statistical phrase-based method to a recurrent neural network. This marks a big change in methodology: the tried-and-tr...

Listen

Data and the Future of Medicine: Interview with Precision Medicine Initiative researcher Matt Might from 2016-12-26T01:19:24

Today we are delighted to bring you an interview with Matt Might, computer scientist and medical researcher extraordinaire and architect of President Obama's Precision Medicine Initiative. As the ...

Listen

Special Crossover Episode: Partially Derivative interview with White House Data Scientist DJ Patil from 2016-12-18T17:53:52

We have the pleasure of bringing you a very special crossover episode this week: our friends at Partially Derivative (another great podcast about data science, you should check it out) recently int...

Listen

How to Lose at Kaggle from 2016-12-12T04:28:57

Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists. Losing unexpectedly at the very end of the contest is also something that a lot of us have e...

Listen

Attacking Discrimination in Machine Learning from 2016-12-05T03:38:54

Imagine there's an important decision to be made about someone, like a bank deciding whether to extend a loan, or a school deciding to admit a student--unfortunately, we're all too aware that discr...

Listen

Recurrent Neural Nets from 2016-11-28T02:47:44

This week, we're doing a crash course in recurrent neural networks--what the structural pieces are that make a neural net recurrent, how that structure helps RNNs solve certain time series problems...

Listen

Stealing a PIN with signal processing and machine learning from 2016-11-21T02:32:21

Want another reason to be paranoid when using the free coffee shop wifi? Allow us to introduce WindTalker, a system that cleverly combines a dose of signal processing with a dash of machine learni...

Listen

Neural Net Cryptography from 2016-11-14T04:06:57

Cryptography used to be the domain of information theorists and spies. There's a new player now: neural networks. Given the task of communicating securely, neural networks are inventing new encry...

Listen

Deep Blue from 2016-11-07T04:20:48

In 1997, Deep Blue was the IBM algorithm/computer that did what no one, at the time, thought possible: it beat the world's best chess player.  It turns out, though, that one of the most important mo...

Listen

Organizing Google's Datasets from 2016-10-31T02:17:26

If you're a data scientist, there's a good chance you're used to working with a lot of data. But there's a lot of data, and then there's Google-scale amounts of data. Keeping all that data organi...

Listen

Fighting Cancer with Data Science: Followup from 2016-10-24T01:58:41

A few months ago, Katie started on a project for the Vice President's Cancer Moonshot surrounding how data can be used to better fight cancer. The project is all wrapped up now, so we wanted to te...

Listen

The 19-year-old determining the US election from 2016-10-17T01:01:23

Sick of the presidential election yet? We are too, but there's still almost a month to go, so let's just embrace it together. This week, we'll talk about one of the presidential polls, which has ...

Listen

How to Steal a Model from 2016-10-09T22:57:24

What does it mean to steal a model? It means someone (the thief, presumably) can re-create the predictions of the model without having access to the algorithm itself, or the training data. Sound ...

Listen

Regularization from 2016-10-03T02:13:50

Lots of data is usually seen as a good thing. And it is a good thing--except when it's not. In a lot of fields, a problem arises when you have many, many features, especially if there's a somewha...

Listen

The Cold Start Problem from 2016-09-26T02:24:38

You might sometimes find that it's hard to get started doing something, but once you're going, it gets easier. Turns out machine learning algorithms, and especially recommendation engines, feel th...

Listen

Open Source Software for Data Science from 2016-09-19T04:27:40

If you work in tech, software or data science, there's an excellent chance you use tools that are built upon open source software. This is software that's built and distributed not for a profit, b...

Listen

Scikit + Optimization = Scikit-Optimize from 2016-09-12T01:54:59

We're excited to welcome a guest, Tim Head, who is one of the maintainers of the scikit-optimize package. With all the talk about optimization lately, it felt appropriate to get in a few words wit...

Listen

Two Cultures: Machine Learning and Statistics from 2016-09-05T01:50:05

It's a funny thing to realize, but data science modeling is usually about either explainability, interpretation and understanding, or it's about predictive accuracy. But usually not both--optimizi...

Listen

Optimization Solutions from 2016-08-29T02:01:42

You've got an optimization problem to solve, and a less-than-forever amount of time in which to solve it. What do? Use a heuristic optimization algorithm, like a hill climber or simulated anneali...

Listen

Optimization Problems from 2016-08-22T00:25:56

If modeling is about predicting the unknown, optimization tries to answer the question of what to do, what decision to make, to get the best results out of a given situation. Sometimes that's stra...

Listen

Multi-level modeling for understanding DEADLY RADIOACTIVE GAS from 2016-08-15T01:49:42

Ok, this episode is only sort of about DEADLY RADIOACTIVE GAS. It's mostly about multilevel modeling, which is a way of building models with data that has distinct, related subgroups within it. W...

Listen

How Polls Got Brexit "Wrong" from 2016-08-08T01:37:18

Continuing the discussion of how polls do (and sometimes don't) tell us what to expect in upcoming elections--let's take a concrete example from the recent past, shall we? The Brexit referendum wa...

Listen

Election Forecasting from 2016-08-01T02:40:35

Not sure if you heard, but there's an election going on right now.  Polls, surveys, and projections abound, as far as the eye can see.  How to make sense of it all?  How are the projections made?  W...

Listen

Machine Learning for Genomics from 2016-07-25T02:14:47

Genomics data is some of the biggest #bigdata, and doing machine learning on it is unlocking new ways of thinking about evolution, genomic diseases like cancer, and what really makes each of us dif...

Listen

Climate Modeling from 2016-07-18T02:26:02

Hot enough for you? Climate models suggest that it's only going to get warmer in the coming years. This episode unpacks those models, so you understand how they work. A lot of the episodes we ...

Listen

Reinforcement Learning Gone Wrong from 2016-07-11T02:42:49

Last week’s episode on artificial intelligence gets a huge payoff this week—we’ll explore a wonderful couple of papers about all the ways that artificial intelligence can go wrong. Malevolent acto...

Listen

Reinforcement Learning for Artificial Intelligence from 2016-07-03T18:28:57

There’s a ton of excitement about reinforcement learning, a form of semi-supervised machine learning that underpins a lot of today’s cutting-edge artificial intelligence algorithms. Here’s a crash...

Listen

Differential Privacy: how to study people without being weird and gross from 2016-06-27T01:53:13

Apple wants to study iPhone users' activities and use it to improve performance. Google collects data on what people are doing online to try to improve their Chrome browser. Do you like the idea ...

Listen

How the sausage gets made from 2016-06-20T02:25:23

Something a little different in this episode--we'll be talking about the technical plumbing that gets our podcast from our brains to your ears. As it turns out, it's a multi-step bucket brigade pr...

Listen

SMOTE: makin' yourself some fake minority data from 2016-06-13T03:06:33

Machine learning on imbalanced classes: surprisingly tricky. Many (most?) algorithms tend to just assign the majority class label to all the data and call it a day. SMOTE is an algorithm for manu...

Listen
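
The core move in SMOTE, as described in the entry above, is to create synthetic minority-class points by interpolating between an existing minority point and one of its nearest minority neighbors. The sketch below is a simplified illustration under that assumption; the function name, parameters, and sample points are hypothetical, and real implementations handle more edge cases.

```python
import math
import random


def smote(minority, n_synthetic, k, rng):
    """Generate synthetic minority samples by interpolating toward neighbors."""
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority point as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest neighbors among the other minority points.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new point somewhere on the segment between the two.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (nb - b) for b, nb in zip(base, neighbor)))
    return synthetic


# Hypothetical minority class with only four labeled examples.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority_points, n_synthetic=6, k=2, rng=random.Random(42))
for point in new_points:
    print(tuple(round(c, 3) for c in point))
```

Because every synthetic point lies on a line segment between two real minority points, the new data stays inside the region the minority class already occupies instead of being an exact duplicate of any one example.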

Conjoint Analysis: like AB testing, but on steroids from 2016-06-06T02:13:37

Conjoint analysis is like AB testing, but more bigger more better: instead of testing one or two things, you can test potentially dozens of options.  Where might you use something like this?  Well, ...

Listen

Traffic Metering Algorithms from 2016-05-30T01:57:10

This episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so t...

Listen

Um Detector 2: The Dynamic Time Warp from 2016-05-23T02:05:01

One tricky thing about working with time series data, like the audio data in our "um" detector (remember that? because we barely do...), is that sometimes events look really similar but one is a l...

Listen

Inside a Data Analysis: Fraud Hunting at Enron from 2016-05-16T02:36:10

It's storytime this week--the story, from beginning to end, of how Katie designed and built the main project for Udacity's Intro to Machine Learning class, when she was developing the course. The ...

Listen

What's the biggest #bigdata? from 2016-05-09T01:28:21

Data science is often mentioned in the same breath as big data.  But how big is big data?  And who has the biggest big data?  CERN?  YouTube?  ... Something (or someone) else?  Relevant link:  h...

Listen

Data Contamination from 2016-05-02T02:24:06

Supervised machine learning assumes that the features and labels used for building a classifier are isolated from each other--basically, that you can't cheat by peeking. Turns out this can be easi...

Listen

Model Interpretation (and Trust Issues) from 2016-04-25T00:45:04

Machine learning algorithms can be black boxes--inputs go in, outputs come out, and what happens in the middle is anybody's guess. But understanding how a model arrives at an answer is critical fo...

Listen

Updates! Political Science Fraud and AlphaGo from 2016-04-18T02:48:04

We've got updates for you about topics from past shows!  First, the political science scandal of the year 2015 has a new chapter; we'll remind you about the original story and then dive into what h...

Listen

Ecological Inference and Simpson's Paradox from 2016-04-11T02:43:12

Simpson's paradox is the data science equivalent of looking through one eye and seeing a very clear trend, and then looking through the other eye and seeing the very clear opposite trend. In one c...

Listen

Discriminatory Algorithms from 2016-04-04T02:30:09

Sometimes when we say an algorithm discriminates, we mean it can tell the difference between two types of items. But in this episode, we'll talk about another, more troublesome side to discriminat...

Listen

Recommendation Engines and Privacy from 2016-03-28T02:46:45

This episode started out as a discussion of recommendation engines, like Netflix uses to suggest movies. There's still a lot of that in here. But a related topic, which is both interesting and im...

Listen

Neural nets play cops and robbers (AKA generative adversarial networks) from 2016-03-21T02:58:49

One neural net is creating counterfeit bills and passing them off to a second neural net, which is trying to distinguish the real money from the fakes. Result: two neural nets that are better than...

Listen

A Data Scientist's View of the Fight against Cancer from 2016-03-14T03:26:27

In this episode, we're taking many episodes' worth of insights and unpacking an extremely complex and important question--in what ways are we winning the fight against cancer, where might that figh...

Listen

Congress Bots and DeepDrumpf from 2016-03-11T04:17:29

Hey, sick of the election yet? Fear not, there are algorithms that can automagically generate political-ish speech so that we never need to be without an endless supply of Congressional speeches a...

Listen

Multi-Armed Bandits from 2016-03-07T02:44:17

Multi-armed bandits: how to take your randomized experiment and make it harder better faster stronger. Basically, a multi-armed bandit experiment allows you to optimize for both learning and makin...

Listen

Experiments and Messy, Tricky Causality from 2016-03-04T03:54:04

"People with a family history of heart disease are more likely to eat healthy foods, and have a high incidence of heart attacks." Did the healthy food cause the heart attacks? Probably not. But ...

Listen

Backpropagation from 2016-02-29T03:58:10

The reason that neural nets are taking over the world right now is that they can be efficiently trained with the backpropagation algorithm.  In short, backprop allows you to adjust the weights o...

Listen

Text Analysis on the State Of The Union from 2016-02-26T03:51:42

First up in this episode: a crash course in natural language processing, and important steps if you want to use machine learning techniques on text data. Then we'll take that NLP know-how and talk...

Listen

Paradigms in Artificial Intelligence from 2016-02-22T04:32:25

Artificial intelligence includes a number of different strategies for how to make machines more intelligent, and often more human-like, in their ability to learn and solve problems. An ambitious g...

Listen

Survival Analysis from 2016-02-19T03:44:06

Survival analysis is all about studying how long until an event occurs--it's used in marketing to study how long a customer stays with a service, in epidemiology to estimate the duration of surviva...

Listen

Gravitational Waves from 2016-02-15T02:46:22

All aboard the gravitational waves bandwagon--with the first direct observation of gravitational waves announced this week, Katie's dusting off her physics PhD for a very special gravity-related ep...

Listen

The Turing Test from 2016-02-12T04:11:23

Let's imagine a future in which a truly intelligent computer program exists. How would it convince us (humanity) that it was intelligent? Alan Turing's answer to this question, proposed over 60 y...

Listen

Item Response Theory: how smart ARE you? from 2016-02-08T03:37:58

Psychometrics is all about measuring the psychological characteristics of people; for example, scholastic aptitude. How is this done? Tests, of course! But there's a chicken-and-egg problem here...

Listen

Go! from 2016-02-05T04:52:36

As you may have heard, a computer beat a world-class human player in Go last week. As recently as a year ago the prediction was that it would take a decade to get to this point, yet here we are, i...

Listen

Great Social Networks in History from 2016-02-01T04:22:02

The Medici were one of the great ruling families of Europe during the Renaissance.  How did they come to rule?  Not through power, or money, or armies, but through the strength of their social network.  An...

Listen

How Much to Pay a Spy (and a lil' more auctions) from 2016-01-29T05:36:33

A few small encores on auction theory, and then--how can you value a piece of information before you know what it is? Decision theory has some pointers. Some highly relevant information if you ar...

Listen

Sold! Auctions (Part 2) from 2016-01-25T02:58:07

The Google ads auction is a special kind of auction, one you might not know as well as the famous English auction (which we talked about in the last episode). But if it's what Google uses to sell ...

Listen

Going Once, Going Twice: Auctions (Part 1) from 2016-01-22T03:40:24

The Google AdWords algorithm is (famously) an auction system for allocating a massive amount of online ad space in real time--with that fascinating use case in mind, this episode is part one in a t...

Listen

Chernoff Faces and Minard Maps from 2016-01-18T03:38:33

A data visualization extravaganza in this episode, as we discuss Chernoff faces (you: "faces? huh?" us: "oh just you wait") and the greatest data visualization of all time, or at least the Napoleon...

Listen

t-SNE: Reduce Your Dimensions, Keep Your Clusters from 2016-01-15T04:05:49

Ever tried to visualize a cluster of data points in 40 dimensions? Or even 4, for that matter? We prefer to stick to 2, or maybe 3 if we're feeling well-caffeinated. The t-SNE algorithm is one o...

Listen

The [Expletive Deleted] Problem from 2016-01-11T04:23:53

The town of [expletive deleted], England, is responsible for the clbuttic [expletive deleted] problem.  This week on Linear Digressions: we try really hard not to swear too much.  Related links: http...

Listen

Unlabeled Supervised Learning--whaaa? from 2016-01-08T03:26:56

In order to do supervised learning, you need a labeled training dataset.  Or do you...?  Relevant links: http://www.cs.columbia.edu/~dplewis/candidacy/goldman00enhancing.pdf

Listen

Hacking Neural Nets from 2016-01-05T02:56:18

Machine learning: it can be fooled, just like you or me.  Here's one of our favorite examples, a study into hacking neural networks.  Relevant links: http://arxiv.org/pdf/1412.1897v4.pdf

Listen

Zipf's Law from 2015-12-31T18:08:17

Zipf's law is related to the statistics of how word usage is distributed. As it turns out, this is also strikingly reminiscent of how income is distributed, and populations of cities, and bug repo...

Listen

Indie Announcement from 2015-12-30T15:57:02

We've gone indie! Which shouldn't change anything about the podcast that you know and love, but we're super excited to keep bringing you Linear Digressions as a fully independent podcast. Some li...

Listen

Portrait Beauty from 2015-12-27T13:34:44

It's Da Vinci meets Skynet: what makes a portrait beautiful, according to a machine learning algorithm. Snap a selfie and give us a listen.

Listen

The Cocktail Party Problem from 2015-12-18T00:17:31

Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!

Listen

A Criminally Short Introduction to Semi Supervised Learning from 2015-12-04T03:13:55

Because there are more interesting problems than there are labeled datasets, semi-supervised learning provides a framework for getting feedback from the environment as a proxy for labels of what's ...

Listen

Thresholdout: Down with Overfitting from 2015-11-27T17:55:04

Overfitting to your training data can be avoided by evaluating your machine learning algorithm on a holdout test dataset, but what about overfitting to the test data? Turns out it can be done, eas...

Listen

The State of Data Science from 2015-11-10T04:36:40

How many data scientists are there, where do they live, where do they work, what kind of tools do they use, and how do they describe themselves? RJMetrics wanted to know the answers to these quest...

Listen

Data Science for Making the World a Better Place from 2015-11-06T03:43:25

There's a good chance that great data science is going on close to you, and that it's going toward making your city, state, country, and planet a better place. Not all the data science questions b...

Listen

Kalman Runners from 2015-10-29T03:10:02

The Kalman Filter is an algorithm for taking noisy measurements of dynamic systems and using them to get a better idea of the underlying dynamics than you could get from a simple extrapolation. If...

Listen

Neural Net Inception from 2015-10-23T02:25:48

When you sleep, the neural pathways in your brain take the "white noise" of your resting brain, mix in your experiences and imagination, and the result is dreams (that is a highly unscientific expl...

Listen

Benford's Law from 2015-10-16T03:30:43

Sometimes numbers are... weird. Benford's Law is a favorite example of this for us--it's a law that governs the distribution of the first digit in certain types of numbers. As it turns out, if yo...

Listen

Guinness from 2015-10-07T03:30:33

Not to oversell it, but Student's t-test has got to have the most interesting history of any statistical test. Which is saying a lot, right? Add some boozy statistical trivia to your arsenal ...

Listen

PFun with P Values from 2015-09-02T03:24:36

Doing some science, and want to know if you might have found something? Or maybe you've just accomplished the scientific equivalent of going fishing and reeling in an old boot? Frequentist p-valu...

Listen

Watson from 2015-08-25T02:26:20

This machine learning algorithm beat the human champions at Jeopardy. What is... Watson?

Listen

Bayesian Psychics from 2015-08-18T00:05:04

Come get a little "out there" with us this week, as we use a meta-study of extrasensory perception (or ESP, often used in the same sentence as "psychics") to chat about Bayesian vs. frequentist sta...

Listen

Troll Detection from 2015-08-07T20:56:36

Ever found yourself wasting time reading online comments from trolls? Of course you have; we've all been there (it's 4 AM but I can't turn off the computer and go to sleep--someone on the internet...

Listen

Yiddish Translation from 2015-08-03T03:06:39

Imagine a language that is mostly spoken rather than written, contains many words in other languages, and has relatively little written overlap with English. Now imagine writing a machine-learning...

Listen

Modeling Particles in Atomic Bombs from 2015-07-06T23:30:15

In a fun historical journey, Katie and Ben explore the history of the Manhattan Project, discuss the difficulties in modeling particle movement in atomic bombs with only punch-card computers and in...

Listen

Random Number Generation from 2015-06-19T18:49:55

Let's talk about randomness! Although randomness is pervasive throughout the natural world, it's surprisingly difficult to generate random numbers. And even if your numbers look random (but actuall...

Listen

Electoral Insights (Part 2) from 2015-06-09T02:46:17

Following up on our last episode about how experiments can be performed in political science, now we explore a high-profile case of an experiment gone wrong. An extremely high-profile paper that ...

Listen

Electoral Insights (Part 1) from 2015-06-05T20:38

The first of our two-parter discussing the recent electoral data fraud case. The results of the study in question were covered widely, including by This American Life (who later had to issue a retr...

Listen

Falsifying Data from 2015-06-01T21:04:10

In the first of a few episodes on fraud in election research, we’ll take a look at a case study from a previous Presidential election, where polling results were faked. What are some telltale si...

Listen

Reporter Bot from 2015-05-20T23:16:18

There’s a big difference between a table of numbers or statistics, and the underlying story that a human might tell about how those numbers were generated. Think about a baseball game—the game st...

Listen

Careers in Data Science from 2015-05-16T05:43:44

Let’s talk money. As a “hot” career right now, data science can pay pretty well. But for an individual person matched with a specific job or industry, how much should someone expect to make? Sinc...

Listen

That's "Dr Katie" to You from 2015-05-14T17:37:48

Katie successfully defended her thesis! We celebrate her return, and talk a bit about what getting a PhD in Physics is like.

Listen

Neural Nets (Part 2) from 2015-05-11T14:37:51

In the last episode, we zipped through neural nets and got a quick idea of how they work and why they can be so powerful. Here’s the real payoff of that work: In this episode, we’ll talk about a b...

Listen

Neural Nets (Part 1) from 2015-05-01T18:59:28

There is no known learning algorithm that is more flexible and powerful than the human brain. That's quite inspirational, if you think about it--to level up machine learning, maybe we should be goi...

Listen

Inferring Authorship (Part 2) from 2015-04-28T16:56:24

Now that we’re up to speed on the classic author ID problem (who wrote the unsigned Federalist Papers?), we move onto a couple more contemporary examples. First, J.K. Rowling was famously outed u...

Listen

Inferring Authorship (Part 1) from 2015-04-16T17:25:21

This episode is inspired by one of our projects for Intro to Machine Learning: given a writing sample, can you use machine learning to identify who wrote it? Turns out that the answer is yes, a per...

Listen

Statistical Mistakes and the Challenger Disaster from 2015-04-06T19:36:56

After the Challenger exploded in 1986, killing all 7 astronauts aboard, an investigation into the cause was immediately launched. In the cold temperatures the night before the launch, the o-rings...

Listen

Genetics and Um Detection (HMM Part 2) from 2015-03-25T17:29:32

In part two of our series on Hidden Markov Models (HMMs), we talk to Katie and special guest Francesco about more useful and novel applications of HMMs. We revisit Katie's "Um Detector," and hear a...

Listen

Introducing Hidden Markov Models (HMM Part 1) from 2015-03-24T15:57:03

Wikipedia says, "A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states." What does that even ...

Listen

Monte Carlo For Physicists from 2015-03-12T23:18:01

This is another physics-centered podcast, about an ML-backed particle identification tool that we use to figure out what kind of particle caused a particular blob in the detector. But in this case,...

Listen

Random Kanye from 2015-03-04T23:04:45

Ever feel like you could randomly assemble words from a certain vocabulary and make semi-coherent Kanye West lyrics? Or technical documentation, imitations of local newscasters, your politically ou...

Listen

Lie Detectors from 2015-02-25T18:20:51

Often machine learning discussions center around algorithms, or features, or datasets--this one centers around interpretation, and ethics. Suppose you could use a technology like fMRI to see what...

Listen

The Enron Dataset from 2015-02-09T00:00

In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other markets. By 2002, it was apparent that many bad app...

Listen

Labels and Where To Find Them from 2015-02-04T02:30:47

Supervised classification is built on the backs of labeled datasets, but a good set of labels can be hard to find. Great data is everywhere, but the corresponding labels can sometimes be really tr...

Listen

Um Detector 1 from 2015-01-23T20:16:12

So, um... what about machine learning for audio applications? In the course of starting this podcast, we've edited out a lot of "um"'s from our raw audio files. It's gotten now to the point that,...

Listen

Better Facial Recognition with Fisherfaces from 2015-01-07T01:33:50

Now that we know about eigenfaces (if you don't, listen to the previous episode), let's talk about how it breaks down. Variations that are trivial to humans when identifying faces can really mess...

Listen

Facial Recognition with Eigenfaces from 2015-01-07T01:30:40

A true classic topic in ML: Facial recognition is very high-dimensional, meaning that each picture can have millions of pixels, each of which can be a single feature. It's computationally expensive...

Listen

Stats of World Series Streaks from 2014-12-17T00:41:39

Baseball is characterized by a high level of equality between teams; even the best teams might only have 55% win percentages (contrast this with college football, where teams go undefeated pretty r...

Listen

Computers Try to Tell Jokes from 2014-11-26T18:59:56

Computers are capable of many impressive feats, but making you laugh is usually not one of them. Or could it be? This episode will talk about a custom-built machine learning algorithm that searches...

Listen

How Outliers Helped Defeat Cholera from 2014-11-22T00:00

In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a common but deadly disease: cholera. When a cholera...

Listen

Hunting for the Higgs from 2014-11-16T00:00

Machine learning and particle physics go together like peanut butter and jelly--but this is a relatively new development. For many decades, physicists looked through their fairly large datasets ...

Listen

Podcasts by Linear Digressions

All episodes

So long, and thanks for all the fish from 2020-07-26T23:32:44

A Reality Check on AI-Driven Medical Assistants from 2020-07-19T23:51:31

A Data Science Take on Open Policing Data from 2020-07-13T02:02:39

The Data Science Open Source Ecosystem from 2020-06-29T02:34:48

Criminology and Data Science from 2020-06-15T01:26:26

Racism, the criminal justice system, and data science from 2020-06-07T23:33:53

An interstitial word from Ben from 2020-06-05T01:38:43

Convolutional Neural Networks from 2020-05-31T21:46:31

Protecting Individual-Level Census Data with Differential Privacy from 2020-05-18T01:49:22

Causal Trees from 2020-05-11T01:34:33

The Grammar Of Graphics from 2020-05-04T01:12:53

Gaussian Processes from 2020-04-27T01:33:43

Keeping ourselves honest when we work with observational healthcare data from 2020-04-20T02:43:37

Changing our formulation of AI to avoid runaway risks: Interview with Prof. Stuart Russell from 2020-04-13T01:55:01

Putting machine learning into a database from 2020-04-06T01:51:56

The work-from-home episode from 2020-03-29T22:23:42

Understanding Covid-19 transmission: what the data suggests about how the disease spreads from 2020-03-23T01:03:34

Network effects re-release: when the power of a public health measure lies in widespread adoption from 2020-03-15T22:43:38

Causal inference when you can't experiment: difference-in-differences and synthetic controls from 2020-03-09T01:39:19

Better know a distribution: the Poisson distribution from 2020-03-02T02:55:28

The Lottery Ticket Hypothesis from 2020-02-23T23:03:25

Interesting technical issues prompted by GDPR and data privacy concerns from 2020-02-17T01:50:20

Thinking of data science initiatives as innovation initiatives from 2020-02-10T01:10:21

Building a curriculum for educating data scientists: Interview with Prof. Xiao-Li Meng from 2020-02-02T23:36:23

Running experiments when there are network effects from 2020-01-27T00:13:52

Zeroing in on what makes adversarial examples possible from 2020-01-20T02:41:20

Unsupervised Dimensionality Reduction: UMAP vs t-SNE from 2020-01-13T00:53:19

Data scientists: beware of simple metrics from 2020-01-05T22:54:57

Communicating data science, from academia to industry from 2019-12-30T01:53:14

Optimizing for the short-term vs. the long-term from 2019-12-23T02:50:53

Interview with Prof. Andrew Lo, on using data science to inform complex business decisions from 2019-12-16T03:15:09

Using machine learning to predict drug approvals from 2019-12-08T22:56:05

Facial recognition, society, and the law from 2019-12-02T03:14:14

Lessons learned from doing data science, at scale, in industry from 2019-11-25T00:45:42

Varsity A/B Testing from 2019-11-18T02:09:46

The Care and Feeding of Data Scientists: Growing Careers from 2019-11-11T03:44:18

The Care and Feeding of Data Scientists: Recruiting and Hiring Data Scientists from 2019-11-04T00:21:56

The Care and Feeding of Data Scientists: Becoming a Data Science Manager from 2019-10-28T01:27:58

Procella: YouTube's super-system for analytics data storage from 2019-10-21T01:27:45

What's *really* so hard about feature engineering? from 2019-10-06T22:37:49

Data storage for analytics: stars and snowflakes from 2019-09-30T11:22:15

Data storage: transactions vs. analytics from 2019-09-23T01:49:59

GROVER: an algorithm for making, and detecting, fake news from 2019-09-16T03:21:34

Data science teams as innovation initiatives from 2019-09-09T02:24:55

Organizational Models for Data Scientists from 2019-08-25T23:06:52

Data Shapley from 2019-08-19T02:38:16

Putting the "science" in data science: the scientific method, the null hypothesis, and p-hacking from 2019-07-29T01:30:54

Interleaving from 2019-07-22T12:20:58

Deepfakes from 2019-07-01T01:25:07

Revisiting Biased Word Embeddings from 2019-06-24T00:26:07

Attention in Neural Nets from 2019-06-17T00:28:35

Interview with Joel Grus from 2019-06-10T02:05:47

Re - Release: Factorization Machines from 2019-06-03T01:32:39

Re-release: Auto-generating websites with deep learning from 2019-05-27T02:01:11

Advice to those trying to get a first job in data science from 2019-05-19T21:50:13

Re - Release: Machine Learning Technical Debt from 2019-05-12T23:07:14

Estimating Software Projects, and Why It's Hard from 2019-05-05T22:27:24

The Black Hole Algorithm from 2019-04-29T00:55:57

Structure in AI from 2019-04-21T22:29:02

The Great Data Science Specialist vs. Generalist Debate from 2019-04-15T00:55:41

Google X, and Taking Risks the Smart Way from 2019-04-08T01:10:57

Statistical Significance in Hypothesis Testing from 2019-04-01T01:34:53

The Language Model Too Dangerous to Release from 2019-03-25T01:39:45

The cathedral and the bazaar from 2019-03-17T22:47:01

AlphaStar from 2019-03-11T01:18:26

Are machine learning engineers the new data scientists? from 2019-03-04T02:57:19

Interview with Alex Radovic, particle physicist turned machine learning researcher from 2019-02-25T01:59:03

K Nearest Neighbors from 2019-02-17T23:57:23

Not every deep learning paper is great. Is that a problem? from 2019-02-11T00:06:33

The Assumptions of Ordinary Least Squares from 2019-02-03T23:24:15

Quantile Regression from 2019-01-28T01:27:40

Heterogeneous Treatment Effects from 2019-01-20T23:57:56

Pre-training language models for natural language processing problems from 2019-01-14T00:42:31

Re-release: Word2Vec from 2018-12-31T01:56:03

Re - Release: The Cold Start Problem from 2018-12-23T20:23:33

Convex (and non-convex) Optimization from 2018-12-17T03:06:42

The Normal Distribution and the Central Limit Theorem from 2018-12-09T18:58:28

Software 2.0 from 2018-12-02T23:23:05
