Podcasts by Linear Digressions

Linear Digressions

Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.

Further podcasts by Ben Jaffe and Katie Malone

Podcast on the topic Technology

All episodes

So long, and thanks for all the fish from 2020-07-26T23:32:44

All good things must come to an end, including this podcast. This is the last episode we plan to release, and it doesn’t cover data science—it’s mostly reminiscing, thanking our wonderful audience ...

A Reality Check on AI-Driven Medical Assistants from 2020-07-19T23:51:31

The data science and artificial intelligence community has made amazing strides in the past few years to algorithmically automate portions of the healthcare process. This episode looks at two compu...

A Data Science Take on Open Policing Data from 2020-07-13T02:02:39

A few weeks ago, we put out a call for data scientists interested in issues of race and racism, or people studying how those topics can be studied with data science methods, to get in touch to ...

The Data Science Open Source Ecosystem from 2020-06-29T02:34:48

Open source software is ubiquitous throughout data science, and enables the work of nearly every data scientist in some way or another. Open source projects, however, are disproportionately maintai...

Criminology and Data Science from 2020-06-15T01:26:26

This episode features Zach Drake, a working data scientist and PhD candidate in the Criminology, Law and Society program at George Mason University. Zach specializes in bringing data science method...

Racism, the criminal justice system, and data science from 2020-06-07T23:33:53

As protests sweep across the United States in the wake of the killing of George Floyd by a Minneapolis police officer, we take a moment to dig into one of the ways that data science perpetuates and...

An interstitial word from Ben from 2020-06-05T01:38:43

A message from Ben around algorithmic bias, and how our models are sometimes reflections of ourselves.

Convolutional Neural Networks from 2020-05-31T21:46:31

This is a re-release of an episode that originally aired on April 1, 2018. If you've done image recognition or computer vision tasks with a neural network, you've probably used a convolutional neur...

Protecting Individual-Level Census Data with Differential Privacy from 2020-05-18T01:49:22

The power of finely-grained, individual-level data comes with a drawback: it compromises the privacy of potentially anyone and everyone in the dataset. Even for de-identified datasets, there can be...

Causal Trees from 2020-05-11T01:34:33

What do you get when you combine the causal inference needs of econometrics with the data-driven methodology of machine learning? Usually these two don’t go well together (deriving causal conclusio...

The Grammar Of Graphics from 2020-05-04T01:12:53

You may not realize it consciously, but beautiful visualizations have rules. The rules are often implicit and manifest themselves as expectations about how the data is summarized, presented, and ann...

Gaussian Processes from 2020-04-27T01:33:43

It’s pretty common to fit a function to a dataset when you’re a data scientist. But in many cases, it’s not clear what kind of function might be most appropriate—linear? quadratic? sinusoidal? some...

Keeping ourselves honest when we work with observational healthcare data from 2020-04-20T02:43:37

The abundance of data in healthcare, and the value we could capture from structuring and analyzing that data, is a huge opportunity. It also presents huge challenges. One of the biggest challenges ...

Changing our formulation of AI to avoid runaway risks: Interview with Prof. Stuart Russell from 2020-04-13T01:55:01

AI is evolving incredibly quickly, and thinking now about where it might go next (and how we as a species and a society should be prepared) is critical. Professor Stuart Russell, an AI expert at UC...

Putting machine learning into a database from 2020-04-06T01:51:56

Most data scientists bounce back and forth regularly between doing analysis in databases using SQL and building and deploying machine learning pipelines in R or python. But if we think ahead a few ...

The work-from-home episode from 2020-03-29T22:23:42

Many of us have the privilege of working from home right now, in an effort to keep ourselves and our family safe and slow the transmission of covid-19. But working from home is an adjustment for ma...

Understanding Covid-19 transmission: what the data suggests about how the disease spreads from 2020-03-23T01:03:34

Covid-19 is turning the world upside down right now. One thing that’s extremely important to understand, in order to fight it as effectively as possible, is how the virus spreads and especially how...

Network effects re-release: when the power of a public health measure lies in widespread adoption from 2020-03-15T22:43:38

This week’s episode is a re-release of a recent episode, which we don’t usually do but it seems important for understanding what we can all do to slow the spread of covid-19. In brief, public healt...

Causal inference when you can't experiment: difference-in-differences and synthetic controls from 2020-03-09T01:39:19

When you need to untangle cause and effect, but you can’t run an experiment, it’s time to get creative. This episode covers difference in differences and synthetic controls, two observational causa...

Better know a distribution: the Poisson distribution from 2020-03-02T02:55:28

This is a re-release of an episode that originally ran on October 21, 2018. The Poisson distribution is a probability distribution function used for events that happen in time or space. It’s su...

The Lottery Ticket Hypothesis from 2020-02-23T23:03:25

Recent research into neural networks reveals that sometimes, not all parts of the neural net are equally responsible for the performance of the network overall. Instead, it seems like (in some neu...

Interesting technical issues prompted by GDPR and data privacy concerns from 2020-02-17T01:50:20

Data privacy is a huge issue right now, after years of consumers and users gaining awareness of just how much of their personal data is out there and how companies are using it. Policies like GDPR ...

Thinking of data science initiatives as innovation initiatives from 2020-02-10T01:10:21

Put yourself in the shoes of an executive at a big legacy company for a moment, operating in virtually any market vertical: you’re constantly hearing that data science is revolutionizing the world ...

Building a curriculum for educating data scientists: Interview with Prof. Xiao-Li Meng from 2020-02-02T23:36:23

As demand for data scientists grows, and it remains as relevant as ever that practicing data scientists have a solid methodological and technical foundation for their work, higher education institu...

Running experiments when there are network effects from 2020-01-27T00:13:52

Traditional A/B tests assume that whether or not one person got a treatment has no effect on the experiment outcome for another person. But that’s not a safe assumption, especially when there are n...

Zeroing in on what makes adversarial examples possible from 2020-01-20T02:41:20

Adversarial examples are really, really weird: pictures of penguins that get classified with high certainty by machine learning algorithms as drumsets, or random noise labeled as pandas, or any one...

Unsupervised Dimensionality Reduction: UMAP vs t-SNE from 2020-01-13T00:53:19

Dimensionality reduction redux: this episode covers UMAP, an unsupervised algorithm designed to make high-dimensional data easier to visualize, cluster, etc. It’s similar to t-SNE but has some adva...

Data scientists: beware of simple metrics from 2020-01-05T22:54:57

Picking a metric for a problem means defining how you’ll measure success in solving that problem. Which sounds important, because it is, but oftentimes new data scientists only get experience with ...

Communicating data science, from academia to industry from 2019-12-30T01:53:14

For something as multifaceted and ill-defined as data science, communication and sharing best practices across the field can be extremely valuable but also extremely, well, multifaceted and ill-def...

Optimizing for the short-term vs. the long-term from 2019-12-23T02:50:53

When data scientists run experiments, like A/B tests, it’s really easy to plan on a period of a few days to a few weeks for collecting data. The thing is, the change that’s being evaluated might ha...

Interview with Prof. Andrew Lo, on using data science to inform complex business decisions from 2019-12-16T03:15:09

This episode features Prof. Andrew Lo, the author of a paper that we discussed recently on Linear Digressions, in which Prof. Lo uses data to predict whether a medicine in the development pipeline ...

Using machine learning to predict drug approvals from 2019-12-08T22:56:05

One of the hottest areas in data science and machine learning right now is healthcare: the size of the healthcare industry, the amount of data it generates, and the myriad improvements possible in ...

Facial recognition, society, and the law from 2019-12-02T03:14:14

Facial recognition being used in everyday life seemed far-off not too long ago. Increasingly, it’s being used and advanced widely and with increasing speed, which means that our technical capabilit...

Lessons learned from doing data science, at scale, in industry from 2019-11-25T00:45:42

If you’ve taken a machine learning class, or read up on A/B tests, you likely have a decent grounding in the theoretical pillars of data science. But if you’re in a position to have actually built ...

Varsity A/B Testing from 2019-11-18T02:09:46

When you want to understand if doing something causes something else to happen, like if a change to a website causes a dip or rise in downstream conversions, the gold standard analysis method is ...

The Care and Feeding of Data Scientists: Growing Careers from 2019-11-11T03:44:18

This is the third and final installment of a conversation with Michelangelo D’Agostino, VP of Data Science and Engineering at Shoprunner, about growing and mentoring data scientists on your team. Some o...

The Care and Feeding of Data Scientists: Recruiting and Hiring Data Scientists from 2019-11-04T00:21:56

This week’s episode is the second in a three-part interview series with Michelangelo D’Agostino, VP of Data Science at Shoprunner. This discussion centers on building a team, which means recruiting...

The Care and Feeding of Data Scientists: Becoming a Data Science Manager from 2019-10-28T01:27:58

Data science management isn’t easy, and many data scientists are finding themselves learning on the job how to manage data science teams as they get promoted into more formal leadership roles. O’Re...

Procella: YouTube's super-system for analytics data storage from 2019-10-21T01:27:45

If you’re trying to manage a project that serves up analytics data for a few very distinct uses, you’d be wise to consider having custom solutions for each use case that are optimized for the needs...

What's *really* so hard about feature engineering? from 2019-10-06T22:37:49

Feature engineering is ubiquitous but gets surprisingly difficult surprisingly fast. What could be so complicated about just keeping track of what data you have, and how you made it? A lot, as it t...

Data storage for analytics: stars and snowflakes from 2019-09-30T11:22:15

If you’re a data scientist or data engineer thinking about how to store data for analytics uses, one of the early choices you’ll have to make (or live with, if someone else made it) is how to lay o...

Data storage: transactions vs. analytics from 2019-09-23T01:49:59

Data scientists and software engineers both work with databases, but they use them for different purposes. So if you’re a data scientist thinking about the best way to store and access data for you...

GROVER: an algorithm for making, and detecting, fake news from 2019-09-16T03:21:34

There are a few things that seem to be very popular in discussions of machine learning algorithms these days. First is the role that algorithms play now, or might play in the future, when it comes ...

Data science teams as innovation initiatives from 2019-09-09T02:24:55

When a big, established company is thinking about their data science strategy, chances are good that whatever they come up with, it’ll be somewhat at odds with the company’s current structure and p...

Organizational Models for Data Scientists from 2019-08-25T23:06:52

When data science is hard, sometimes it’s because the algorithms aren’t converging or the data is messy, and sometimes it’s because of organizational or business issues: the data scientists aren’t ...

Data Shapley from 2019-08-19T02:38:16

We talk often about which features in a dataset are most important, but recently a new paper has started making the rounds that turns the idea of importance on its head: Data Shapley is an algorith...

Putting the "science" in data science: the scientific method, the null hypothesis, and p-hacking from 2019-07-29T01:30:54

The modern scientific method is one of the greatest (perhaps the greatest?) systems we have for discovering knowledge about the world. It’s no surprise then that many data scientists have found thei...

Interleaving from 2019-07-22T12:20:58

If you’re Google or Netflix, and you have a recommendation or search system as part of your bread and butter, what’s the best way to test improvements to your algorithm? A/B testing is the canonica...

Deepfakes from 2019-07-01T01:25:07

Generative adversarial networks (GANs) are producing some of the most realistic artificial videos we’ve ever seen. These videos are usually called “deepfakes”. Even to an experienced eye, it can be...

Revisiting Biased Word Embeddings from 2019-06-24T00:26:07

The topic of bias in word embeddings gets yet another pass this week. It all started a few years ago, when an analogy task performed on Word2Vec embeddings showed some indications of gender bias ar...

Attention in Neural Nets from 2019-06-17T00:28:35

There’s been a lot of interest lately in the attention mechanism in neural nets—it’s got a colloquial name (who’s not familiar with the idea of “attention”?) but it’s more like a technical trick th...

Interview with Joel Grus from 2019-06-10T02:05:47

This week’s episode is a special one, as we’re welcoming a guest: Joel Grus is a data scientist with a strong software engineering streak, and he does an impressive amount of speaking, writing, and...

Re - Release: Factorization Machines from 2019-06-03T01:32:39

What do you get when you cross a support vector machine with matrix factorization? You get a factorization machine, and a darn fine algorithm for recommendation engines.

Re-release: Auto-generating websites with deep learning from 2019-05-27T02:01:11

We've already talked about neural nets in some detail (links below), and in particular we've been blown away by the way that image recognition from convolutional neural nets can be fed into recurre...

Advice to those trying to get a first job in data science from 2019-05-19T21:50:13

We often hear from folks wondering what advice we can give them as they search for their first job in data science. What does a hiring manager look for? Should someone focus on taking classes onlin...

Re - Release: Machine Learning Technical Debt from 2019-05-12T23:07:14

This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you've worked in software before, you're probably familiar with the i...

Estimating Software Projects, and Why It's Hard from 2019-05-05T22:27:24

If you’re like most software engineers and, especially, data scientists, you find it really hard to make accurate estimates of how long a project will take to complete. Don’t feel bad: statistics i...

The Black Hole Algorithm from 2019-04-29T00:55:57

53.5 million light-years away, there’s a gigantic galaxy called M87 with something interesting going on inside it. Between Einstein’s theory of relativity and the motion of a group of stars in the ...

Structure in AI from 2019-04-21T22:29:02

As artificial intelligence algorithms get applied to more and more domains, a question that often arises is whether to somehow build structure into the algorithm itself to mimic the structure of th...

The Great Data Science Specialist vs. Generalist Debate from 2019-04-15T00:55:41

It’s not news that data scientists are expected to be capable in many different areas (writing software, designing experiments, analyzing data, talking to non-technical stakeholders). One thing tha...

Google X, and Taking Risks the Smart Way from 2019-04-08T01:10:57

If you work in data science, you’re well aware of the sheer volume of high-risk, high-reward projects that are hypothetically possible. The fact that they’re high-reward means they’re exciting to t...

Statistical Significance in Hypothesis Testing from 2019-04-01T01:34:53

When you are running an AB test, one of the most important questions is how much data to collect. Collect too little, and you can end up drawing the wrong conclusion from your experiment. But in a ...

The Language Model Too Dangerous to Release from 2019-03-25T01:39:45

OpenAI recently created a cutting-edge new natural language processing model, but unlike all their other projects so far, they have not released it to the public. Why? It seems to be a little too g...

The cathedral and the bazaar from 2019-03-17T22:47:01

Imagine you have two choices of how to build something: top-down and controlled, with a few people playing a master designer role, or bottom-up and free-for-all, with nobody playing an explicit arc...

AlphaStar from 2019-03-11T01:18:26

It’s time for our latest installment in the series on artificial intelligence agents beating humans at games that we thought were safe from the robots. In this case, the game is StarCraft, and the...

Are machine learning engineers the new data scientists? from 2019-03-04T02:57:19

For many data scientists, maintaining models and workflows in production is both a huge part of their job and not something they necessarily trained for if their background is more in statistics or...

Interview with Alex Radovic, particle physicist turned machine learning researcher from 2019-02-25T01:59:03

You’d be hard-pressed to find a field with bigger, richer, and more scientifically valuable data than particle physics. Years before “data scientist” was even a term, particle physicists were inven...

K Nearest Neighbors from 2019-02-17T23:57:23

K Nearest Neighbors is an algorithm with secrets. On one hand, the algorithm itself is as straightforward as possible: find the labeled points nearest the point that you need to predict, and make a...

Not every deep learning paper is great. Is that a problem? from 2019-02-11T00:06:33

Deep learning is a field that’s growing quickly. That’s good! There are lots of new deep learning papers put out every day. That’s good too… right? What if not every paper out there is particularly...

The Assumptions of Ordinary Least Squares from 2019-02-03T23:24:15

Ordinary least squares (OLS) is often used synonymously with linear regression. If you’re a data scientist, machine learner, or statistician, you bump into it daily. If you haven’t had the opportun...

Quantile Regression from 2019-01-28T01:27:40

Linear regression is a great tool if you want to make predictions about the mean value that an outcome will have given certain values for the inputs. But what if you want to predict the median? Or ...

Heterogeneous Treatment Effects from 2019-01-20T23:57:56

When data scientists use a linear regression to look for causal relationships between a treatment and an outcome, what they’re usually finding is the so-called average treatment effect. In other wo...

Pre-training language models for natural language processing problems from 2019-01-14T00:42:31

When you build a model for natural language processing (NLP), such as a recurrent neural network, it helps a ton if you’re not starting from zero. In other words, if you can draw upon other dataset...

Re-release: Word2Vec from 2018-12-31T01:56:03

Bringing you another old classic this week, as we gear up for 2019! See you next week with new content. Word2Vec is probably the go-to algorithm for vectorizing text data these days. Which makes ...

Re - Release: The Cold Start Problem from 2018-12-23T20:23:33

We’re taking a break for the holidays, chilling with the dog and an eggnog (Katie) and the cat and some spiced cider (Ben). Here’s an episode from a while back for you to enjoy. See you again in 20...

Convex (and non-convex) Optimization from 2018-12-17T03:06:42

Convex optimization is one of the keys to data science, both because some problems straight-up call for optimization solutions and because popular algorithms like a gradient descent solution to ord...

The Normal Distribution and the Central Limit Theorem from 2018-12-09T18:58:28

When you think about it, it’s pretty amazing that we can draw conclusions about huge populations, even the whole world, based on datasets that are comparatively very small (a few thousand, or a few...

Software 2.0 from 2018-12-02T23:23:05

Neural nets are a way you can model a system, sure, but if you take a step back, squint, and tilt your head, they can also be called… software? Not in the sense that they’re written in code, but in...

Limitations of Deep Nets for Computer Vision from 2018-11-18T19:01:28

Deep neural nets have a deserved reputation as the best-in-breed solution for computer vision problems. But there are many aspects of human vision that we take for granted but where neural nets str...

Building Data Science Teams from 2018-11-12T03:16:46

At many places, data scientists don’t work solo anymore—it’s a team sport. But data science teams aren’t simply teams of data scientists working together. Instead, they’re usually cross-functional ...

Optimized Optimized Web Crawling from 2018-11-04T21:38:32

Last week’s episode, about methods for optimized web crawling logic, left off on a bit of a cliffhanger: the data scientists had found a solution to the problem, but it wasn’t something that the en...

Optimized Web Crawling from 2018-10-28T23:56:36

Got a fun optimization problem for you this week! It’s a two-for-one: how do you optimize the web crawling logic of an operation like Google search so that the results are, on average, as up-to-dat...

Searching for Datasets with Google from 2018-10-15T01:11:58

If you wanted to find a dataset of jokes, how would you do it? What about a dataset of podcast episodes? If your answer was “I’d try Google,” you might have been disappointed—Google is a great sear...

It's our fourth birthday from 2018-10-08T02:33:55

We started Linear Digressions 4 years ago… this isn’t a technical episode, just two buddies shooting the breeze about something we’ve somehow built together.

Gigantic Searches in Particle Physics from 2018-09-30T18:52:04

This week, we’re dusting off the ol’ particle physics PhD to bring you an episode about ambitious new model-agnostic searches for new particles happening at CERN. Traditionally, new particles have ...

Data Engineering from 2018-09-24T01:10:13

If you’re a data scientist, you know how important it is to keep your data orderly, clean, moving smoothly between different systems, well-documented… there’s a ton of work that goes into building ...

Text Analysis for Guessing the NYTimes Op-Ed Author from 2018-09-16T18:13:09

A very intriguing op-ed was published in the NY Times recently, in which the author (a senior official in the Trump White House) claimed to be a minor saboteur of sorts, acting with his or her coll...

The Three Types of Data Scientists, and What They Actually Do from 2018-09-09T19:00:09

If you've been in data science for more than a year or two, chances are you've noticed changes in the field as it's grown and matured. And if you're newer to the field, you may feel like there's a ...

Agile Development for Data Scientists, Part 2: Where Modifications Help from 2018-08-26T19:59:12

There's just too much interesting stuff at the intersection of agile software development and data science for us to be able to cover it all in one episode, so this week we're picking up where we l...

Agile Development for Data Scientists, Part 1: The Good from 2018-08-19T18:06:19

If you're a data scientist at a firm that does a lot of software building, chances are good that you've seen or heard engineers sometimes talking about "agile software development." If you don't wo...

Re - Release: How To Lose At Kaggle from 2018-08-13T02:31:51

We've got a classic for you this week as we take a week off for the dog days of summer. See you again next week! Competing in a machine learning competition on Kaggle is a kind of rite of passage ...

Troubling Trends In Machine Learning Scholarship from 2018-08-06T01:31:03

There are a lot of great machine learning papers coming out every day--and, if we're being honest, some papers that are not as great as we'd wish. In some ways this is symptomatic of a field that's g...

Can Fancy Running Shoes Cause You To Run Faster? from 2018-07-29T19:12:09

The stars aligned for me (Katie) this past weekend: I raced my first half-marathon in a long time and got to read a great article from the NY Times about a new running shoe that Nike claims can mak...

Compliance Bias from 2018-07-22T16:07:54

When you're using an AB test to understand the effect of a treatment, there are a lot of assumptions about how the treatment (and control, for that matter) get applied. For example, it's easy to th...

AI Winter from 2018-07-15T20:11:52

Artificial Intelligence has been widely lauded as a solution to almost any problem. But as we juxtapose the hype in the field against the real-world benefits we see, it raises the question: Are we ...

Rerelease: How to Find New Things to Learn from 2018-07-08T22:28:29

We like learning on vacation. And we're on vacation, so we thought we'd re-air this episode about how to learn. Original Episode: https://lineardigressions.com/episodes/2017/5/14/how-to-find-new-t...

Rerelease: Space Codes from 2018-07-02T04:36:56

We're on vacation on Mars, so we won't be communicating with you all directly this week. Though, if we wanted to, we could probably use this episode to help get started. Original Episode: http://l...

Rerelease: Anscombe's Quartet from 2018-06-25T01:20:25

We're on vacation, so we hope you enjoy this episode while we each sip cocktails on the beach. Original Episode: http://lineardigressions.com/episodes/2017/6/18/anscombes-quartet Original Summary: ...

Rerelease: Hurricanes Produced from 2018-06-18T17:00:14

Now that hurricane season is upon us again (and we are on vacation), we thought a look back on our hurricane forecasting episode was prudent. Stay safe out there.

GDPR from 2018-06-11T02:24:45

By now, you have probably heard of GDPR, the EU's new data privacy law. It's the reason you've been getting so many emails about everyone's updated privacy policy. In this episode, we talk about s...

Git for Data Scientists from 2018-06-03T17:52:23

If you're a data scientist, chances are good that you've heard of git, which is a system for version controlling code. Chances are also good that you're not quite as up on git as you want to be--gi...

Analytics Maturity from 2018-05-20T15:09:39

Data science and analytics are hot topics in business these days, but for a lot of folks looking to bring data into their organization, it can be hard to know where to start and what it looks like ...

SHAP: Shapley Values in Machine Learning from 2018-05-13T14:24:38

Shapley values in machine learning are an interesting and useful enough innovation that we figured hey, why not do a two-parter? Our last episode focused on explaining what Shapley values are: they...

Game Theory for Model Interpretability: Shapley Values from 2018-05-07T02:17:19

As machine learning models get into the hands of more and more users, there's an increasing expectation that black box isn't good enough: users want to understand why the model made a given predict...

AutoML from 2018-04-30T02:50:23

If you were a machine learning researcher or data scientist ten years ago, you might have spent a lot of time implementing individual algorithms like decision trees and neural networks by hand. If ...

CPUs, GPUs, TPUs: Hardware for Deep Learning from 2018-04-23T02:52:42

A huge part of the ascent of deep learning in the last few years is related to advances in computer hardware that makes it possible to do the computational heavy lifting required to build models wi...

A Technical Introduction to Capsule Networks from 2018-04-16T01:12:25

Last episode we talked conceptually about capsule networks, the latest and greatest computer vision innovation to come out of Geoff Hinton's lab. This week we're getting a little more into the tech...

A Conceptual Introduction to Capsule Networks from 2018-04-09T01:59:54

Convolutional nets are great for image classification... if this were 2016. But it's 2018 and Canada's greatest neural networker Geoff Hinton has some new ideas, namely capsule networks. Capsule ne...

Listen
Linear Digressions
Convolutional Neural Nets from 2018-04-02T01:40:08

If you've done image recognition or computer vision tasks with a neural network, you've probably used a convolutional neural net. This episode is all about the architecture and implementation detai...

Listen
Linear Digressions
Google Flu Trends from 2018-03-26T01:20:41

It's been a nasty flu season this year. So we were remembering a story from a few years back (but not covered yet on this podcast) about when Google tried to predict flu outbreaks faster than the C...

Listen
Linear Digressions
How to pick projects for a professional data science team from 2018-03-19T03:07:33

This week's episode is for data scientists, sure, but also for data science managers and executives at companies with data science teams. These folks all think very differently about the same ques...

Listen
Linear Digressions
Autoencoders from 2018-03-12T01:47:10

Autoencoders are neural nets that are optimized for creating outputs that... look like the inputs to the network. Turns out this is a not-too-shabby way to do unsupervised machine learning with neu...

Listen
Linear Digressions
When Private Data Isn't Private Anymore from 2018-03-05T03:35:26

After all the back-patting around making data science datasets and code more openly available, we figured it was time to also dump a bucket of cold water on everyone's heads and talk about the thin...

Listen
Linear Digressions
What makes a machine learning algorithm "superhuman"? from 2018-02-26T04:52:57

A few weeks ago, we podcasted about a neural network that was being touted as "better than doctors" in diagnosing pneumonia from chest x-rays, and how the underlying dataset used to train the algor...

Listen
Linear Digressions
Open Data and Open Science from 2018-02-19T01:39:16

One interesting trend we've noted recently is the proliferation of papers, articles and blog posts about data science that don't just tell the result--they include data and code that allow anyone t...

Listen
Linear Digressions
Defining the quality of a machine learning production system from 2018-02-12T02:00:45

Building a machine learning system and maintaining it in production are two very different things. Some folks over at Google wrote a paper that shares their thoughts around all the items you might ...

Listen
Linear Digressions
Auto-generating websites with deep learning from 2018-02-04T23:02:11

We've already talked about neural nets in some detail (links below), and in particular we've been blown away by the way that image recognition from convolutional neural nets can be fed into recurre...

Listen
Linear Digressions
The Case for Learned Index Structures, Part 2: Hash Maps and Bloom Filters from 2018-01-29T02:15:43

Last week we started the story of how you could use a machine learning model in place of a data structure, and this week we wrap up with an exploration of Bloom Filters and Hash Maps. Just like las...

Listen
Linear Digressions
The Case for Learned Index Structures, Part 1: B-Trees from 2018-01-22T02:32:28

Jeff Dean and his collaborators at Google are turning the machine learning world upside down (again) with a recent paper about how machine learning models can be used as surprisingly effective subs...

Listen
Linear Digressions
Challenges with Using Machine Learning to Classify Chest X-Rays from 2018-01-15T01:57:21

Another installment in our "machine learning might not be a silver bullet for solving medical problems" series. This week, we have a high-profile blog post that has been making the rounds for the l...

Listen
Linear Digressions
The Fourier Transform from 2018-01-08T02:07:46

The Fourier transform is one of the handiest tools in signal processing for dealing with periodic time series data. Using a Fourier transform, you can break apart a complex periodic function into a...

Listen
Linear Digressions
Statistics of Beer from 2018-01-02T01:57:54

What better way to kick off a new year than with an episode on the statistics of brewing beer?

Listen
Linear Digressions
Re-Release: Random Kanye from 2017-12-24T19:07:48

We have a throwback episode for you today as we take the week off to enjoy the holidays. This week: what happens when you have a Markov chain that generates mashup Kanye West lyrics with Bible vers...

Listen
Linear Digressions
Debiasing Word Embeddings from 2017-12-18T02:31:01

When we covered the Word2Vec algorithm for embedding words, we mentioned parenthetically that the word embeddings it produces can sometimes be a little bit less than ideal--in particular, gender bi...

Listen
Linear Digressions
The Kernel Trick and Support Vector Machines from 2017-12-11T01:58:41

Picking up after last week's episode about maximal margin classifiers, this week we'll go into the kernel trick and how that (combined with maximal margin algorithms) gives us the much-vaunted supp...

Listen
Linear Digressions
Maximal Margin Classifiers from 2017-12-04T04:03:02

Maximal margin classifiers are a way of thinking about supervised learning entirely in terms of the decision boundary between two classes, and defining that boundary in a way that maximizes the dis...

Listen
Linear Digressions
Re-Release: The Cocktail Party Problem from 2017-11-27T02:11:06

Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!

Listen
Linear Digressions
Clustering with DBSCAN from 2017-11-20T03:08:14

DBSCAN is a density-based clustering algorithm for doing unsupervised learning. It's pretty nifty: with just two parameters, you can specify "dense" regions in your data, and grow those regions ou...

Listen
Linear Digressions
The Kaggle Survey on Data Science from 2017-11-13T02:49:44

Want to know what's going on in data science these days? There's no better way than to analyze a survey with over 16,000 responses that was recently released by Kaggle. Kaggle asked practicing and as...

Listen
Linear Digressions
Machine Learning: The High Interest Credit Card of Technical Debt from 2017-11-06T04:35:17

This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you've worked in software before, you're probably familiar with the i...

Listen
Linear Digressions
Improving Upon a First-Draft Data Science Analysis from 2017-10-30T01:38:28

There are a lot of good resources out there for getting started with data science and machine learning, where you can walk through starting with a dataset and ending up with a model and set of pred...

Listen
Linear Digressions
Survey Raking from 2017-10-23T02:51:49

It's quite common for survey respondents not to be representative of the larger population from which they are drawn. But if you're a researcher, you need to study the larger population using data...

Listen
Linear Digressions
Happy Hacktoberfest from 2017-10-16T01:46:19

It's the middle of October, so you've already made two pull requests to open source repos, right? If you have no idea what we're talking about, spend the next 20 minutes or so with us talking about...

Listen
Linear Digressions
Re-Release: Kalman Runners from 2017-10-09T02:28:24

In honor of the Chicago marathon this weekend (and due in large part to Katie recovering from running in it...) we have a re-release of an episode about Kalman filters, which is part algorithm part...

Listen
Linear Digressions
Neural Net Dropout from 2017-10-02T03:32:56

Neural networks are complex models with many parameters and can be prone to overfitting.  There's a surprisingly simple way to guard against this: randomly destroy connections between hidden units,...

Listen
Linear Digressions
Disciplined Data Science from 2017-09-25T01:49:41

As data science matures as a field, it's becoming clearer what attributes a data science team needs to have to elevate their work to the next level. Most of our episodes are about the cool work be...

Listen
Linear Digressions
Hurricane Forecasting from 2017-09-18T01:37:15

It's been a busy hurricane season in the Southeastern United States, with millions of people making life-or-death decisions based on the forecasts around where the hurricanes will hit and with what...

Listen
Linear Digressions
Finding Spy Planes with Machine Learning from 2017-09-11T02:11:22

There are law enforcement surveillance aircraft circling over the United States every day, and in this episode, we'll talk about how some folks at BuzzFeed used public data and machine learning to ...

Listen
Linear Digressions
Data Provenance from 2017-09-04T01:35

Software engineers are familiar with the idea of versioning code, so you can go back later and revive a past state of the system.  For data scientists who might want to reconstruct past models, tho...

Listen
Linear Digressions
Adversarial Examples from 2017-08-28T02:25:14

Even as we rely more and more on machine learning algorithms to help with everyday decision-making, we're learning more and more about how they're frighteningly easy to fool sometimes. Today we ha...

Listen
Linear Digressions
Jupyter Notebooks from 2017-08-21T01:09:32

This week's episode is just in time for JupyterCon in NYC, August 22-25... Jupyter notebooks are probably familiar to a lot of data nerds out there as a great open-source tool for exploring data, ...

Listen
Linear Digressions
Curing Cancer with Machine Learning is Super Hard from 2017-08-14T01:49:52

Today, a dispatch on what can go wrong when machine learning hype outpaces reality: a high-profile partnership between IBM Watson and MD Anderson Cancer Center has recently hit the rocks as it turn...

Listen
Linear Digressions
KL Divergence from 2017-08-07T03:07:15

Kullback Leibler divergence, or KL divergence, is a measure of information loss when you try to approximate one distribution with another distribution.  It comes to us originally from information t...

Listen
Linear Digressions
Sabermetrics from 2017-07-31T01:15:37

It's moneyball time! SABR (the Society for American Baseball Research) is the world's largest organization of statistics-minded baseball enthusiasts, who are constantly applying the craft of scien...

Listen
Linear Digressions
What Data Scientists Can Learn from Software Engineers from 2017-07-24T01:52:26

We're back again with friend of the pod Walt, former software engineer extraordinaire and current data scientist extraordinaire, to talk about some best practices from software engineering that are...

Listen
Linear Digressions
Software Engineering to Data Science from 2017-07-17T02:36

Data scientists and software engineers often work side by side, building out and scaling technical products and services that are data-heavy but also require a lot of software engineering to build ...

Listen
Linear Digressions
Re-Release: Fighting Cholera with Data, 1854 from 2017-07-10T00:19:56

This episode was first released in November 2014. In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a co...

Listen
Linear Digressions
Re-Release: Data Mining Enron from 2017-07-02T17:53:42

This episode was first released in February 2015. In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other ma...

Listen
Linear Digressions
Factorization Machines from 2017-06-26T02:23:14

What do you get when you cross a support vector machine with matrix factorization? You get a factorization machine, and a darn fine algorithm for recommendation engines.

Listen
Linear Digressions
Anscombe's Quartet from 2017-06-19T02:19:56

Anscombe's Quartet is a set of four datasets that have the same mean, variance and correlation but look very different. It's easy to think that having a good set of summary statistics (like mean, ...

Listen
Linear Digressions
Page Rank from 2017-06-05T01:46:35

The year: 1998.  The size of the web: 150 million pages.  The problem: information retrieval.  How do you find the "best" web pages to return in response to a query?  A graduate student named Larry...

Listen
Linear Digressions
Fractional Dimensions from 2017-05-29T02:54:46

We chat about fractional dimensions, and what the actual heck those are.

Listen
Linear Digressions
Things You Learn When Building Models for Big Data from 2017-05-22T01:44:13

As more and more data gets collected seemingly every day, and data scientists use that data for modeling, the technical limits associated with machine learning on big datasets keep getting pushed b...

Listen
Linear Digressions
How to Find New Things to Learn from 2017-05-15T01:49:26

If you're anything like us, you a) always are curious to learn more about data science and machine learning and stuff, and b) are usually overwhelmed by how much content is out there (not all of it...

Listen
Linear Digressions
Federated Learning from 2017-05-08T01:50:40

As machine learning makes its way into more and more mobile devices, an interesting question presents itself: how can we have an algorithm learn from training data that's being supplied as users in...

Listen
Linear Digressions
Word2Vec from 2017-05-01T02:17:36

Word2Vec is probably the go-to algorithm for vectorizing text data these days.  Which makes sense, because it is wicked cool.  Word2Vec has it all: neural networks, skip-grams and bag-of-words impl...

Listen
Linear Digressions
Feature Processing for Text Analytics from 2017-04-24T02:17:24

It seems like every day there are more and more machine learning problems that involve learning on text data, but text itself makes for fairly lousy inputs to machine learning algorithms. That's why...

Listen
Linear Digressions
Education Analytics from 2017-04-17T02:09:26

This week we'll hop into the rapidly developing industry around predictive analytics for education. For many of the students who eventually drop out, data science is showing that there might be ea...

Listen
Linear Digressions
A Technical Deep Dive on Stanley, the First Self-Driving Car from 2017-04-10T01:50:01

In our follow-up episode to last week's introduction to the first self-driving car, we will be doing a technical deep dive this week and talking about the most important systems for getting a car t...

Listen
Linear Digressions
An Introduction to Stanley, the First Self-Driving Car from 2017-04-03T01:34:17

In October 2005, 23 cars lined up in the desert for a 140 mile race.  Not one of those cars had a driver.  This was the DARPA grand challenge to see if anyone could build an autonomous vehicle capa...

Listen
Linear Digressions
Feature Importance from 2017-03-27T01:53:25

Figuring out what features actually matter in a model is harder than you might first guess. When a human makes a decision, you can just ask them--why did you do that? But with machi...

Listen
Linear Digressions
Space Codes! from 2017-03-20T02:50:57

It's hard to get information to and from Mars.  Mars is very far away, and expensive to get to, and the bandwidth for passing messages with Earth is not huge.  The messages you do pass have to trav...

Listen
Linear Digressions
Finding (and Studying) Wikipedia Trolls from 2017-03-13T01:44:55

You may be shocked to hear this, but sometimes, people on the internet can be mean.  For some of us this is just a minor annoyance, but if you're a maintainer or contributor of a large project like...

Listen
Linear Digressions
A Sprint Through What's New in Neural Networks from 2017-03-06T03:27:12

Advances in neural networks are moving fast enough that, even though it seems like we talk about them all the time around here, it also always seems like we're barely keeping up.  So this week we h...

Listen
Linear Digressions
Stein's Paradox from 2017-02-27T02:51:41

When you're estimating something about some object that's a member of a larger group of similar objects (say, the batting average of a baseball player, who belongs to a baseball team), how should y...

Listen
Linear Digressions
Empirical Bayes from 2017-02-20T03:30:06

Say you're looking to use some Bayesian methods to estimate parameters of a system. You've got the normalization figured out, and the likelihood, but the prior... what should you use for a prior? ...

Listen
Linear Digressions
Endogenous Variables and Measuring Protest Effectiveness from 2017-02-13T03:31

Have you been out protesting lately, or watching the protests, and wondered how much effect they might have on lawmakers? It's a tricky question to answer, since usually we need randomly distribut...

Listen
Linear Digressions
Calibrated Models from 2017-02-06T01:56:12

Remember last week, when we were talking about how great the ROC curve is for evaluating models? How things change... This week, we're exploring calibrated risk models, because that's a kind of m...

Listen
Linear Digressions
Rock the ROC Curve from 2017-01-30T03:38:46

This week: everybody's favorite WWII-era classifier metric! But it's not just for winning wars, it's a fantastic go-to metric for all your classifier quality needs.

Listen
Linear Digressions
Ensemble Algorithms from 2017-01-23T02:31:26

If one machine learning model is good, are two models better? In a lot of cases, the answer is yes. If you build many ok models, and then bring them all together and use them in combination to ma...

Listen
Linear Digressions
How to evaluate a translation: BLEU scores from 2017-01-16T01:59:01

As anyone who's encountered a badly translated text could tell you, not all translations are created equal. Some translations are smooth, fluent and sound like a poet wrote them; some are jerky, n...

Listen
Linear Digressions
Zero Shot Translation from 2017-01-09T03:20:57

Take Google-size data, the flexibility of a neural net, and all (well, most) of the languages of the world, and what you end up with is a pile of surprises. This episode is about some interesting ...

Listen
Linear Digressions
Google Neural Machine Translation from 2017-01-02T01:44:23

Recently, Google swapped out the backend for Google Translate, moving from a statistical phrase-based method to a recurrent neural network. This marks a big change in methodology: the tried-and-tr...

Listen
Linear Digressions
Data and the Future of Medicine: Interview with Precision Medicine Initiative researcher Matt Might from 2016-12-26T01:19:24

Today we are delighted to bring you an interview with Matt Might, computer scientist and medical researcher extraordinaire and architect of President Obama's Precision Medicine Initiative. As the ...

Listen
Linear Digressions
Special Crossover Episode: Partially Derivative interview with White House Data Scientist DJ Patil from 2016-12-18T17:53:52

We have the pleasure of bringing you a very special crossover episode this week: our friends at Partially Derivative (another great podcast about data science, you should check it out) recently int...

Listen
Linear Digressions
How to Lose at Kaggle from 2016-12-12T04:28:57

Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists. Losing unexpectedly at the very end of the contest is also something that a lot of us have e...

Listen
Linear Digressions
Attacking Discrimination in Machine Learning from 2016-12-05T03:38:54

Imagine there's an important decision to be made about someone, like a bank deciding whether to extend a loan, or a school deciding to admit a student--unfortunately, we're all too aware that discr...

Listen
Linear Digressions
Recurrent Neural Nets from 2016-11-28T02:47:44

This week, we're doing a crash course in recurrent neural networks--what the structural pieces are that make a neural net recurrent, how that structure helps RNNs solve certain time series problems...

Listen
Linear Digressions
Stealing a PIN with signal processing and machine learning from 2016-11-21T02:32:21

Want another reason to be paranoid when using the free coffee shop wifi? Allow us to introduce WindTalker, a system that cleverly combines a dose of signal processing with a dash of machine learni...

Listen
Linear Digressions
Neural Net Cryptography from 2016-11-14T04:06:57

Cryptography used to be the domain of information theorists and spies. There's a new player now: neural networks. Given the task of communicating securely, neural networks are inventing new encry...

Listen
Linear Digressions
Deep Blue from 2016-11-07T04:20:48

In 1997, Deep Blue was the IBM algorithm/computer that did what no one, at the time, thought possible: it beat the world's best chess player. It turns out, though, that one of the most important mo...

Listen
Linear Digressions
Organizing Google's Datasets from 2016-10-31T02:17:26

If you're a data scientist, there's a good chance you're used to working with a lot of data. But there's a lot of data, and then there's Google-scale amounts of data. Keeping all that data organi...

Listen
Linear Digressions
Fighting Cancer with Data Science: Followup from 2016-10-24T01:58:41

A few months ago, Katie started on a project for the Vice President's Cancer Moonshot surrounding how data can be used to better fight cancer. The project is all wrapped up now, so we wanted to te...

Listen
Linear Digressions
The 19-year-old determining the US election from 2016-10-17T01:01:23

Sick of the presidential election yet? We are too, but there's still almost a month to go, so let's just embrace it together. This week, we'll talk about one of the presidential polls, which has ...

Listen
Linear Digressions
How to Steal a Model from 2016-10-09T22:57:24

What does it mean to steal a model? It means someone (the thief, presumably) can re-create the predictions of the model without having access to the algorithm itself, or the training data. Sound ...

Listen
Linear Digressions
Regularization from 2016-10-03T02:13:50

Lots of data is usually seen as a good thing. And it is a good thing--except when it's not. In a lot of fields, a problem arises when you have many, many features, especially if there's a somewha...

Listen
Linear Digressions
The Cold Start Problem from 2016-09-26T02:24:38

You might sometimes find that it's hard to get started doing something, but once you're going, it gets easier. Turns out machine learning algorithms, and especially recommendation engines, feel th...

Listen
Linear Digressions
Open Source Software for Data Science from 2016-09-19T04:27:40

If you work in tech, software or data science, there's an excellent chance you use tools that are built upon open source software. This is software that's built and distributed not for a profit, b...

Listen
Linear Digressions
Scikit + Optimization = Scikit-Optimize from 2016-09-12T01:54:59

We're excited to welcome a guest, Tim Head, who is one of the maintainers of the scikit-optimize package. With all the talk about optimization lately, it felt appropriate to get in a few words wit...

Listen
Linear Digressions
Two Cultures: Machine Learning and Statistics from 2016-09-05T01:50:05

It's a funny thing to realize, but data science modeling is usually about either explainability, interpretation and understanding, or it's about predictive accuracy. But usually not both--optimizi...

Listen
Linear Digressions
Optimization Solutions from 2016-08-29T02:01:42

You've got an optimization problem to solve, and a less-than-forever amount of time in which to solve it. What do? Use a heuristic optimization algorithm, like a hill climber or simulated anneali...

Listen
Linear Digressions
Optimization Problems from 2016-08-22T00:25:56

If modeling is about predicting the unknown, optimization tries to answer the question of what to do, what decision to make, to get the best results out of a given situation. Sometimes that's stra...

Listen
Linear Digressions
Multi-level modeling for understanding DEADLY RADIOACTIVE GAS from 2016-08-15T01:49:42

Ok, this episode is only sort of about DEADLY RADIOACTIVE GAS. It's mostly about multilevel modeling, which is a way of building models with data that has distinct, related subgroups within it. W...

Listen
Linear Digressions
How Polls Got Brexit "Wrong" from 2016-08-08T01:37:18

Continuing the discussion of how polls do (and sometimes don't) tell us what to expect in upcoming elections--let's take a concrete example from the recent past, shall we? The Brexit referendum wa...

Listen
Linear Digressions
Election Forecasting from 2016-08-01T02:40:35

Not sure if you heard, but there's an election going on right now. Polls, surveys, and projections abound, as far as the eye can see. How to make sense of it all? How are the projections made? W...

Listen
Linear Digressions
Machine Learning for Genomics from 2016-07-25T02:14:47

Genomics data is some of the biggest #bigdata, and doing machine learning on it is unlocking new ways of thinking about evolution, genomic diseases like cancer, and what really makes each of us dif...

Listen
Linear Digressions
Climate Modeling from 2016-07-18T02:26:02

Hot enough for you? Climate models suggest that it's only going to get warmer in the coming years. This episode unpacks those models, so you understand how they work. A lot of the episodes we ...

Listen
Linear Digressions
Reinforcement Learning Gone Wrong from 2016-07-11T02:42:49

Last week’s episode on artificial intelligence gets a huge payoff this week—we’ll explore a wonderful couple of papers about all the ways that artificial intelligence can go wrong. Malevolent acto...

Listen
Linear Digressions
Reinforcement Learning for Artificial Intelligence from 2016-07-03T18:28:57

There’s a ton of excitement about reinforcement learning, a form of semi-supervised machine learning that underpins a lot of today’s cutting-edge artificial intelligence algorithms. Here’s a crash...

Listen
Linear Digressions
Differential Privacy: how to study people without being weird and gross from 2016-06-27T01:53:13

Apple wants to study iPhone users' activities and use it to improve performance. Google collects data on what people are doing online to try to improve their Chrome browser. Do you like the idea ...

Listen
Linear Digressions
How the sausage gets made from 2016-06-20T02:25:23

Something a little different in this episode--we'll be talking about the technical plumbing that gets our podcast from our brains to your ears. As it turns out, it's a multi-step bucket brigade pr...

Listen
Linear Digressions
SMOTE: makin' yourself some fake minority data from 2016-06-13T03:06:33

Machine learning on imbalanced classes: surprisingly tricky. Many (most?) algorithms tend to just assign the majority class label to all the data and call it a day. SMOTE is an algorithm for manu...

Listen
Linear Digressions
Conjoint Analysis: like AB testing, but on steroids from 2016-06-06T02:13:37

Conjoint analysis is like AB testing, but more bigger more better: instead of testing one or two things, you can test potentially dozens of options. Where might you use something like this? Well, ...

Listen
Linear Digressions
Traffic Metering Algorithms from 2016-05-30T01:57:10

This episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so t...

Listen
Linear Digressions
Um Detector 2: The Dynamic Time Warp from 2016-05-23T02:05:01

One tricky thing about working with time series data, like the audio data in our "um" detector (remember that? because we barely do...), is that sometimes events look really similar but one is a l...

Listen
Linear Digressions
Inside a Data Analysis: Fraud Hunting at Enron from 2016-05-16T02:36:10

It's storytime this week--the story, from beginning to end, of how Katie designed and built the main project for Udacity's Intro to Machine Learning class, when she was developing the course. The ...

Listen
Linear Digressions
What's the biggest #bigdata? from 2016-05-09T01:28:21

Data science is often mentioned in the same breath as big data. But how big is big data? And who has the biggest big data? CERN? YouTube? ... Something (or someone) else? Relevant link: h...

Listen
Linear Digressions
Data Contamination from 2016-05-02T02:24:06

Supervised machine learning assumes that the features and labels used for building a classifier are isolated from each other--basically, that you can't cheat by peeking. Turns out this can be easi...

Listen
Linear Digressions
Model Interpretation (and Trust Issues) from 2016-04-25T00:45:04

Machine learning algorithms can be black boxes--inputs go in, outputs come out, and what happens in the middle is anybody's guess. But understanding how a model arrives at an answer is critical fo...

Listen
Linear Digressions
Updates! Political Science Fraud and AlphaGo from 2016-04-18T02:48:04

We've got updates for you about topics from past shows! First, the political science scandal of the year 2015 has a new chapter, we'll remind you about the original story and then dive into what h...

Listen
Linear Digressions
Ecological Inference and Simpson's Paradox from 2016-04-11T02:43:12

Simpson's paradox is the data science equivalent of looking through one eye and seeing a very clear trend, and then looking through the other eye and seeing the very clear opposite trend. In one c...

Listen
Linear Digressions
Discriminatory Algorithms from 2016-04-04T02:30:09

Sometimes when we say an algorithm discriminates, we mean it can tell the difference between two types of items. But in this episode, we'll talk about another, more troublesome side to discriminat...

Listen
Linear Digressions
Recommendation Engines and Privacy from 2016-03-28T02:46:45

This episode started out as a discussion of recommendation engines, like Netflix uses to suggest movies. There's still a lot of that in here. But a related topic, which is both interesting and im...

Listen
Linear Digressions
Neural nets play cops and robbers (AKA generative adversarial networks) from 2016-03-21T02:58:49

One neural net is creating counterfeit bills and passing them off to a second neural net, which is trying to distinguish the real money from the fakes. Result: two neural nets that are better than...

Listen
Linear Digressions
A Data Scientist's View of the Fight against Cancer from 2016-03-14T03:26:27

In this episode, we're taking many episodes' worth of insights and unpacking an extremely complex and important question--in what ways are we winning the fight against cancer, where might that figh...

Listen
Linear Digressions
Congress Bots and DeepDrumpf from 2016-03-11T04:17:29

Hey, sick of the election yet? Fear not, there are algorithms that can automagically generate political-ish speech so that we never need to be without an endless supply of Congressional speeches a...

Listen
Linear Digressions
Multi-Armed Bandits from 2016-03-07T02:44:17

Multi-armed bandits: how to take your randomized experiment and make it harder better faster stronger. Basically, a multi-armed bandit experiment allows you to optimize for both learning and makin...

Listen
Linear Digressions
Experiments and Messy, Tricky Causality from 2016-03-04T03:54:04

"People with a family history of heart disease are more likely to eat healthy foods, and have a high incidence of heart attacks." Did the healthy food cause the heart attacks? Probably not. But ...

Listen
Linear Digressions
Backpropagation from 2016-02-29T03:58:10

The reason that neural nets are taking over the world right now is because they can be efficiently trained with the backpropagation algorithm. In short, backprop allows you to adjust the weights o...

Listen
Linear Digressions
Text Analysis on the State Of The Union from 2016-02-26T03:51:42

First up in this episode: a crash course in natural language processing, and important steps if you want to use machine learning techniques on text data. Then we'll take that NLP know-how and talk...

Listen
Linear Digressions
Paradigms in Artificial Intelligence from 2016-02-22T04:32:25

Artificial intelligence includes a number of different strategies for how to make machines more intelligent, and often more human-like, in their ability to learn and solve problems. An ambitious g...

Listen
Linear Digressions
Survival Analysis from 2016-02-19T03:44:06

Survival analysis is all about studying how long until an event occurs--it's used in marketing to study how long a customer stays with a service, in epidemiology to estimate the duration of surviva...

Listen
Linear Digressions
Gravitational Waves from 2016-02-15T02:46:22

All aboard the gravitational waves bandwagon--with the first direct observation of gravitational waves announced this week, Katie's dusting off her physics PhD for a very special gravity-related ep...

Listen
Linear Digressions
The Turing Test from 2016-02-12T04:11:23

Let's imagine a future in which a truly intelligent computer program exists. How would it convince us (humanity) that it was intelligent? Alan Turing's answer to this question, proposed over 60 y...

Listen
Linear Digressions
Item Response Theory: how smart ARE you? from 2016-02-08T03:37:58

Psychometrics is all about measuring the psychological characteristics of people; for example, scholastic aptitude. How is this done? Tests, of course! But there's a chicken-and-egg problem here...

Listen
Linear Digressions
Go! from 2016-02-05T04:52:36

As you may have heard, a computer beat a world-class human player in Go last week. As recently as a year ago the prediction was that it would take a decade to get to this point, yet here we are, i...

Listen
Linear Digressions
Great Social Networks in History from 2016-02-01T04:22:02

The Medici were one of the great ruling families of Europe during the Renaissance. How did they come to rule? Not power, or money, or armies, but through the strength of their social network. An...

Listen
Linear Digressions
How Much to Pay a Spy (and a lil' more auctions) from 2016-01-29T05:36:33

A few small encores on auction theory, and then--how can you value a piece of information before you know what it is? Decision theory has some pointers. Some highly relevant information if you ar...

Listen
Linear Digressions
Sold! Auctions (Part 2) from 2016-01-25T02:58:07

The Google ads auction is a special kind of auction, one you might not know as well as the famous English auction (which we talked about in the last episode). But if it's what Google uses to sell ...

Listen
Linear Digressions
Going Once, Going Twice: Auctions (Part 1) from 2016-01-22T03:40:24

The Google AdWords algorithm is (famously) an auction system for allocating a massive amount of online ad space in real time--with that fascinating use case in mind, this episode is part one in a t...

Listen
Linear Digressions
Chernoff Faces and Minard Maps from 2016-01-18T03:38:33

A data visualization extravaganza in this episode, as we discuss Chernoff faces (you: "faces? huh?" us: "oh just you wait") and the greatest data visualization of all time, or at least the Napoleon...

Listen
Linear Digressions
t-SNE: Reduce Your Dimensions, Keep Your Clusters from 2016-01-15T04:05:49

Ever tried to visualize a cluster of data points in 40 dimensions? Or even 4, for that matter? We prefer to stick to 2, or maybe 3 if we're feeling well-caffeinated. The t-SNE algorithm is one o...

Listen
Linear Digressions
The [Expletive Deleted] Problem from 2016-01-11T04:23:53

The town of [expletive deleted], England, is responsible for the clbuttic [expletive deleted] problem. This week on Linear Digressions: we try really hard not to swear too much. Related links: http...

Listen
Linear Digressions
Unlabeled Supervised Learning--whaaa? from 2016-01-08T03:26:56

In order to do supervised learning, you need a labeled training dataset. Or do you...? Relevant links: http://www.cs.columbia.edu/~dplewis/candidacy/goldman00enhancing.pdf

Listen
Linear Digressions
Hacking Neural Nets from 2016-01-05T02:56:18

Machine learning: it can be fooled, just like you or me. Here's one of our favorite examples, a study into hacking neural networks. Relevant links: http://arxiv.org/pdf/1412.1897v4.pdf

Listen
Linear Digressions
Zipf's Law from 2015-12-31T18:08:17

Zipf's law is related to the statistics of how word usage is distributed. As it turns out, this is also strikingly reminiscent of how income is distributed, and populations of cities, and bug repo...

Listen
Linear Digressions
Indie Announcement from 2015-12-30T15:57:02

We've gone indie! Which shouldn't change anything about the podcast that you know and love, but we're super excited to keep bringing you Linear Digressions as a fully independent podcast. Some li...

Listen
Linear Digressions
Portrait Beauty from 2015-12-27T13:34:44

It's Da Vinci meets Skynet: what makes a portrait beautiful, according to a machine learning algorithm. Snap a selfie and give us a listen.

Listen
Linear Digressions
The Cocktail Party Problem from 2015-12-18T00:17:31

Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!

Listen
Linear Digressions
A Criminally Short Introduction to Semi-Supervised Learning from 2015-12-04T03:13:55

Because there are more interesting problems than there are labeled datasets, semi-supervised learning provides a framework for getting feedback from the environment as a proxy for labels of what's ...

Listen
Linear Digressions
Thresholdout: Down with Overfitting from 2015-11-27T17:55:04

Overfitting to your training data can be avoided by evaluating your machine learning algorithm on a holdout test dataset, but what about overfitting to the test data? Turns out it can be done, eas...

Listen
Linear Digressions
The State of Data Science from 2015-11-10T04:36:40

How many data scientists are there, where do they live, where do they work, what kind of tools do they use, and how do they describe themselves? RJMetrics wanted to know the answers to these quest...

Listen
Linear Digressions
Data Science for Making the World a Better Place from 2015-11-06T03:43:25

There's a good chance that great data science is going on close to you, and that it's going toward making your city, state, country, and planet a better place. Not all the data science questions b...

Listen
Linear Digressions
Kalman Runners from 2015-10-29T03:10:02

The Kalman Filter is an algorithm for taking noisy measurements of dynamic systems and using them to get a better idea of the underlying dynamics than you could get from a simple extrapolation. If...

Listen
Linear Digressions
Neural Net Inception from 2015-10-23T02:25:48

When you sleep, the neural pathways in your brain take the "white noise" of your resting brain, mix in your experiences and imagination, and the result is dreams (that is a highly unscientific expl...

Listen
Linear Digressions
Benford's Law from 2015-10-16T03:30:43

Sometimes numbers are... weird. Benford's Law is a favorite example of this for us--it's a law that governs the distribution of the first digit in certain types of numbers. As it turns out, if yo...

Listen
Linear Digressions
Guinness from 2015-10-07T03:30:33

Not to oversell it, but Student's t-test has got to have the most interesting history of any statistical test. Which is saying a lot, right? Add some boozy statistical trivia to your arsenal ...

Listen
Linear Digressions
PFun with P Values from 2015-09-02T03:24:36

Doing some science, and want to know if you might have found something? Or maybe you've just accomplished the scientific equivalent of going fishing and reeling in an old boot? Frequentist p-valu...

Listen
Linear Digressions
Watson from 2015-08-25T02:26:20

This machine learning algorithm beat the human champions at Jeopardy. What is... Watson?

Listen
Linear Digressions
Bayesian Psychics from 2015-08-18T00:05:04

Come get a little "out there" with us this week, as we use a meta-study of extrasensory perception (or ESP, often used in the same sentence as "psychics") to chat about Bayesian vs. frequentist sta...

Listen
Linear Digressions
Troll Detection from 2015-08-07T20:56:36

Ever found yourself wasting time reading online comments from trolls? Of course you have; we've all been there (it's 4 AM but I can't turn off the computer and go to sleep--someone on the internet...

Listen
Linear Digressions
Yiddish Translation from 2015-08-03T03:06:39

Imagine a language that is mostly spoken rather than written, contains many words in other languages, and has relatively little written overlap with English. Now imagine writing a machine-learning...

Listen
Linear Digressions
Modeling Particles in Atomic Bombs from 2015-07-06T23:30:15

In a fun historical journey, Katie and Ben explore the history of the Manhattan Project, discuss the difficulties in modeling particle movement in atomic bombs with only punch-card computers and in...

Listen
Linear Digressions
Random Number Generation from 2015-06-19T18:49:55

Let's talk about randomness! Although randomness is pervasive throughout the natural world, it's surprisingly difficult to generate random numbers. And even if your numbers look random (but actuall...

Listen
Linear Digressions
Electoral Insights (Part 2) from 2015-06-09T02:46:17

Following up on our last episode about how experiments can be performed in political science, now we explore a high-profile case of an experiment gone wrong. An extremely high-profile paper that ...

Listen
Linear Digressions
Electoral Insights (Part 1) from 2015-06-05T20:38

The first of our two-parter discussing the recent electoral data fraud case. The results of the study in question were covered widely, including by This American Life (who later had to issue a retr...

Listen
Linear Digressions
Falsifying Data from 2015-06-01T21:04:10

In the first of a few episodes on fraud in election research, we’ll take a look at a case study from a previous Presidential election, where polling results were faked. What are some telltale si...

Listen
Linear Digressions
Reporter Bot from 2015-05-20T23:16:18

There’s a big difference between a table of numbers or statistics, and the underlying story that a human might tell about how those numbers were generated. Think about a baseball game—the game st...

Listen
Linear Digressions
Careers in Data Science from 2015-05-16T05:43:44

Let’s talk money. As a “hot” career right now, data science can pay pretty well. But for an individual person matched with a specific job or industry, how much should someone expect to make? Sinc...

Listen
Linear Digressions
That's "Dr Katie" to You from 2015-05-14T17:37:48

Katie successfully defended her thesis! We celebrate her return, and talk a bit about what getting a PhD in Physics is like.

Listen
Linear Digressions
Neural Nets (Part 2) from 2015-05-11T14:37:51

In the last episode, we zipped through neural nets and got a quick idea of how they work and why they can be so powerful. Here’s the real payoff of that work: In this episode, we’ll talk about a b...

Listen
Linear Digressions
Neural Nets (Part 1) from 2015-05-01T18:59:28

There is no known learning algorithm that is more flexible and powerful than the human brain. That's quite inspirational, if you think about it--to level up machine learning, maybe we should be goi...

Listen
Linear Digressions
Inferring Authorship (Part 2) from 2015-04-28T16:56:24

Now that we’re up to speed on the classic author ID problem (who wrote the unsigned Federalist Papers?), we move on to a couple more contemporary examples. First, J.K. Rowling was famously outed u...

Listen
Linear Digressions
Inferring Authorship (Part 1) from 2015-04-16T17:25:21

This episode is inspired by one of our projects for Intro to Machine Learning: given a writing sample, can you use machine learning to identify who wrote it? Turns out that the answer is yes, a per...

Listen
Linear Digressions
Statistical Mistakes and the Challenger Disaster from 2015-04-06T19:36:56

After the Challenger exploded in 1986, killing all 7 astronauts aboard, an investigation into the cause was immediately launched. In the cold temperatures the night before the launch, the o-rings...

Listen
Linear Digressions
Genetics and Um Detection (HMM Part 2) from 2015-03-25T17:29:32

In part two of our series on Hidden Markov Models (HMMs), we talk to Katie and special guest Francesco about more useful and novel applications of HMMs. We revisit Katie's "Um Detector," and hear a...

Listen
Linear Digressions
Introducing Hidden Markov Models (HMM Part 1) from 2015-03-24T15:57:03

Wikipedia says, "A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states." What does that even ...

Listen
Linear Digressions
Monte Carlo For Physicists from 2015-03-12T23:18:01

This is another physics-centered podcast, about an ML-backed particle identification tool that we use to figure out what kind of particle caused a particular blob in the detector. But in this case,...

Listen
Linear Digressions
Random Kanye from 2015-03-04T23:04:45

Ever feel like you could randomly assemble words from a certain vocabulary and make semi-coherent Kanye West lyrics? Or technical documentation, imitations of local newscasters, your politically ou...

Listen
Linear Digressions
Lie Detectors from 2015-02-25T18:20:51

Often machine learning discussions center around algorithms, or features, or datasets--this one centers around interpretation, and ethics. Suppose you could use a technology like fMRI to see what...

Listen
Linear Digressions
The Enron Dataset from 2015-02-09T00:00

In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other markets. By 2002, it was apparent that many bad app...

Listen
Linear Digressions
Labels and Where To Find Them from 2015-02-04T02:30:47

Supervised classification is built on the backs of labeled datasets, but a good set of labels can be hard to find. Great data is everywhere, but the corresponding labels can sometimes be really tr...

Listen
Linear Digressions
Um Detector 1 from 2015-01-23T20:16:12

So, um... what about machine learning for audio applications? In the course of starting this podcast, we've edited out a lot of "um"'s from our raw audio files. It's gotten now to the point that,...

Listen
Linear Digressions
Better Facial Recognition with Fisherfaces from 2015-01-07T01:33:50

Now that we know about eigenfaces (if you don't, listen to the previous episode), let's talk about how it breaks down. Variations that are trivial to humans when identifying faces can really mess...

Listen
Linear Digressions
Facial Recognition with Eigenfaces from 2015-01-07T01:30:40

A true classic topic in ML: Facial recognition is very high-dimensional, meaning that each picture can have millions of pixels, each of which can be a single feature. It's computationally expensive...

Listen
Linear Digressions
Stats of World Series Streaks from 2014-12-17T00:41:39

Baseball is characterized by a high level of equality between teams; even the best teams might only have 55% win percentages (contrast this with college football, where teams go undefeated pretty r...

Listen
Linear Digressions
Computers Try to Tell Jokes from 2014-11-26T18:59:56

Computers are capable of many impressive feats, but making you laugh is usually not one of them. Or could it be? This episode will talk about a custom-built machine learning algorithm that searches...

Listen
Linear Digressions
How Outliers Helped Defeat Cholera from 2014-11-22T00:00

In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a common but deadly disease: cholera. When a cholera...

Listen
Linear Digressions
Hunting for the Higgs from 2014-11-16T00:00

Machine learning and particle physics go together like peanut butter and jelly--but this is a relatively new development. For many decades, physicists looked through their fairly large datasets ...

Listen