ACM-IMS Interdisciplinary Summit on the Foundations of Data Science

– [Jeanette] Good morning
everyone, welcome to the inaugural ACM-IMS Interdisciplinary Summit on the Foundations of Data Science. David Madigan and I are
pleased you are able to attend what we hope will be
not just the beginning of a new collaboration
between computer science and statistics, but also the beginning of a premier event
focused on the foundations of data science. Data science is the study of
extracting value from data. The techniques we use
for extracting that value rely on computational
and statistical thinking. The explosion of interest
in data science recently is due to the advances in algorithms, especially machine learning,
enormous computational power, especially clusters of CPUs and GPUs, the voluminous amounts of data
generated and collected, feeding data-hungry
algorithms, and the rapid pace of technology transition of
ideas from lab to market. With efficient algorithms,
big compute, and big data, using statistical models to do prediction, exploration, inference, and reasoning about uncertainty now becomes
tractable and scalable. The value gained from this
data extraction accrues to the domain expert. A scientist gets value by
discovering new knowledge. Is that a new exoplanet I see? A mayor gets value by
data-driven decisions that can affect the
community’s livelihood. Given sea level rise predictions, how should I regulate housing
in coastal neighborhoods? A CEO gets value that can be measured by the company’s bottom line. How much revenue did I get
from users clicking on ads? Thus, to work as a data
scientist is not only to have a command of
the foundational methods, for example from computer science
and statistics, but also to work in the context of a domain. Today, we see data scientists
working with domain experts. Tomorrow, we will see domain experts imbued with the knowledge
of data science methods. Through our event today, we will hear about how far we’ve come
in the state of the art of data science methods,
specifically deep learning and reinforcement learning,
and their applications in science, healthcare, and more; what properties, such as
robustness and stability, we seek of these methods;
and societal implications regarding fairness, privacy and ethics in the use of these
methods as they automate decision making about people. We will close with a
discussion, including with you in the audience, on
the future of data science. We have what I think is a
dynamite set of keynote speakers, panelists and panel moderators. So now for the thanks. I’d like to thank ACM, in
particular Vicki Hanson and Pat Ryan, and IMS, in
particular Xiao-Li Meng, for instigating and
orchestrating this collaboration; our steering committee, Magdalena
Balazinska, Joey Gonzalez, Chris Holmes, Yannis
Ioannidis, Ryan Tibshirani and Daniela Witten; Google and Red Hat, for their sponsorship; and NSF, in particular
MPS Math DD Juan Meza and PD Nandini Kannan, CISE CCF DD Rance Cleaveland, and CISE IIS DD Henry Kautz, for their support for travel grants for some of our participants. And finally, the National Academies, which, while not a sponsor, has
expressed enthusiastic support for this event and its
future incarnations. Before I close, I would like to announce that both ACM and IMS intend
that this event be the start of a longer-term partnership to support this burgeoning field of data science. Also, while today’s event focuses primarily on computer
science and statistics, I want to acknowledge that the
foundations of data science also draw on other fields, for example, signal processing
from electrical engineering, optimization from operations research, analysis from applied math, and more. David and I expect that future
events on the foundations of data science will
reach out to these fields. So thank you, and please enjoy the day. One housekeeping announcement,
the wireless password is ACM 2019. It’s also the password
you can use to access the live streaming of this event. To get to that you have to
go to the event website. And I assure you, you’re all smart enough to do that on your own. So now let me introduce David Madigan, who is going to introduce
the first keynote speaker. David Madigan is a colleague
of mine at Columbia University. He used to be the executive vice president of Arts and Sciences. He’s a professor at Columbia. He’s a professor of statistics. He’s very well known for a project on observational health
data, called OHDSI, that’s coordinated by Columbia. And he’s a good colleague
and a good friend. Thank you very much. (audience clapping)
– [David] Thank you. Thank you, Jeanette. Just very briefly, to add my welcome. We’re excited this day is here. We hope this is the first
of a series of events. And it’s my great pleasure to introduce the first keynote speaker. We
have three keynote speakers, and then a set of panels today; I think it should be an exciting day. The first speaker is
Professor Emmanuel Candes. He’s a professor of
mathematics and statistics and electrical engineering at Stanford. And he has won countless awards. I’ll just mention that he’s a
member of the National Academy and, a couple of years ago, became a MacArthur Fellow. And please join me in
welcoming Professor Candes. (audience clapping)
– [Emmanuel] You can close this. It works, all right,
can you see my slides? Okay, not quite. Anyway,
thank you very much, David, for the kind introduction. It is a real privilege to
take part in this event, whose principle is very close to my heart. I’m not saying this lightly. At Stanford, I’m currently co-leading, with Jure Leskovec, a computer scientist, an initiative in data science, which I hope will be major. And that will bring
together computer scientists and statisticians as we do today, but also many other scientists,
and domain engineers. And so this is the reason
I’m happy to be here. Because I want to learn
about your perspective on data science, so that we
do a better job at Stanford. So as a preamble, I’d
like to observe that, over the centuries, we’ve
certainly learned to build extremely powerful
mechanistic models, able to make astoundingly
precise predictions, and our everyday life
depends on these models. And the validity of these models has been tested again and again. And I really regard this
as the crowning achievement of the human spirit. And I really like this
quote from Richard Feynman, who said that “10,000
years from now, let’s say, what we will remember
of the 19th century is not the Civil War,
but it’s the discoveries of the laws of electromagnetism.” And I certainly subscribe
to this point of view. Now, perhaps equally exciting,
although less noted at the time, was the fact that we
discovered non-mechanistic ways of making useful predictions. And Galton, for example,
in the late 19th century, introduced the idea of regression
to predict a quantitative outcome from measured features. And a few years later, the
gentleman you see on the right, Sir R.A. Fisher, introduced
the idea of classification, that is, to predict a class
label from measured features, and you see the famous drawing from his Iris species
classification problem. Now, if we fast forward
to today, of course, we’ve made tremendous
progress in building up powerful non-mechanistic
predictive models. I would like to single
out two contributions especially, the first around random forests, gradient boosting, XGBoost. And I’d like to single out
the work of Leo Breiman, and of Jerry Friedman,
my colleague at Stanford. And of course, today we
celebrate deep learning models, you’ll easily recognize
these three gentlemen, and they will later tonight
receive the Turing Award for their contributions. And so these models are
very, very powerful. At the same time, if
you go around campuses, especially if you go around the life sciences, there’s a bit of a tension. You know, these models
are extremely complex. They’re not easily interpretable, and people are
uncomfortable with this. But I think this level of
discomfort will ease with time, and I think it has already started easing, because, you know, the opportunities
that these models offer are so enormous that people
want to use them to improve their science and their business. Now, having said this, there are two things we can absolutely not surrender. It’s the replicability of science and of scientific findings. And by that I mean that the scientific conclusions I draw today need to hold up tomorrow. And if I don’t do this, then I
don’t know what science is. And also the reliability: that is, we are using these models to make decisions, and we’ll have a panel
about this, in environments that are more and more critical.
It’s extremely important to understand the validity
of the predictions, to understand their uncertainty,
and to be able to communicate this uncertainty effectively, and that’s what I’ll try to talk about. So we’re gonna see these two
things in turn, replicability and reliability, in
the era of extremely modern and complex machine learning systems. Okay, so what can statistics offer? Well, one thing we could do is to take these black boxes, let’s say a deep learning model with I don’t know how
many hundreds of layers, and try to analyze it. I don’t know if you can do it; I can’t. And so what we’re gonna do
instead in this lecture is we’re gonna take a
completely different route. We’re gonna want to use these
sophisticated algorithms. Now I’m gonna call them black boxes; I’m not meaning it pejoratively. I’m just calling them black boxes because they’re hard to analyze. And we’re gonna try to build
around these algorithms new statistical methods that
will leverage their power, but at the same time guarantee
replicability and reliability of these systems. And we’re gonna build wrappers
so that we can make valid inference, so that we can draw
valid conclusions from data. Okay, so this is the agenda for the talk. And so we’re going to
start with replicability. I’m sure you know, we have a problem. There is a gigantic replicability crisis, which feeds the media on
almost a weekly basis. If you see the drawing
on the bottom, you have the cover of The Economist, you also have an article
that Andrew Gelman, who will speak later, wrote in The New York Times in November. And so just to give you a
flavor of this, the replicability crisis is not only touching, you know, fields like psychology, but it’s touching the life sciences. Amgen, which is a
biotech company located in Thousand Oaks, California: what they did is they took 53 studies that they considered to
be absolutely landmark in basic cancer science
and they tried to replicate the findings, recruit new people, try to replicate the findings. The report of Begley and
Ellis in Nature 2012, which triggered an enormous
amount of attention, is that they could only
replicate six of these 53 studies. Bayer HealthCare, which
is a German equivalent, took 67 studies, perhaps
some of them are the same, and could barely replicate
more than a quarter. And so we have an enormous
replicability problem, currently, that we need to address. Now, why is this happening? I think it’s happening because, if you try to
follow me for a minute, I think we reversed the
scientific paradigm. It used to be, in the day
of the great R.A. Fisher, that you advanced a theory, the theory makes predictions,
and then you test the validity of the predictions against carefully collected data. But now data comes before everything else; we have reversed the first two steps. We have enormous data
sets that are available prior to the formulation
of scientific hypotheses, and people look everywhere. And so there’s nothing wrong with this, but we have to remember statistics if we want to account for
this fact that now data sets come first and we ask questions later. And so what I’m going to
talk about is new tools to address some of these problems, right to address the new
scientific paradigm that we’re in. So an example of a data-driven
scientific problem: if you go to a life science department, you know, we have phenotypes, for example, whether I have Crohn’s disease or not. I have
genetic information about individuals; we can
measure your DNA variations along the genome at 300,
400, 500,000 locations. Later on, I’ll show you an analysis of the UK Biobank data set, where we have 600,000 variations
that have been measured. And the question that, of
course, everyone wants to address is which genetic variations
will affect your trait, for example, influence the risk of getting a disease. And I think in data science, a lot of questions are like this: you measure a lot of
things because you can, and at the end of the day, what matters? Now here we are outside
of predictive modeling, because I’m not, I don’t
care about predicting your disease status,
I can just observe it, I don’t care about predicting
your glucose level, I can just measure it, what I want to find is I want to find the factors
that drive this glucose level up so that I can do research
on them and not waste time. And so of course, I want to
be able to kind of identify those variations that cause or that make you more susceptible to a disease without making
too many false positives, because if I make too
many false positives, then I’m wasting everybody’s time. And that’s not what science is about. So to formalize the
selection problem a bit, we can imagine that we have individuals, and that’s actually the data sets we’re looking at:
they come with their X, for example, genotype information; they come with their Y, disease status. And we want to find, of
all these 500 mutations that we have measured,
which ones actually influence your disease status. And so to formalize the
problem a little bit, we’re gonna say that a
mutation is of interest to me, if it changes the probability
of getting the disease, even if I know the value of
the other mutations, right. So a mutation is interesting, if it provides information
about your disease status beyond what I already know. Okay, and so we’re gonna say that a mutation is not interesting if, when I know the value of the other mutations, it doesn’t change the odds
of getting the disease. For those statisticians in the room, and I know we’re not many today, it would be saying that, if you thought that this was a gigantic logistic regression model, I’m just interested in whether the regression coefficient is zero or not. Okay, for the computer
scientists in the room, I can imagine that I have a
gigantic graphical model. I have a node of interest Y,
which is your disease status. I’ve measured hundreds of
thousands of variables, but probably not all of them influence the outcome of interest. And I want to find the Markov blanket, like, which ones influence the response? Okay, and as I said, we’re in
the age of complex systems. And we want to use all the
good stuff I mentioned, to detect which variables
significantly drive the phenotype of interest. Okay, and now, what’s the problem? The problem is here: I have a model which relates the
covariates with a response. I use one of these black
boxes to actually measure the importance of these variables; presumably,
a high score indicates that you’re likely to
influence the response. And here in this example,
I have 500 covariates I measure, and we like to
think, in this example, that the points with the highest
scores are probably driving the
likelihood of the response. But what I did is I
just took the same model and generated new data, and
I got something like this. And I generate new
data; the model doesn’t change, I just randomly generate
data from the model and use the black box to recompute scores. And I get this; another time I get this; a fifth time I get
this. As you can see, you see tremendous variability, and I think I cannot say
it better than Yoav Benjamini, one of my stats heroes: that really, that’s the
problem of modern science. You see, I was trained in
statistics to separate signal from background noise. And now modern science is
not this; modern science is a problem of selecting, among many, the promising leads, when I have extremely noisy estimates about each one of them. So that’s a problem we care about. So, just as a way to look,
what do we do at Stanford to address this replicability issue? Well, we’re gonna try to build this wrapper around this black box. And so are people familiar
with what a SNP is? Okay, very good. So if
you don’t know, a SNP is just a genetic variation
somewhere on your genome. For each SNP that
is measured by the company 23andMe, for example, I’m
gonna create a fake SNP. So my laptop, the one you
see here, is gonna take a set of
features that have been measured about Emmanuel, and
is gonna create fake features. And the idea of the method
I’ll present is to run the scoring procedure,
using your deep learning, for example, on not only
the original features, but the original features
augmented with these fake features. And on this example, we see, for example, the number of features that
passed a certain score, let’s say one, or
whatever is indicated by the black line: well, there were 49 original features, or 49 SNPs, that passed this level. But I can also see when
I run the procedure, on the original feature,
and the fake ones, that a lot of fake variables,
things I’ve created on my computer that
cannot possibly influence my snip level, because they
were generated by tossing coins on my laptop, 24 pass the threshold. And what I’ll try to convince
you is that this immediately tells you that there are
probably 24 among the 49 that have no relevance whatsoever. And so if you were to report
the things that were above the black line, then probably
24 would be false discoveries, what we call false discoveries, and you’d have a very high
irreproducibility rate of 50%. Scientists following up on
your study would waste 50% of their time. So now, how are we gonna make these features? Well, I know we’re not many statisticians, but if you’re a statistician, you’d say, well, I’m gonna use permutations, you know, that’s going back to Fisher. I could think about, maybe
I’m gonna go close to my screen, you know,
the green guy is Emmanuel. I have his phenotype, I
have his genotype in X. And I could actually try
to think about creating fake features by bringing in the genotype of a random individual, the yellow guy, for example, Ryan, and use this as a control to calibrate the importance scores that the black box computes
for the green features. Alright, so that seems
like a reasonable option. This is called the permutation method. Lots of people use
this, not in this context, but in different contexts,
and the thing is, it really doesn’t work. That is, what you see here are the scores that the
black box gives to features. This is an engineered example. So I know what drives the response. And the sea saw points are
the scores the black box gives to features that have
no influence on the response. And the dark blue points
are the scores it gives to the permuted features. And you can see that there
is no way I can use this as a way to assess
what’s happening on the left. And so here, it seems
very important to reason statistically correctly;
this doesn’t work. But if we use the knockoffs, which we’re gonna talk about
in a little while, then I get new scores,
because I input carefully engineered fake features. And now the scores I get
seem to look like the scores I’ve got on the left, and maybe I can use this as a
sort of valid control group to use for calibration. Okay, that’s precisely
what we’re gonna do. So we’re going to create features. And these features have a
property and the property is that if you give me
a list of variables, x, so I’m just gonna show
you I’m gonna give you the key ingredients. So you see, you give me a
long list of features x, which at the bottom is indicated in blue. So for some reason, my
pointer doesn’t work. So it’s indicated in blue. My goal is to create a set
of fake features, in red, in such a way, and now we can almost play a Kaggle competition, in such a way that, if I were to take a true SNP and a fake SNP, and I were to swap them, and I were to give this data set to you and ask whether the data is swapped or not, you would not be able to tell.
Okay, this is the statistical construction that we need to apply. You give me a list of variables, I create a new list of variables, in such a way that if I take a feature and a fake and I swap
them, you cannot tell. That’s the recipe. Okay, so you cannot tell. And now, well, we’re gonna feed this thing into the black box. And now what happens is that
when the black box computes the scores, if the
feature has no influence on the response conditional on the others, then from the scores also you cannot tell. Right, and so this is indicated
on this plot. On the left, you see the scores that the black box gives to features that are conditionally
independent of the response. On the right, you see the knockoff scores. Some scores are
high, but you see that the fake scores are equally as high. And so you can use that. Okay, you can use this as a calibration. And now how are we going to calibrate? Well, now the black box scores stuff I know is fake and stuff I
don’t know whether it’s relevant or irrelevant,
and I start to put a threshold, and note that how
I compute these scores is totally irrelevant. You can use deep learning,
whatever you want to do this. And I’m just going to try
to adapt the threshold until I get to a place
where I know that the fraction of false discoveries is under control, by carefully observing what’s
happening on the right side versus what’s happening on the left side. Okay, and if you do this,
it’s a simple procedure. And if you do this, well, what the theory says is that yes, I can take any black box, compute any feature
statistics I like, and then I can choose the threshold such that what I
call the false discovery rate, the rate of irreproducibility,
is controlled at the 10% level or anything like that. And that’s a theorem. That’s a mathematical fact, that
if you apply this procedure, no matter how Y depends on X, no matter what, the false
discovery rate will be controlled. Okay, for any black box,
so you see this idea of this wrapper. Okay, so we’ve engaged in,
we have created a little knockoff factory at Stanford. And it’s a cool topic. And if you’d like to know more about this, Stephen Bates, who made a major
contribution, is in the room. You give me a bunch of features, I need to create fake features such that when I take any of them and I exchange them, I cannot
tell which is which. And I think
it’s fair to say, thanks to the work of Stephen Bates,
and my former student Lucas Janson, and
Wenshuo Wang, a student, for any distribution,
we know how to do this. I consider this a solved problem. And if you want to play
with this, we have a website and code that you can go visit to learn how to make knockoffs. Now we do all of this because
we want to do science. And so we spend an enormous
fraction of our time actually analyzing genetic data. I can only give you a snapshot. And the snapshot is this: we have this UK Biobank data. Now we’re running these
algorithms on 600,000 features, 600,000 mutations, on 350,000
subjects of British descent, as collected in the UK Biobank,
and what the method does is it discovers a lot of interesting stuff. And we’ve tested the method:
does it discover things that are reproducible? So we apply the method on smaller studies with the same analysis. Yes, when it finds discoveries, they tend to be replicated
in bigger studies and so on. We did a lot of meta-analysis, which means that, for the findings we find, we’re extremely confident
that they’re actually real; of course 10%, that’s the rule
of the game, might be false. Okay, and we can actually
pinpoint locations on the genome where we think that very
interesting things happen. And so we have created data science tools where you can scan the genome. And for lots of phenotypes
you would be able to pinpoint locations of genetic
variation on the genome where we think that something extremely interesting is happening. And again, you don’t have
to take my word for it: we have a website, code, where you can use these
ideas on your own data sets. All right, I want to make a
plug for deep learning once more. And that’s, you know,
to generate knockoffs, I need to know the distribution of X, so that I can build this exchangeability. And in some problems, in many problems, actually, I will not know
what it is exactly. And so what we are
thinking of is to repurpose these wonderful tools
that you guys have created to actually generate knockoffs. And this is a bit crazy,
but it works extremely well. So we’re gonna train a neural
net to generate knockoffs. So if you think, I mean, in this room, people are literate,
you’ve heard about GANs, and with GANs you train a
neural net to make fake images. Here, we’re gonna train a
neural net to make fake SNPs. And of course, it’s different, because I’ve seen images;
I’ve never seen a knockoff. So there’s a big difference here, but bear with me for a second. So I’m gonna train a neural net, this is gonna generate knockoffs, and then we’re gonna play this game where there’s gonna be a
metric that’s gonna measure how well you can
tell which is which. And you’re gonna have a score, you’re gonna have a loss function: if I swap a feature, is
it easy for you to tell which is the true variable and which is the fake variable? And then we’re gonna train
this neural net to improve knockoff generation
until it’s becoming very hard to tell which is which. And that works very well. And so we’re very happy to
use deep learning models to create fake features. And of course, we got this
inspiration from great work in CS about the use of
generative adversarial networks to create fake images; here, we just create fake SNPs
that are indistinguishable from true SNPs, so that we
can go through our program. Okay, so this, and if you
want to build deep knockoffs, then there’s also software for this. Alright, I’m going to
move to the last part of the talk, which has to do with: can we make reliable predictions? How do I communicate uncertainty? So I’ve been in the
business for a little bit. I worked on the Netflix challenge; this was machine learning 15 years ago: predict whether I’m going
to like a movie or not. Honestly, if you get it
wrong, it doesn’t matter. But now it matters, because now, as we’ll hear later on today, we use machine learning to
decide who gets into college, who gets parole, who gets bail. And so if I have a
machine learning system assisting Stanford’s admissions office, there are many ways you can
get into Stanford, as you know. (laughing) Now I’m gonna be in
trouble with my Provost, he told me, you know, anyway. And then the machine
learning algorithm says that I can predict your GPA
will be 3.62 after two years of college. Well, what do I do with
this? 3.62 plus or minus what? I really need reliable systems,
but at the same time, I want to use this black box that is not easily analyzed. I need to be able to
communicate uncertainty, right? And so one way to think
about this is as follows. I’ve got a bunch of data sets,
a bunch of training samples X1, Y1; Y is the variable
I’m trying to predict. I have a new test point,
that’s a new college applicant. I see her features: what school she went
to, her high school GPA, ACT scores, extracurricular activities, whatever you want to throw in your model. And I have to predict GPA. But now I’m gonna ask you to do more: I’m gonna ask you to predict a range. And so what I’m asking the
machine learning system now to do is to predict an interval such that the GPA falls in
your range 90% of the time. I say 90, it could be 95,
whatever number you want, it’s not relevant. So I want to build a system that does the following:
based on the candidate’s high school,
high school GPA, ACT score, so many features, the GPA is predicted to
fall in the range 3.4 to 3.9. And if you were to
deploy this in the world, it would be 90% accurate, 90% correct. And that’s a guarantee
you have to give me, okay. And of course, I want to use these models, because maybe with these
models, I can get short ranges that are informative, because
they can predict well. But can you actually compute
the standard deviations of a neural net estimate? It’s gonna be quite tricky, right. So we’re gonna use ideas
from a computer scientist named Vladimir Vovk, who introduced the field
of conformal inference. In this lecture, I’ll give
you some new developments, but I want to acknowledge that there’s a field called conformal inference that we heavily rely upon. So suppose that, well,
I was in a perfect world where I could observe both, where I knew the distribution of Y and X; everything is revealed to me. So I know the joint
distribution of Y and X. And so a person comes with his
X, let’s say X equals three. I know the joint distribution, so I know the conditional
distribution of Y given X, and I can therefore
calculate the quantiles. And I can do this for every X. And therefore, no matter what, I can find the
interval I’m looking for. And in some sense, it’s the best. But that assumes that I
know the distribution, which is assuming away the
machine learning problem. That’s assuming that
machine learning is solved. Instead, I just have this. So what do we do? Well, we’re gonna try to
estimate quantiles from data, and we’re gonna use a tool
called quantile regression. For those of you who do a
bit of machine learning, I’m going to basically fit quantiles by minimizing the pinball loss. And I’m gonna get quantiles out of this, on some training sample. Would this work? No. Imagine training a neural
net: you get zero residuals. It’s not gonna be very good, and in fact, if I give you a fresh sample, the quantiles you would
get using this method would dramatically undercover. You’d say, oh, the range
for this group of students is this, and in fact,
it is completely false. Okay, so we’re gonna
adapt this a tiny bit. And we’re gonna build a wrapper around my quantile regression estimates, for which I can use
XGBoost, deep learning, whatever; again, the idea of building a wrapper. And now you give me some samples, I’m gonna divide them into a training set and a calibration set. On the training set, we’re gonna
run quantile regression, what we used to do, but
now I’m gonna see how well I’m doing on the calibration set. And now watch, because
it is extremely simple. I was not sure I should present
this because it’s so simple. And what I’m saying is, okay,
the green points are calibration points, and a point can fall within or outside of the range. If you’re within the range, you’re
gonna get a negative score; if you’re outside of the range, you’re
gonna get a positive score. Now, what are the scores? I’m gonna look at the signed
distance to the boundary, how far you are from the end point of the range; again, positive if you
are outside, negative if you’re inside. So I
get a bunch of numbers. Let’s say that I have a
calibration set of 200. I have 200 numbers. I
take the 90th percentile of these 200 numbers, that
is, the 180th value in increasing order. And now watch: what I do is
I adjust my confidence band: whatever was calculated before, plus or minus
this quantile. That’s it. That’s extremely simple. In particular, if I were to
do a good job to start with, then if you follow what
I’m saying, the quantile would be exactly at zero,
so I would not touch it. If I’m conservative, then the
quantile would be negative, I would actually shrink it;
if I’m anti-conservative, I would extend it. Okay, and now how does this work? I give you a fresh sample:
you want 90%, this is 90.09. Okay, so I know I have to stop. So this is a result. This is a fact; it’s, in fact, again, another mathematical truth,
that this very simple method can give you the precision you want. Okay, and that works no
matter how the data is distributed, for any sample
size, any black box algorithm you might want to use
to fit the quantiles. Okay, we’ve applied this
to medical services, when we need to predict
medical expenditures from lots of
socioeconomic status variables, trying to use how many
visits to the doctor you’re gonna make, hospitals, and so on. And this method is unbelievably
precise and accurate: you want 90%, you get exactly 90%; you want 95%, you get exactly 95%. Okay, so and again, we have
code and stuff you can use. So my time is up, but I
thought that, and we’ll hear a lot about it today, since we’re living in a world where
machine learning is applied to things other than Netflix, to critical things
that support human decisions that are extremely important, it’s important to make it trusted. So I’ve discussed two things. One is
to build a wrapper around the algorithm we want to use, so that we do not run into the
problem of reproducibility. And of course, this is dramatically
connected to causality. Of course, if I knew how to solve the causal inference problem, then the reproducibility problem
would go away. But there’s an arrow the other way. In fact, I can show that
sometimes controlling for the false discovery
rate will actually control for the false non causal rate. And I think it’s extremely
important to be able to assess uncertainty
when we make predictions, especially when a student
comes up and has been admitted to college. I’ve shown you
a method that would give you valid coverage no matter which group you belong to, no matter what protected
characteristics you have. And I think it has to
inform fairness, obviously. And with this, I’d like to
thank you for your attention. (audience clapping) – [David] So in the interest
of time, could the panelists for
the first panel please go to the stage, and while we’re (mumbles), take one question for Professor
Candès over here. Yes. – I have a question about the black box. So now you have reproducibility
and reliability, if you have you… Do you have a favorite black box? – I have what? – A favorite black box? (mumbles) – Now you have to measure. – So we can have an example we apply– (mumbles) – So the question is now
that we have a way of making the black box reliable, do
I have a favorite black box? Not really. But in a lot of the examples
we’ve tried, on UCI data, in genetics and so on, I found that, for example, for
UCI data, XGBoost is a good black box to use if I want to predict medical expenses, for example, from various features. For genetics, it seems like
genetics is fairly linear, like somehow features
contribute in a linear way. I found that regularized
logistic regression, the lasso, pioneered by Robert Tibshirani, sitting in the second row, is very useful. And we use a lot of different black boxes. Okay, rather, yeah. – [Joseph] Okay, thank you very much. (audience clapping) All right, so I guess black
boxes are a great segue into the next session. So this session is an open panel. We’re going to be talking
about deep learning, reinforcement learning,
and the role of methods in data science. It’s a pretty broad topic. Rather than me introducing
all the panelists and listing their accolades,
I’ve asked each to give
a short introduction: your name and what you work on. – Hi everyone, I’m Shirley Ho. I’m a group leader of
Cosmology X Data Science at the new institute, the Flatiron
Institute, which is 100%
funded by the Simons Foundation, and I also have a joint appointment
at Princeton University. – Hi, I’m Sham Kakade. I’m a professor at the
University of Washington in computer science and statistics. I work on machine learning
and artificial intelligence, with a particular focus on
the mathematical foundations and a focus on reinforcement learning. – Hi, I’m Suchi Saria. I’m a
professor at Johns Hopkins; I direct the Machine Learning
and Healthcare Lab there. And my work is focused on two fronts. One, as more and more
data is becoming available in healthcare, how
are healthcare delivery models going to change? That means how is the
practice of healthcare going to change? And what that means is
what are some new tools,
new pipelines, we will need in
order to be able to change the way they practice, to
bring data to the front lines. And then the second
aspect of it, very related
to it, is what fundamental
methodologies we need
in order to be able to
make machine learning and data science available
at the front lines in a safety-critical domain
like healthcare or education. Along those lines, we do a
lot of work on, you know,
taking messy, biased data sets
and working at the intersection of causal inference and machine
learning to produce exactly the kinds of things
Emmanuel was talking about, which is safe and reliable ML. – Hi, I’m Manuela Veloso. I’m currently the head of AI research, a new effort at
JPMorgan Chase. And I’m also on leave from
Carnegie Mellon University, where I’ve been a professor
for the last 32 years. So I am in this transition from academia to understanding the corporate world. And basically, it has been a discovery of the complexity of the real world, in particular the financial domain. I’ll tell you more about it. – Excellent, I should introduce myself. So my name is Joseph Gonzalez. I’m an assistant professor at UC Berkeley. I do research at the intersection of machine learning and systems. So the format of today’s
panel is I’m going to ask a sequence of questions. There are open microphones in the room. So if you want to add or
elaborate on a discussion, please feel free to join in. I will start with a few
fairly high level questions. Since this is the first panel of the day, I thought it’d be fun
to define data science. So tell me a little about what
data science means to you. How does it affect education and
research in your area and your discipline? – Yeah, let me just start first by thanking, you know, Joseph for actually
organizing this panel. I’d like to paraphrase what
I learned from an IMS presidential
address a few years ago
on what data science is. It said it is the
study of the generalizable extraction of knowledge from data. And for me, this includes the methodology and also the tools that you actually need to extract the knowledge; the domain knowledge that
not only helps you discover the knowledge that
furthers our understanding of the whole world around you, but also maybe helps produce
the data that’s required; and finally, also the infrastructure, software and hardware, necessary
to make this possible. What’s more,
I think we need to communicate the knowledge out to the relevant community. And I think that’s a really, for me, succinct
definition of data science. – So data science. I think
I’ll just start with a joke.
I think I’ve heard a data
scientist referred to as someone
who knows statistics better
than a computer scientist and computer science
better than a statistician. (laughing) Which, you know, I think, to a large extent,
the field came about because while we had these
different disciplines, there’s a group of people that need skills in a variety of different areas. And, you know, there’s
a functional definition,
which is, it seems like
there’s a group of people who should have knowledge so they can work with
domain-specific experts in a variety of different areas, ranging from, say, bio and health
to various technology areas. And, you know, I think
on the educational side,
there’s just been
tremendous demand, and machine learning is one of the core areas, so it feels like everyone
needs to have foundations in ML to do it. – I agree with Sham and Shirley; I’ll just add one comment here. One of my observations from
the last few years has been,
you know, if you look at it fundamentally,
the excitement in the last
10 years in this field has partly stemmed from achievements like, you
know, autonomous driving or web search: in effect, massive engineering tools that deeply utilize data
in order to provide value. And the thing that’s interesting
in order to provide value. And the thing that’s interesting
about that is I think, if you look at a lot of
academic data science work, or industry, data science
work, a lot of it is around taking large amounts of data and doing discovery exploration. But I think a component
that is very fundamental, which the field desperately needs is thinking about
principles of engineering and reliability engineering. So if you look at other fields,
like nuclear engineering or civil engineering, where
you’re looking at safety-critical
disciplines, there’s
sort of a very clear notion of, like, if you’re going
to deploy something, or if you put something
out there, is it working? Does it work the way
you expect it to work? And I think as data scientists, we’ve been given a lot of
responsibility and freedom to go in very important
areas and start influencing and informing decision making. But that engineering principle,
or being able to guarantee that we understand what we’ve built, and we can stand behind
it, and we’re not going to, you know, make catastrophic
decisions, is very key. And I think, at the
moment, it’s sort of missing. So just
sort of to add to it… If
you had to broaden out data science, I see
this as a fundamental component of computer science, computer
engineering and engineering at large, that needs to be
brought into data science and something I’m hoping will
become part of the definition. – So I don’t know what data science is. But I do know one thing that very much takes away my sleep at night,
which is the following. Throughout my life,
I’ve been working on AI, and AI is this kind of, like, fascinating research goal, to try to really
make machines that are able to get as input some kind of data, make decisions, and actually actuate. So there is this kind of
loop of perception, cognition, and action, or data, decision making, and actuation. So there is this thing,
so I see data science, as providing the substrate
to actually being able to close this loop between
the data, the decision making, and the actuation. Let me just share with you
something about this data science problem, which is something
that sometimes we, in our engineering way of
thinking about the problem, may forget. The challenge, for
me, from data science,
when combined with decision
making and actuation, is that this data is generated from the most diverse sources in the world, with different frequencies, different ranges,
different certainty, different everything. So data is not a unique
concept; it is this tremendously, how can I say, messy kind of concept. So it’s worse than languages. With languages, even,
it’s a finite number;
even if it’s
very large, it’s a finite number.
Data is completely
unbounded in its definition, from a scientific point of view or from an engineering point of view. So I really think that
is the challenge here:
that we have to still
use such information
to equip a computer with
ways to support humans with decisions, to support
humans with predictions,
to eventually even take
into their own hands making decisions based on data. So that’s why I think that
this data science field is complex, or this data processing, this data use, is complex: it is because of this horrible, and at the same time amazing, opportunity of this unbounded messiness of data, which has a lot of information,
a lot of knowledge,
but it’s kind of, how do you
bound this problem in general? So that’s what I believe
data science is about. – Actually, so touching on
this discussion of action, something we come across
in systems is thinking about data science as a science or an engineering discipline. And as we start to use data
science and learning techniques to manipulate the physical
world, to make decisions for patients, to alter how we take… or how we approach scientific problems, what are the big concerns we should start to think about? – Well, I don’t know. But in some sense... So I think, just to finish these thoughts, I think that ideally, what
we would like to have,
be it in any data set, be it health-based, finance-based,
transactions, any data, we would love to
have some kind of description of the data that is clean
and all kind of uniform, that we could feed into a neural net or any machine learning algorithm. And then it would work. I’m amazed by how hard it is. I wrote this simple white paper called “The Curse of Exceptions”: it’s not about
dimensionality, it’s exceptions. There is not a single case I talked about with the firm where they don’t say, “Oh, this is how transactions
are supposed to be recorded. But those transactions of this nature, which is maybe 1% of the
whole thing, are different.” And then we say okay, “but
something else is whatever.” And I cannot stand any more buts, you know. (laughing) And so in some sense, I keep saying, why can’t we? How are we gonna get all these buts
to be part of it, not being
pruned out as outliers, not being considered noise?
They are buts; from the
point of view of humans, they’re not just noise.
So this is a little bit of
a difficult problem for me, from an engineering point of view, how do you represent them? And from a science point of view, you know, what, sometimes
these but things, these, like, whatever, are the ones that have more information, are the ones that you really
don’t want to, to miss. So, there’s the science of these outliers, which we all know, right? And there is also this
problem of the engineering of not missing it, anyway. – Correct. So we definitely teach this
in the data science class:
how to deal with dirty data. And it’s funny because
it ends up dominating a lot of the discussion in that class. And then Suchi, your comments
about what data do you drop: is dirty data something you get rid of?
Or is it actually the signal
that you’re looking for? A lot of you work with real data. Can you talk maybe about some
of the kinds of dirty data you’ve seen, and data where the
dirt was actually the value? – I guess I can take this. I’m by training an astrophysicist; I even have a computer science
degree from Berkeley, actually. But we deal with a lot of
very diverse data sets, going from optical data
and looking up in the sky, infrared data, also radio
data, and they’re very dirty, just like what you were saying. But I think getting them
clean is one big problem. But I actually see that the big problem in data science these
days is how to interpret what we’ve learned from the model. That actually echoes
Emmanuel’s talk earlier today: how to understand that black box. Scientists are very early
adopters of data science. We’ve, you know, been extracting knowledge from data for eons. You know, let me start with a story, where Sir Arthur Eddington actually
looked up in the sky when the solar eclipse happened in 1919, after Einstein proposed
this new theory of relativity. So he went up and collected these photons from the Taurus constellation and said, okay, have they moved, because the
gravitational field of the sun has distorted these
photons’ trajectories as they move, according to Einstein? So he did that not just in one location, but in three locations on
Earth, and extracted this data and did statistical
analysis for this, right? But now with these new methods, like deep learning or
reinforcement learning, a lot of these neural nets, we don’t have a fundamental
understanding of why they work. So that makes a lot of scientists
very hesitant to adopt deep learning in particular
as a new data science method. And if we make predictions, you
know, for college applications, for example, how do we deal with that? If you’re going to teach
university students, “this is what you’re supposed
to do to discover new knowledge and new frontiers,” how do you trust it? So I think that’s the next
big step to think about. – A different perspective, if we come in from a machine learning angle, is, you know, dirty data might
just be another source of, just a different… data from a very different distribution. And how do we actually
learn from such data. So if I think of an analog
from computer vision, which provided really
one of the most compelling breakthroughs in deep learning,
where there was a dramatic improvement in the quality of
visual object recognition, and kind of across the board, the representations
these methods would find would start transferring to
many, many different tasks. But one of the challenges in
this area, when we start moving to things like autonomous vehicles and robots that
interact with the system, is they’re often on noisy cameras, or even if we pay a lot and
get very high quality cameras, the pictures are from very
different viewpoints and angles, and they’re from a very
different distribution than what we trained on. And one way to make progress is we
could say it’s just dirty data and find other ways to
label it. A different way is we start thinking about algorithms where we can generally
start transferring learning. So with a small amount of,
say, poorly downloaded data or corrupted data, maybe
not adversarially corrupted data, but corrupted in that our camera’s noisy or some robotic arm breaks
down, how do we actually have methods which learn
to adapt very quickly? And as they learn to adapt quickly, they can start to adapt more
quickly in other problems. And I think this is a
pretty promising direction to address some of these questions. – So yeah, I agree with Sham,
definitely on this topic. So over the last few years... one of the reasons
I moved to Hopkins was, I was kind of feeling that, you know, it’s been super fun to work
on hard machine learning, stats papers, but there
was sort of a little bit of this, you
know, hypocrisy internally, realizing that, you know,
practically none of them had been
translated at scale. Practically, it was very unclear to me
that they were really gonna influence human decision making. So on moving to Hopkins, I started
working very, very closely with the health system there. So health system means
the hospital leaders, the C-suite leaders.
And over the last three and a half years, my naivety led me to
agree to thinking that we should take some of the
ideas we’re producing in the lab, where we think there’s really
an opportunity to influence clinical decision making, and, you know, partner
with the health system to deploy these. And so we’ve now deployed some
things that are available, being used by 4,500 clinicians, have scanned over a million patient records, and are deployed in over 90 clinics. And when I did all this,
I didn’t realize how hard it was gonna be, and so, sort
of, some takeaways from that. So one, like when we write
of some takeaways from that. So one, like when we write
papers, we very often start with, you know, very simple, clean, kind of single objective functions. But what we found is when
you’re deploying something in practice, in order
to triangulate whether it’s really working,
you need mixed methods, you need the ability to
triangulate accuracy and reliability in a variety of different ways, because often the
notion of a gold standard, especially in medicine, is very unclear. So it’s a combination of everything, all the way from, you know, measuring
generalization carefully, because there are lots of different ways in which you see shifts in the environment, shifts in the patient population, shifts in the ordering
patterns of physicians. So you really want to be able to learn... like, you want new algorithms, and it’s led to a slew
of really exciting work, all on trying to be able
to provide algorithms that provide stable estimators,
or provide estimators that generalize and are robust
to these kinds of shifts. But in addition to that, methods that can measure test-time reliability in real time, and auditing. So you know, when something
is becoming available to a user, can you somehow
assess whether this one is at risk of being a place where you really aren’t
all that confident? So basically, really deeply
engaging in practical work has given me sort of
a bit of humility, with, you know, sort of
needing multiple tools in the toolbox, all the
way from quantitative but also qualitative points
of view, to be able to assess, to approach the problem, to
see, well, how do we get rid of the many noisy forms of... you know, how do we tackle the
bias that’s in the data? How do we measure whether
it’s working or not? How do we put systems in place to monitor? And how do we try to
think about maintenance, from the point of view
of has the system drifted so far that it needs,
retraining, retuning? All things that I think are very exciting. – So following up on Suchi, that actually, I believe
this is very important. So it’s also interesting to
understand to try to understand what is the role of humans? After all, if data is so powerful, and I actually, I actually
have faced these debates all the time, for example,
at JP Morgan, we like have, people that are real experts, corporations have all sorts of businesses. And then they also produce
produce 80 billion records of data of transactions
all over the world. So there is this thing that
it seems that in principle, the 80 billion records or
the hundred billion records, or the trillions of records that capture how many transactions
that the firm produces, would have all the
knowledge would have all, the models of how things work. I mean, because it’s so large,
it has spent so many years it has spent all the regions
of the world that you wonder if you have a machine learning algorithm, an AI algorithm that
analyzes all that data, are we closing all the knowledge involved in this process, and it’s interesting to, to realize that somehow, even these enormous amounts of data, does not capture everything
that eventually the humans know. And so I am fascinated
also by understanding these trade off between how
can the humans contribute to fill in what inevitably,
maybe the data doesn’t capture. So there is this thing about
us creating more and more data, we have Fitbit, we have GPS, we have all our health,
we have all our finances. And on top of that, there are still humans and that’s really beautiful that we don’t have that we don’t have, reach to the point where by
capturing all these data, we are capturing everything that humans can actually understand. And for that, for example, I tell you, there is something
that I find fascinating. So if you have not watched
that movie of the… which is a true case, of
the landing on the Hudson, this movie, Sully, I actually assign as homework number
one to watch this movie. And why is this so compelling? So this man takes off from
LaGuardia, there is a problem, and in about 300 seconds,
it’s not more than a few minutes, he makes
the decision, against even guidance
from LaGuardia to go back, or to go to another place in
New Jersey, in Jersey City, whatever it is, this man
decides to land the plane on the Hudson. And you know, by the magic
of this human, everybody survives
this thing. Fantastic. So he is asked, “How did you make this decision, given that you have never
landed on the Hudson?” There was no data ever to be learning from. And there is this thing, and
I tell you, go to this movie, I assign it to you as homework, I’m telling you. And there is this term
that Tom Hanks says in the movie: “I eyeballed the situation.” And I have these words in my mind, thinking, what type of data
will enable an AI system to one day eyeball a situation? He was not measuring anything; he somehow had done this when he was, I don’t know, a pilot in the old days, a fighter pilot, right? What in the world is
this eyeballing thing? So when people ask us… I mean, is data, machine
learning, AI going to take over the world,
the humans and everything? No, just go to this movie and see (laughing) how he says “I eyeballed the situation.” And none of us writes papers on how AI eyeballs a
situation based on data. So that’s what’s beautiful
about humans, data, machines. And yes, we can capture a lot of patterns, repetition, we can help a lot. But there is this magic
that our DNA enables in us, to eyeball situations
and to make decisions that are definitely not supported by data. So that means we all have jobs for the rest of our lives, I guess. (laughing) – Yeah, go ahead if you
want to say something. – Can I quickly respond?
This is, I think, an excellent example of discovering
knowledge or having intuition about the system somehow: this eyeballing, this intuition
that we humans developed. And I think as scientists,
we have this scientific intuition. Like, Einstein
didn’t have data, right? He just eyeballed… I
mean, he’s very smart. But he just eyeballed the
situation and figured out general relativity, and you know, GPS would be wrong by quite
a number of, you know, feet if you don’t have it correct. So how do we use that to further the thing that I think Suchi also
talked about, generalization, or Sham talked about, transferability? I think there’s a certain intuition
here that might be in play, to come back to make
transferability and generalization possible and more, you know, broadly applied in machine learning also. (mumbles) – So I mean, I think the intuition is important. There’s a case to be made that there are disciplines
where things like this
might occur in the broader realm of machine learning and control. So you know, one viewpoint
of ML is we’re just doing this predictive learning. And this is clearly an
out-of-distribution event, and you can’t do anything
in this usual i.i.d. setup. But if we think of something like controls, there’s this area called
model predictive control, and I would argue that one
of the very successful ways of dealing with robotic
systems is you have a model which does lookahead planning based on the current situation you’re in. And this works very well,
like in these toy robotics, even real robotics, examples: you kick some robot over
and it’s in some different configuration. The thing
that’s trained in a setting where there are no kicks can’t do anything, but the thing that does
lookahead planning in this new situation, even with relatively
crude models, can adapt. So, in the sense of how do we
mix planning with crude models in these systems, actually, there is a case to be made that
these things are pretty robust to
out-of-distribution events, and the entire area of
MPC planning in controls tends to be about stability and
model mismatch situations. – So building on this
discussion of methods, so one of the topics of
this session is methods. So we have to broach the topic: what are the methods we
should be paying attention to? What methods are overhyped? How have methods transformed what you do? I want you to start. – You want to start? I’ll start.
– Yeah. – So since I’m fresh from
ICML from yesterday, (laughing) where I gave equally passionate (laughing) three different talks on the topic... But again, I think going back
to the thing I had said earlier, I really would love… I mean,
I think, in the work we do, as scientists, we, you know,
data scientists, I think, to us, engineering
is a dirty word. I grew up very much thinking that. It’s completely hilarious to me that, like, we think, “Oh,
but that’s engineering, you know. That’s just engineering,
you don’t want to do that.” And I just think it’s hilarious. And by the way, I’m appointed in an engineering school,
and also, you know, in stats, which
is in a different school. But I think going back to this idea, there’s just no way we’re
going to succeed in building and informing and influencing important pieces of how we progress,
how we make decisions, how we move forward as a civilization, if we can’t bring back
engineering principles into it, and engineering principles of, like, does it do what we expected it to do? Do we understand what it should be doing? Do we understand where it breaks down? Can we quantify and forecast reliability? And I see so little to
no discussion of this. And I really hope that
as we start thinking about deploying, this becomes sort of a much more talked about topic. And part of the issue I see is that the only reason people
are even talking about it is because we have more
complex prediction methods now, and you know, there’s desire to… And you know, a lot of the focus, if you look at a lot of the papers, a lot of the young graduate students... I think the way you tell
the temperature of a field is you ask the second-year
graduate students what they want to work on. And they all say, “Oh, I
want to work on deep RL.” And so the question is,
well, what about it? And you know, they’re
very excited about playing with new architectures,
how many more layers you can add and more knobs you can tweak. And that, to me, feels like, you know, design without any engineering principles. – Let me follow up here. I’m an electrical engineer (laughs). Computer Science is in a person. But in the old days, I used
to be an electrical engineer. And so one thing and
I’m still an engineer, one thing, I think that it’s
really important for us, is to come back to the
data and the methods. There are these very powerful methods that we have now, these
reinforcement learning methods, or methods that enable us to
combine data with decisions and feedback, with
reward, which are very powerful in trying to close this
loop of analyzing data and using the data and eventually observing the use of
the data, getting reward, and staying in this cycle. However, the power of these techniques, these methods of reinforcement learning that we all use one way or another, is very dependent
on the ability to explore, the ability to actually
let the machine, while learning, not just
use what it already knows but try to navigate
that gigantic space, and try to jump around trying
to find the optimum somewhere. Now, when we are in the real world, imagine me being at J.P.
Morgan and trying to say, “Okay, let’s try to
experiment: give a loan, don’t give a credit card,”
doing this in the real world to see
how to learn this policy, this optimal policy. So we have these techniques
that analyze data and extract predictions, classifications, which are fine in terms
of how you use them. But to use this data to actually
do reinforcement learning, I don’t think we can
survive without simulations. We cannot do this search
of this gigantic space in the real world. So at J.P. Morgan, for
example, we are building this simulator of
all these operations in the bank, eventually to try strategies, to train the system on crises. Because the real world, even
if it provides a lot of data, unfortunately, it’s boring
data most of the time. I mean, it’s like what happens every day, it’s like the same thing over and over. So it does not prepare an
AI system for the unknown, which is really when
you need to be able to have an AI system that says, “Oh, this is like this rare disease,
a cancer or something, alert the human about something.” So what I’m saying is this… and I talked with Suchi yesterday on the airplane about this. I mean, we need to invest
in great simulations. So while we look at the
data, if that data enables us to generate realistic simulations, where we can try all sorts
of things, now stretching the reality of the world in all directions, we may be able to, in
fact, have an AI system that can discover the cure for cancer, an AI system that can end poverty, an AI system that
can do amazing things, because it’s able eventually
to try all different dimensions everywhere. So I think that our
methods are very powerful. But the connection between our methods and the actual impact
of using the data, unfortunately, even if we have
billions, trillions of data points, requires us to work
seriously on simulations. So reality is generous in giving us data, but we are not going to
be able to just build upon the data to make great decisions, one way or another. – Kakade, go ahead. – Yeah, so like Suchi, I also just came back from ICML. Definitely recovering
from a lot of the hype. And so before I touch on that point, I’ll also just reiterate
what Manuela said, this point of utilizing simulators. There are a variety of fields
where people are starting to think about this more, and how we incorporate
some notion of look-ahead in real-time systems, and
this is critically important. It’s clearly utilized in other areas. And there’s, I think, a
kind of a rich interplay between this domain specific question, and how do we use various
knowledge we have about the world in real time now. Maybe I’ll briefly talk
about this hype question. So and with regards to methods, it’s sort of difficult
in that some of the most, impressive results we see, are also coupled with a lot of hype, and maybe I’ll give one example. So less than a year ago, there was, I think, a pretty
remarkable breakthrough in natural language processing
from the Allen Institute, AI2, utilizing this
particular language model called ELMo. And it worked very, very well in a number of different
other tasks, and Google built an even more impressive one called BERT that did even better. And, you know, the results
really, across the board, on human-level, interesting
language tests, improved. And more recently, there
was a pretty spectacular language model from OpenAI called GPT-2. You look at the generations
from these language models, and they are very, very impressive. Even many of my NLP colleagues were kind of shocked
by how good the generations were. So they trained the language
model on a ridiculous amount of data for weeks on very, very powerful compute
resources, and then they seed the language
model with one sentence, and see what it has to say, following up on that one sentence
or a couple of key words. And it really does keep
track of various long-term dependencies; it’s pretty coherent. And the results are spectacular. And yet, when it’s released, it’s coupled with a lot of hype, in that it’s released in a blog post. It’s a bit of a shame about the post, because this release was saying that, oh, we’re not going to
release the full model, because we’re concerned that
it’ll be used for bad purposes, because the language model is too good. And it really reads in
a pretty distasteful way from a scientific viewpoint. But on the other hand,
the results themselves were extremely impressive. So there is this coupling
between kind of hype and substance and
sometimes you get the two, just completely together without
even a release of a paper and that I would say is one example. And it’s worthwhile having a look at this, but the results are pretty spectacular. – I will touch on what’s the
most transformative methods, from a scientist’s point of view, because I’m coming from the science side. It also really strongly
draws on what people have just discussed,
especially on simulation and search and optimizing and generalizability and interpretability. So for scientists, the most
transformative method, if it exists, and hopefully it will exist, is the one that can be used to
discover truly new knowledge. And what do I mean by
true new knowledge, right? so I was talking to push me, that dean marly recently
and we were discussing the three sets of
problems that are interesting and that we should really work
on. One is simulation. The other one is search and optimization. And the third one is
actually knowledge discovery, which is really, really hard: something that’s not in your training set, going back to the anomalies, coming back to that one thing
that you need to simulate to get. So how do we do that? And is there a method that will
discover truly new knowledge that is interpretable, so that we can bring it back
and explain to others how we came to it? Which
also touches on, you know, can we make it not so distasteful, but actually something applied, where people can reproduce the
data and reproduce the method. So I think if that method exists, it will definitely get a lot of
scientists out of jobs, and I guess myself out of a job. (laughing) – Alright, so we don’t have
a whole lot of time left. Sham kind of touched on a topic that we’ve run into:
there are too many papers. At ICML, my students are just trying to summarize abstracts, and
it’s still taking way too long. And on top of that, we
have publications and blogs and arXiv, and it’s
becoming like the paper is something that happens
long after the results have been disseminated. Maybe some thoughts on how we
can address this as a field. – You want to take it up, start off here. – I guess we better apply our own science to the problem, right? (laughing) So we are all
doing AI and data science, and now one of the
targets is our own papers. I mean, the whole world is contributing advances. – [Joseph] So my students
are making strange claims, like we should be posting on GitHub and focusing on getting
users of our software before we focus on the results,
publishing results and– – Yeah. – And I, that’s different than the way, I would look at the
world, is that something we should be thinking about? – So maybe a broader point is that, there’s kind of the
traditions of how we work in academic institutions and concepts like tenure
how we judge people, and then there’s a separate question of what’s the most effective
way to disseminate knowledge? And I think often these two
are coming into conflict, coupled with
the fact that, look, people are people. There’s, you know, the inherent nature to
try to stake your claims on things, and just personality. So it’s important to have some ways to try to mitigate these effects. And, you know, there’s
some argument that, I mean, I’m reasonably open minded to fast publication cycles to get ideas out. We should just have a little
more of an understanding of what our objectives are in doing this, because I think it is a reasonable model to try to communicate ideas quickly, and I think the field has benefited from a lot of communication, and certainly just access to GitHub and arXiv supports this way of
communicating knowledge, though that isn’t necessarily the best way in which we can attribute
certain ideas to certain people. But so there’s, you know, I think there’s just a
lot of social factors, that compound these effects. But I’m reasonably open minded to, kind of the, I guess it would
be higher P values in that… I think it’s kind of okay to
have more failures and papers. As long as we reproduce things
and get ideas out there. – My student has a crazy idea, which is to use AI system
to look at all the papers and extract important
things to summarize for us. So I don’t have to do any work. – That’s a great plan. All right, we’re out of time. So let’s thank our panel once more. (audience clapping) say again, oh so we have
a break till 10:35, yeah. So we have a break until 10:35 sharp. – [Ryan] Is my microphone on? I guess it’s not on yet– – [Xiao-Li] No it’s not on
they will turn it on for you. – [Woman Moderator] Please settle down. please settle down, we’re gonna start now. – [Ryan] Hello, everyone,
everybody get seated. We’re going to start
the next panel session. My name is Ryan Tibshirani. I’m faculty at Carnegie
Mellon University, where I have a joint appointment in
the Department of Statistics and the Machine Learning Department. I’m thrilled to be here. I’m really excited for this next panel. We have an all-star cast
of esteemed panelists, and to follow suit from the last panel, I’m gonna have them introduce themselves. You guys can maybe say a word or two about your research interests as well. So please go ahead and start. – Should I start?
– Yes. – Is the mic on? My name is Xiao-Li Meng, and I am a professor of
statistics at Harvard. But I’m here actually on behalf of IMS. I’m the president of IMS, and I particularly thank
Vicki Hanson and the ACM for really having this
great partnership here. And I understand that
we have a very small presence here as IMS
members, but it’s okay. Because you guys, you know, you have 100,000 members,
we have 4000 members. So proportionally, we’re not that far off. So I just want to mention that. And I also want to mention
that I am also the founding editor-in-chief of the Harvard
Data Science Review. And the reason I mention
that is not just to plug Harvard, but because it will be relevant to what I’m going to say later. But I do want to mention
that it’s much harder to get into Harvard than
to get into Stanford. (laughing) – My turn. So my name is Bin Yu. I’m a professor of statistics and EECS, and also computational biology, at Berkeley. And actually, to also
connect myself: I am a past president of IMS, and surely Kelly
mentioned my presidential address, called Let Us Own Data Science. I’m very glad to be here. And in my research, my group,
we have been really trying to do two different things. One is to really embed ourselves
in scientific research, genomics and neuroscience, for
example, and precision medicine, and at the same time build unifying
principles and a framework that will apply
to the whole data science pipeline or workflow. And so this panel is special:
stability is something I have been advocating
for the last decade. So very glad to be here. – So good morning, everyone,
my name is Richard Samworth. I am Professor of Statistical
Science and Director of the Statistical Laboratory at
the University of Cambridge. I’m not the IMS past or present president, but I am the co-editor of
the Annals of Statistics, which is one of the IMS journals. I work on a variety of
problems in nonparametric and high-dimensional statistics, including shape-constrained estimation,
variable selection, classification, lots of other things. – Hello, everyone, I am Aleksander Madry. I am faculty at MIT. And I work on quite a number of things. But I think the most
relevant things for this setting would be that
I try to make sense of why these modern machine learning methodologies like deep learning work. So I call it, like, the
science of deep learning. And the other thing, which
actually turns out to connect to the first one, is that I
think about how to make our machine learning
more robust and reliable. And I want to make a claim
that it’s even harder to get into MIT than into Harvard. (laughing) – Thank you very much. – So I should say that Berkeley
is the hardest to get in. – [Audience member] And Carnegie Mellon. (laughing) – No comments. (laughing) – Okay, so now that we’re done discussing the pecking order of universities. Our topic today is
robustness and stability. And this is the format
that we’re gonna shoot for, although, of course, you
know, it can go off the rails and that could be fun too. I’m going to float two questions and our panelists are
going to discuss them. And then I’d like to save like 10 minutes at the end for audience questions. So I really encourage people
from the audience to come up, and especially in the last 10 minutes, ask, some tough questions. So let’s start off just
by setting the stage. Let’s have our panelists discuss, what do stability and
robustness mean formally? How can we think about them? Have they evolved in statistics and in computer science differently? Are there important differences
that we should be aware of? And what can we learn from each other? So in any order,
anybody can begin. – Okay, thank you. So, I like to define
stability or robustness at a meta level.
So you have a process, your input and your outcome, and you have a metric
to measure the outcome’s deviation from something. So stability, at the
meta level, is about this: you perturb the input in some way, the outcome changes,
and you observe this change. And also at a high level,
you have two issues. One is to measure the
stability of a given procedure. The other is to actually enforce stability, or robustify things. So these two issues
interplay with each other. And at the philosophical level, I think it’s common sense, you
know, stability and robustness. You can go by philosophy of life: Plato said something like
knowledge has to be tied down, but your opinions can change over time. And historically, I think, statistical uncertainty is a specific form of stability, in the sense that the perturbation is
sample-to-sample variability under a probabilistic model, and variance, L2, can be a measure of stability. And maximum likelihood you can see as a way, actually,
to get a stable procedure; we don’t usually see it that way. But if you look at all the
parameter estimation methods, maximum likelihood, under conditions, has the smallest variance,
so it is the most stable. And today, we really think
about a much broader sense of stability. You have
robust control theory, which is also kind of
now under the purview of, let’s say, data science at large, and also robust optimization. And economics has this field
called rational expectations; they really concern
themselves with misspecified models. And in machine learning, I think it at least goes back to the
70s that people talk about algorithmic stability, which
is actually a particular form of Jackknife goes back to the 40s. And you also have Word in the 30s. I think, particular Alexandria
work goes back there. And Leo Breiman had papers on instability, and I think it was a driving force for bagging. So all of that is really
answering this question: addressing or studying
or robustifying things. And I don’t see statistics
and machine learning as separate views,
for many years already. I feel they feed on each other, and it really is the power of integration, I think, that makes it powerful. And my group, agreeing with me, is adding stability to
the principles of machine learning; we call it PCS: predictability,
computability, and stability. – So I would say robustness
is a very classical topic in statistics, it’s been around
for several decades already. People like Hampel and Huber, Rousseeuw and many others made a lot of contributions. It’s kind of the idea that we
lot of contributions. It’s kind of the idea that we
want our methods to do well. (beeping) (laughing) (mumbles) – Out of time already, we want to do– – [Warning system] May I
have your attention please, there has been a fire emergency
reported in the building. For this here by please leave the building by the nearest exit— (mumbles and laughing) – That’s an example of something unstable. – It’s a participation. – It’s a participation. – It’s a huge participation. it’s a stress test– – See Richard I told you
I won’t let you talk. (mumbling and laughing) – [Bin] I know it’s a causal inference. I think, you caused it (laughs). (mumbles) – [Man] That’s a great example from. (mumbles) – [Woman] Ryan 15 minutes into lunch. – [Ryan] Okay. – [moderator] Okay, everyone,
we’re gonna to start again, we’re gonna eat into your lunch. But there was a little flop anyway. All right, Ryan, you’re on. – [Ryan] Okay, welcome back. There’s some joke here
about that fire system not being robust or stable, if you can fill in the blank. Richard, you were in the
middle of saying something. So I’ll let you maybe
start from the beginning. – Careful Richard. (laughing) – Okay, thanks very much. I have to say I’m impressed
that it normally takes my students at least three minutes before
they start getting so bored that they hit fire alarms. (laughing) I think that’s what Emmanuel
called a false positive. Yeah, okay, so I was
talking about robustness being a classical topic, and the idea that we want our procedures to have good performance for a wide range of data-generating mechanisms. And related to that is
that I don’t want any single outlying observation, or a small proportion of corrupted observations, to influence my estimates too much. So there are ideas like
the breakdown point, which is: what proportion of your
observations can you corrupt before your estimate
becomes arbitrarily bad? So the mean would have
a breakdown point of zero, and the median would have a breakdown
point of a half, for instance. And then another popular model is Huber’s contamination model, where people ask how your procedure does when you allow a proportion
epsilon of your observations to come from some arbitrary
different distribution. Stability, I think, is
quite closely related, but a little bit different. My thinking there is that it’s closely related to what Bin was talking
about with data perturbations, and somehow the feeling
that the effects I trust are the ones that I see very frequently on many different
perturbations of the data. So it may often be the
case that I can aggregate the performance of a procedure over lots of different perturbations. Maybe these are subsamples
or bootstrap samples, or random projections of my data into some low-dimensional space, or adding artificial noise to my data, or, in Emmanuel’s world, it
will be creating knockoffs. These are all ways of perturbing my data. And I feel like the aim here
is to be a little bit less model dependent in the
way that we’re behaving. So that seems to be one
of the goals of stability. And there was one little
additional part about how things differ between
statistics and machine learning. I’m not so sure how well qualified I am to talk about this. But it seems to me that in statistics, we often tend to work
with random corruptions, with Huber’s contamination
model or something like that, whereas in machine learning, people talk a lot about
adversarial corruption. So I think these are both perfectly
reasonable things to explore. – Yeah, so just to give
you a bit of an ML kind of view of what robustness means. So I guess it might be
most useful to actually think
back to how it started, or how this
topic gained prominence. And this was this realization, maybe four years ago, that essentially, yes, we
have these great classifiers that seem to be doing amazing things. However, you know, as much as
they are doing amazing things, the predictions they provide, even though they are good on average, are actually extremely brittle. And one of the
examples of this is adversarial examples, where, you
know, you can just take, let’s say, a picture of a
pig that looks like a pig to a human, and the
classifier also says it is a pig. And then you can just
change it in a very tiny way, just changing each pixel only slightly. And suddenly, to us it’s still the same pig, but the classifier believes
it’s an airplane, right? So I usually, of course, make
the joke about, you know, AI being so powerful that
it can make pigs fly. But of course,
at first this is amusing, but then you think about it: okay, we are using this vision system
in our self-driving cars, and now if the vision system is such that, you know, just a small tilt
of the scene might make it not see the pedestrian
in front of it anymore, then it becomes a very serious
and very acute issue. And essentially, once we realized this is something that
we really don’t want, we had to think back and say, okay, why do we have it in the first place? And then we realized that, you know, the standard notion of generalization in machine learning, of minimizing the population
risk, essentially says that if you have a distribution and
you sample from that distribution, you want the loss of your
classifier to be small on that sample. And this is not enough,
because it doesn’t tell us anything about this brittleness, because usually these perturbations are essentially of measure zero. So essentially this is not changing the expectation in any important way. So that led to a kind of definition of, you know, robust generalization, which essentially says the most obvious thing that
you would think of: okay, now, when I sample a sample, I don’t just look at the classification of the sample itself, but I also look at the classification for all possible perturbations within some
well-defined perturbation set of small perturbations, and you measure the
worst case over that, and that worst-case prediction is what you’d like to be good in expectation. Okay, so that’s essentially
where it started. And that’s what, classically,
robustness in ML means. Of course, since then,
you know, we realized that there are many other
things that are broken with machine learning
than just these particular inference-time attacks. And somehow, just to
compare to what happened in statistics, I feel we are
kind of converging as a field on the same track. In some ways statistics
is way, way, way ahead of us, but the tracks seem to be converging, because now we realize, okay, there’s this whole data pipeline, and there are data poisoning attacks, there are all these things
where robust statistics is part of the answer. So there is a lot of convergence. And also, I might say that
this initial realization, and this is in tune with what Emmanuel said today in the morning,
is that currently, I think, these problems of robustness led to some kind of
identity crisis for ML, at least in my opinion, because
we started to realize that, you know, it’s not only about how to make our classifiers learn, but actually what it is
that we want the classifiers to learn in the first place, because we realized that
the way our current models succeed is very different from how we as humans envision them. And there is some very
nice kind of understanding from trying to reverse engineer
the mechanics of, you know, how models work, how to
make them more human-aligned, how to make them interpretable. So there is kind of a lot
of things that machine learning is discovering now, like how to find the positional data, but I think there’s a lot
to learn from statistics, because some of this work was
already done in that field. – Oh, as a statistician, I’m
really happy to hear that, particularly that statistics has
been way ahead of machine learning. I’m kidding in this respect, but not
really kidding. But seriously, you know, let me give a very precise
mathematical definition, for me, of both robustness and stability, and then I’ll talk a little
more, like Bin, sort of philosophically. To me, it is
about the derivative being small, whether it’s a black box, or however you
define a more general concept of derivative. As you
all learned calculus, you know what
the concept means: that the rate of change is small. And now for me, as
Richard also mentioned, the robustness notion was a notion in statistics for a long time,
and it was initially put together more as a notion to guard against model misspecification. But
for those who understand that literature, you
know that at the time there was a pretty big
debate about whether these robust methods are really good, because it
could be that these outliers are actually the real discoveries. And often that is the case. So you see that there’s
no free lunch, right? That’s
the whole issue of doing machine learning, or statistics, or any of those things. So for me, I think about
both notions, robustness and stability. By the
way, stability, to me, is the notion more
coming from computer science, the computational algorithm,
the iterative algorithm being stable; you know, that’s
the notion as I learned it. But for me, I think if you
think about the big picture, both robustness and stability are pretty much everything we do. If you look at what
Emmanuel talked about today, when you have a black box, you want the black box
to be very sensitive to the right signal, because that’s where the
predictive power comes in. But then you want it to be very
robust to the wrong signal. Right, so the whole notion depends on that, and that’s what
makes all of us have a job: to separate which is which. And what I want to do specifically
is talk about one specific example, which I think
will be a really great example of how computer scientists and statisticians and
others will work together to address this really
big problem that is coming. As many of you know, this
notion of differential privacy was created by, you know, Cynthia Dwork and others from computer science, and it has been incredibly
important, because protecting privacy is such an important issue in this digital society. And it has become such
an important tool that the US Census Bureau has
announced that for the next census, 2020, all the data released will be
with differential
privacy protection. Now, on one hand, this
made a lot of people happy, because our data will
be private, and so on and so forth. On the other hand, there are a
lot of social scientists in particular who are getting very worried; I know Bin has
been worried about that as well. How are we going to analyze
all these data? Basically you are telling me that no data
will be real data; it’s all noise-infused. The notion of differential
privacy itself is defined according to
robustness, or stability. That’s why I actually
tie the two things together here. That’s why it’s literally called
differential privacy. Essentially, it is saying that
whatever system you have, if, by taking out one person, the change you can see is very small, then that does protect a person’s privacy. So essentially, it’s very
stable with respect to removing one person. But then, of course, you know, when you really implement this
thing in this whole census, there are a lot of implications, because
you’re doing that by adding noise in a controlled way. And later, when we get to
talk about applications, I want to talk specifically, since I got involved, about
how we really think about this big trade-off between privacy, which is a form of
stability, and utility, the accuracy of
the scientific results, where you actually
don’t want stability: you want a very sensitive
signal for the things happening in, sort of, disease
predictions, all those things. So that’s the fundamental trade-off, and I think that’s a very exciting area that we should all get into. Because, you know, the 2020
census is coming next year. – Okay, perfect segue. Thank you, Xiao-Li. So, next I want to discuss whether or not the definitions
that we all just talked about the kind of the foundational
ideas behind robustness and stability are really aligned with what we should
care about practically. And to discuss whether or
not there are things about, you know, practical
issues that should make their way back into the foundations. So people are welcome to discuss examples, in their own work or
in other people’s work. – Go ahead.
– Okay, I got it. Okay, I can start. So like the answer is that,
you know, essentially, almost by definition, what we can mathematically state precisely is never exactly capturing
what we need in the real world, it’s always trying to chase
the concept that we have. And I think in the
context of the robustness, this seems to be even
more kind of more true than in other contexts. So in some ways, again, when we say words, like robust and stable, we have
an intuitive meaning of it, or even if we think of
a word generalization, we have an intuitive meaning of it. But actually, when you look at
what we try to put in words, you realize that this is not, capturing much of our intuition, and exactly like, this
is exactly what happened with like, standard, like
population risk minimization, was not capturing this kind
of intuitive robustness criterion that we as
humans thought was obvious. And now even with the robust, like, the robust
generalization definition, well, you have to ask a question, okay, what kind of perturbations
you should be robust to, and again, there are some
obvious ones: for images, we have changing each pixel by unit one, and clearly you should be robust to that. But then there are rotations and there are local transformations, and all of that is not really
captured currently, and not really explored
from this point of view. And also, as I said, more generally, you know, this whole notion
and intuition of robustness is kind of asking: what does it mean to generalize? There was a whole panel
at ICML yesterday that I was speaking on,
which was about the question: what does it even mean to generalize? And that’s the kind of question that we don’t have time to
go into, but it is actually very interesting, and it leads
to quite unusual conclusions. – So, for me, actually,
my interest in stability came very much from my interest
in research in neuroscience. We were working with Jack Gallant’s lab;
we have this very well-publicized work to reconstruct
movies from fMRI signals, about seven, eight years ago. And then I wanted to
do some science, right? It kind of goes back to what
Xiao-Li was talking about. Now there’s a name, we call it
scientific machine learning, that is really trying to get knowledge out of what was a predictive, we can call it engineering, exercise. And the thing is how to
interpret. At the last step of this pipeline, from
basically a Gabor transform to a linear model, that’s Lasso, I want to interpret. And then if you change the data by 10% (it was kind of arbitrary
how long they recorded this fMRI experiment, with
people watching clips of movies while they measured fMRI signals), then the selected parameters, the features, change a lot. So that’s kind of the
beginning; about 10 years ago, my interest in stability was
really for interpretability. And I think wrapping things around a black
box is definitely useful. But I also think, in presenting your research, we need to make the black
box gray or transparent. And that’s through
interpretable machine learning. And in terms of where we
are now, for my group, we are really now seeing
stability as very much a unifying principle, from
problem formulation on. You have instability
you have to worry about when multidisciplinary
people work together. Do you mean the same thing by “matrix”? I was talking to a biologist
at IBL for her matrix? In her theory of cancer, research, it means biological substrate. And for me matrix means we know a matrix. So our DNA tried to
talk to is Amino B cell, we couldn’t really quite get across, I think I never was clear to her that, I was talking about different matrix, it was very short time. So I think from right there,
and then for data cleaning, or something, I think one
of the panelists mentioned that the buts right,
would you remove things, and that’s something we really
have a have a formal way to address as a community, how can we make that step transparent. And also, we know that there’s
a lot of human judgment calls that we make in the whole
data science pipeline, or life cycle. And so we try to use what’s
called PCS documentation (predictability, computability,
and stability; the S
is for stability), and really get people to record: why do you make these choices? How do you get the data? In bioinformatics, it’s very common that people have multiple ways of cleaning the data. How do you make the choice? So we really try to recognize that the whole process, as we practice it, is not
just hardcore engineering, mathematics, statistics, computer science; there’s so much hardcore liberal
arts type of judgment calls. Can we bring that type of persuasion, or hardcore empirical rigor,
into the whole process, and just recognize that we’re not 100% rigorous? And the documentation, I
think, is the important part: not just code, but
also the decision making that puts it together. And through interpretable machine learning, I think we’re really
bringing the whole enterprise forward, especially for what Emmanuel called mission-critical applications. – So I’d say that we’re
quite used to being able to adapt our definitions according to what we can actually say. So to give you a concrete example, I was working on a problem of log-concave density estimation. And I found that if you
corrupted the true distribution with another distribution that had a finite mean,
everything worked beautifully. It’s all very nice, just what you want. But as soon as that
corrupting distribution didn’t have a finite mean, you’re dead; you couldn’t use the procedure at all. So I could either report that
this procedure was not robust, or I could change my definition
and say that it was robust, provided you corrupted with a distribution that’s got a finite mean. Which do you think I’m going to do? So yeah, I think that’s
kind of how we often work. We often adapt our definitions, and as long as we make clear
what statement we’re making, I think that’s kind of okay. But it is still useful
to have these kind of, broad definitions and
I think, either working with a random contamination
model or an adversarial one, these are both quite,
quite natural definitions, I think to work with and definitely both have
a contribution to make. – So I wanted to continue to talk about this particular project I got
involved in on differential privacy with the US Census Bureau, because I think there’s a lot
of really interesting stuff. I want to engage both
the computer scientists and machine learners and statisticians, everyone, because once the
Census Bureau releases data with differential privacy injected, you will see all
kinds of data out there. And you really should pay
attention and should help them, help the researchers. And this is why I
mentioned at the beginning that I’m the founding editor of the
Harvard Data Science Review, because I’m getting
involved in this project through this journal. And I want to give a shout out to… This is a part of the Harvard
Data Science Initiative. Its co-director is here,
David Parkes, sitting there. David is gonna be a fellow tonight, right? Congratulations, David. But let me say specifically
what this project is. I don’t know how many of
you know that the United States Census Bureau
has this policy that their data are
private for 72 years. But after 72 years, they can release them. So what they have done is they
have released the 1940 census. So the 1940 census is out there. And most of you probably don’t know, they just recently released the code they’re gonna apply to the
2020 census, to make the same data differential
privacy protected. So now, any researcher
can do the following: take the 1940 data, analyze it
in whatever way you like, then apply the
same analysis method to the differentially
private protected data. Now here the question is
really a stability question. You want to see how
much change is there. For most scientists, most
social scientists particularly, they want to see as
little change as possible. They’re worried about the impact of this differential privacy. On the other hand, you know that if
there’s really not much change, the Census Bureau has not done its job; it was wasting its time. So, all the same, you will see there are big changes. So what we are doing, teaming
up with the Census Bureau, is organizing two sessions
for this annual conference, which is the Harvard Data
Science Initiative conference. And I was given one day to
organize the conference. We organized two teams of social scientists working independently. They are both applying their
methodology to the 1940 data without differential privacy protection, and to the 1940 data with differential
privacy protection. And you see what’s going to happen. I tell you that we are all very worried; particularly the person who
was in charge of the census operation was really
worried, because of what it could show. You will see that,
and we know, as statisticians (I hope most of you would
have the same intuition), what’s gonna happen most likely is that a lot of significant results will become insignificant. Now, for many of you, per
se, that’s a good thing. As you saw in Emmanuel
Candès’s presentation, a lot of this stuff,
the significant ones, probably shouldn’t be significant.
But on the other hand, you have to worry about
whether we understand: is this caused by the
differential privacy protection, or is it that it really shouldn’t
be significant, right? Now, so this will be the
comparison we’ll be doing. I was just reminded very recently that there should be another sort of comparison, which is really the
right one to do, and that requires lots of new methodology. Once the data are released with
differential privacy protection, you shouldn’t analyze them as
if they were not protected, because, you know,
more noise gets injected. And the Census Bureau will tell you; they will actually be
transparent there and will tell you what privacy protection
they actually provide. You cannot re-engineer that, but at least statistically, you can. So the really wise thing to do is analyze the differentially private
data with new methodology taking into account
that kind of protection, and then compare to the
results from your original method applied to the original data without the protection. And that should be
the right comparison, or at least one important comparison as well. But we should do both comparisons. The first one is really
to show how much damage, or how much stability, is there if I don’t make any changes. The second one requires
a lot of new research. In fact, when I thought about it I said, oh, this finally is gonna bring
back a lot of robust methods. In statistics, for years, we have had this methodology developed, for example, trimmed
means: when you take an average, don’t take an average of
everyone; you cut off, you know, the top 5% and the bottom 5%. You do those things, right? And most of the time we
don’t use those things, because these methodologies, you know, they are useful, but we find
sometimes they lose efficiency. What’s interesting to me as
a researcher, and I realized this, is that when you have this
differentially private data, these robust methods are not only more robust, they actually gain efficiency, because they’re going
to guard against the noise injected into the data by
this protection. Now, how do we get that done? There’s a lot of work there. But what’s interesting is
that that’s a really very concrete example. And I think we should really
pay attention, because once the Census Bureau releases those things, you know, the impact is tremendous. Let me give you very quickly an example: the whole idea that you can take a sample within a population was a revolutionary idea
in the early 1900s, and it took about
45 years to become a sort of common
methodology for social science. You know where it comes from? In 1945, the US Census Bureau decided to do sampling. And after that, it just
changed the landscape. And this is gonna happen next year: when they decide to put
differential privacy there, it’s gonna forever change the landscape. Many of us are worried, many are excited. But that’s a great research opportunity, literally in the field of stability versus instability. So I encourage you to work on that. – Maybe I’ll just add one more comment. So actually, my group and
collaborators were really successful adding stability on top
of machine learning. So we arrived at something
called, for example, iterative random forests, which is built by adding to random forests, and there we have had success finding stable, high-order
nonlinear interactions. And we have some interesting
results from Biobank data, as a test case, working
on trying to predict whether somebody’s hair is red, because there are a lot of well-known genes. And talking about stability,
just another level, at the human interaction, is when we confirm a hypothesis with domain
experts. I think we also
need some perturbation there. We should give
them not just the results we have; we actually
also should have some kind of fake results for the expert, to
really get more authentic information for confirmation
rather than just confirmation bias. So I think, again, stability
is really throughout the whole data science lifecycle. And it’s a good conceptual consideration with many manifestations that users can really adapt to
their particular applications. – All right, at this
point, I’d like to welcome some audience members to ask questions, please don’t feel shy stand
up and ask any questions. There are two microphones, one in each aisle. – Anyone who asks a question
will get one of these. (laughing) All right, we’re in trouble now. – I have a question, Xiao-Li. I was surprised to hear
you talk about trimmed means. It seems kind of a primitive
thing to say, like, well, this is what we’re
going to do to the data because we’re worried about
outliers, whatever, like, kind of Stone Age. Like,
can’t you just... I would think we’d want to model the distribution of these survey responses, which I guess is what the census is. And like, once modeling
it, you’ll do inference using some approach and
whatever approach you’re doing. If your model allows for various errors that you want to be robust for, then that will come in naturally, I would think that would
be a cleaner approach. So I don’t, I don’t know why
you’d want to go backward in that way. bothers me a little bit. – Graduate School debate continue. – You know, for those who don’t know, Andrew and I, we were
twins at academically, we had a but, so whenever
and just talk about, I know, I’ll be in trouble. But this one I do know how to answer. And because you really have to think about we’re talking about all kinds of users, of this defense private data. If someone if you know, someone like you and many of your colleagues, many of us, we can do much more sophisticated ones, taking into account how the privacy, was what was designed,
built into the model. That’s what we’ll be doing
but on the other hand, that a whole lot of other
people’s who I would rather them to use some kind of a simple method, but protect them a little
bit, instead of just, completely ignore the fact that these data now had differential
privacy approved building. So it’s really for the
for the practical purpose. But you have your your core
for much more sophisticated analysis, I’m all for it, thank you – Can just chime in, I thought, it really depends on what
you want to do with the data. So if you want some average, you know, mean type of thing yes. But if you want more
sophisticated questions, definitely you have to move
away from the trimming. – But most survey, answers are
essentially to get average. If you look, if you notice
over the three will go there. Most people estimate is kind
of some kind of population average, you can write that out yeah. – I have a question. So when we think about deploying a model into the real world, athletes
been trained and so forth, often time, the real world changes. And so I just wanted to ask
your thought on the discussion now in terms of stability,
robustness, how to think about stability, robustness news on the topic, in the context of
changing, changing world. – Yes, so that’s difficult,
because in the end, that’s what we really
are trying to do, right? Because for that, like, you
have to be precise about it. Okay, so I think the biggest
challenge in ML currently is exactly this. It is called domain
shift or distribution shift: how to train models that actually fare well under distribution shift. And I think, you know,
the problem is that we don’t have good data
sets to test it against. You can, of course, deploy
things in the real world. And then usually what happens is, like, every kind of use case,
for specific distribution, maybe it’s not a well
specified distribution, but there is a, there is
a way you use the model. And then people essentially,
like they start with something, then they deploy it, and they tweak it so it works the way they want. And again, what is nice about robustness is that it kind of provides you
a principled methodology to do a bit better
than just saying, okay, let’s deploy the model
and see what goes wrong, and just, you know, fix it. And essentially, what
we discovered recently was that it actually goes
even further: for images, with a simple robustness where you kind of try to make
sure that small pixel changes do not change the prediction
of the model, essentially what is really happening
is that you are changing the prior of the model;
you’re embedding a prior in the model that makes
it more human-aligned. So it turns out that you can
actually fare extremely well at image prediction by using
features that humans do not see and humans do not use. Okay, so just to give an example, imagine that in your data
set there is one pixel that encodes the class,
you know, and you are trying to differentiate between dogs and cats. Or, for dogs and cats,
it’s about the background. Usually, we photograph cats inside, we photograph dogs outside. And that’s actually a strong signal; the background can be a strong signal of what is the right class. But we as humans learn to discard it, because we know this is not essential for the dog-versus-cat task. The deep learning
models, they have no idea. They just extract correlations, and they will use this feature. So essentially, what robustness
does is actually provide you a way of systematically kind of preventing the
model from using this kind of spurious rules, like
meaningful but spurious features. So this is just one approach. But in the end, we don’t
have very precise data sets and definitions of what we want to do. This is just an intention. And it always
will be domain specific. – Okay, so to keep things flowing, we’re gonna have to end
this panel session here. So we’re not too late for lunch. Let’s give our panelists
a big round of applause. (clapping) – We still have three minutes.
– We still have two minutes. I was told by the Emma, – [Woman] No, you’re done. – We’re done.
– We’re done, thank you. (laughing) (mumbles) – [Andrew] Do we have a
special order in which we– – [Yannis] You can sit wherever you want, but the order of questioning
is special, so you can sit. (mumbles) I don’t think so. I mean, the order of
questioning is what I said, but the order in which you sit,
I don’t think it matters. Where is Kristian? – [Man] A little bit lower– – [Yannis] Is Kristian? (mumbles) – [Yannis] Oh, she’s been waved up here. – [Joaquin] It’s so bright. – [Andrew] It is bright. – [Alexandra] Just feel comfortable–
– [Kristian] Sorry, my, well, can you hear me? My mic fell off. (laughing) – [Yannis] Okay, I guess
we’re ready to get going. This is the panel on fairness
and ethics in data science, something that we will
care about independent of whether we work on data science or not. And we have an amazing list of panelists. I’ll ask them actually
to just say their name and where they’re from. And then we’ll get going
with with the questions. Let’s start from my left, Andrew. – I’m Andrew Gelman,
and I’m a statistician and political scientist, I
work at Columbia University. – Hi, I’m Kristian Lum,
I’m also a statistician, and I work at the Human
Rights Data Analysis Group, commonly called HRDAG. – Continuing the theme,
I’m Alex Chouldechova. I’m an assistant professor of
statistics and public policy at Heinz College, Carnegie
Mellon University, and I’m also a statistician. – Hi, I’m Joaquin and I
used to be a researcher. My training is in ML. And I spent the past few years
in industry at Microsoft, now the past seven and
a half years at Facebook and building AI infrastructure
for the entire company, and more recently, for the last year, working on AI ethics and fairness. – And I’m Yannis Ioannidis,
Professor of Informatics at the University of Athens,
and also the president of the Athena Research Center,
also in Athens in Greece. And I’m also the ACM secretary treasurer. So let’s start with some questions. And I’ll start with the
very title of the panel, fairness and ethics in data science. Is it mostly an issue
of policy, of mentality, or of technology? Alex, would you like to take a shot? – Sure, I would love to. In my mind, it’s all of the above and then some. I think
there are some things that are missing from your list. It’s a matter of decision making. It’s a matter of psychology,
behavioral economics, all these things that we as statisticians and computer scientists don’t necessarily spend all that much time talking
about or thinking about, and are perhaps unfamiliar with. So I want to go back one step and ask, why are we concerned about
fairness and ethics right now? Like, when did this all start? What motivated it? And as Emmanuel reminded
us earlier this morning, if you think back to the sorts of problems that we were dealing with,
maybe 15-20 years ago, something like the
Netflix prize challenge, right, so let’s recommend movies. And at the time, I don’t think
too many people were really concerned about these issues of fairness, accountability, transparency, and ethics in the technologies
that we were deploying. But now we are, and I
think what has changed is that technologies are
now being increasingly deployed both in the
private and public sector to make decisions or assist
in the decision making in settings that are of consequence to individuals in society. And so now we need to be concerned. And echoing something that Bin was saying on the previous panel, there’s an important
translation task that I think we need to engage in; this
is where things like policy and the Decision Sciences and law come in. Bin was pointing out that
the term matrix means something different to a
biologist than to a statistician or a computer scientist. But I challenge you to
think about what justice or fairness or trust or contract mean, in statistics, these are things that we simply don’t
usually have definitions of. And yet we are trying to
propose technical solutions to these challenges that are
phrased in terms of terminology that is unfamiliar to us, so
I think it’s really important to engage with these other disciplines, invite them to our events, attend theirs, so that we build a common language in order to tackle these problems. – Okay Joaquin do you want
to say something about it? – Yeah, absolutely. It’s funny, because of
the questions you proposed, this is the first one that caught my interest, because
the hardest lesson that I have learned, and the
most interesting lesson that I have learned, the past year, coming to the question of, you know, ethics in general, but I’m gonna talk about
fairness specifically, was that, you know, I’ve
realized without a doubt that it’s a way
more complicated problem than I thought it was. And one of the key ideas that I’m obsessed
with is that fairness is a process, and I think that many of these ethical
questions are processes. And these processes
should include key steps. One is to make sure to
systematically surface all the hard questions,
right, and I’m gonna give you a couple of examples in fairness, and to have systematic
processes to resolve those questions involving
all the right people. And most of the time, if
you’re thinking about AI bias, or algorithmic fairness, you will actually not
want the AI practitioner to be the person making the
decision, either because, we’re not equipped to make it, or because it shouldn’t
be us making the decision. And then the third part
that is really important is, to have careful documentation
of the decision processes and what the decisions that
were made were, for transparency, obviously, but also to build internally (I talk about building case law), because many of these decisions
we have to make seem unique. So I wanna give you a couple of examples
on the ground in practice. Okay, so imagine that you
wanted to build a system, you wanted to use ML,
to help with the problem of reducing the spread of
misinformation or false news. On a platform like Facebook,
or like any other sort of, web sort of a platform, right. So obviously, you know,
initially, you could say, Well, why don’t we use
professional fact checkers, right? Like, that’s one possible approach, right. But that doesn’t work, because
the scale of the problem is, is pretty massive, right? And then you have additional challenges. The definition of what
constitutes false news isn’t clear in itself really depends a lot
on the point of view, right? And then finally,
professional fact checking is, a process that takes a long time, right? Okay, so now, you know,
you’re, you’re an engineer and you’re and you’re thinking, all right, well, what can we do? Right, and you’re trying to build this out and you think, Well,
here’s a research project and is an actual research project we’re conducting at Facebook,
which is like, well, could we ask members of
the Facebook community to try to point to journalistic
sources that either confirm, you know, or contradict
a claim that’s being made in a piece of content. And when you think about that, you have a lot of hard
questions that start to arise, Like one is like, well, who
are those people gonna be? Is it all gonna be like members
of the Facebook community who live in the Bay Area? What how’s the rest of the
country going to feel about that? Not to mention, you can
try to do this in another in other countries. So obviously, when you
think about about fairness, you have to be thinking
about what are the dimensions that I should be considering? What are the important
groups to think about? Your AI engineer, me in this case, should not try to answer those questions, but they should be disciplined
to ask those questions. So obviously, this is
something that we’re that, we’re tackling, we’re
looking at things like, geography, we’re thinking
about gender, age and we’re thinking about
even like political ideology, you know, to make sure that, there is a diverse representation there. And that’s just the labeling process. But even before that, right,
like, you need to stop and say, well, content, moderation, right, is about trying to protect the community from harm. But what does that do
to freedom of speech? And are there groups in society? Who’s ability to express
themselves, might be compromised? Because we’re trying to do this. That’s not an AI question. But it is going to be a
very relevant question in how I actually build
my algorithms, right. And another interesting
example is the India elections just took place. 900 million people voted,
the biggest elections in the history of humankind, right. And in there, obviously,
we were very, very nervous as you can imagine, right? Because there hasn’t been a
lot of election interference happening on our on our platform right. And so we had a, you know,
another interesting example, on the ground, right of
a fairness question is, there again, you know, that we partner with election monitoring
organizations and so on. It’s been difficult, It’s impossible to
review everything, right. So you need to prioritize and
so there’s a very beautiful symbiosis and collaboration between, ML, which prioritizes the content
that humans should review, and then humans professionals
who are reviewing that content and making decisions, right. But India is a very diverse country. And so then you think, well, if I think about fairness, what if for some regions in
India, or some languages, I don’t have enough data, or my algorithms are biased in some way. And I’m either flagging too much content to review or not enough. Then I’m not optimally allocating, my limited human review resources. So there was another another
practical on the ground, you know, fairness question we had to ask. But again, we had to sort
of pause and say, Well, what are the hard questions here? And it was things like, well, what are the most sensitive
groups in that country? I have no idea, So you need to work with, the right people in the country right, to answer those questions. But very quickly, you realize even that, I’m just gonna to say
something uncomfortable that many of these questions end up being, political questions. And I think it’s important
to feel comfortable with the reality, that
if your reasoning about, algorithmic fairness, you know, issues of, ethics and in AI, you have to escalate some of
these questions all the way up. And, and they’re going
to be very uncomfortable and the right people inside
and outside the company need to be involved. I’m gonna try and wrap it up. I had a lot more, but I want to be
conscious of time here. One of the things that
I’m really excited about is to see that, both in
academia and also in industry, there’s an emergence of
very multidisciplinary teams. Traditionally, you know, at
Facebook and Microsoft, I’ve seen teams like
policy or, you know, legal treated almost as the enemy. You know, you’re
trying to build this thing, and you’re like, oh, God
damn it, I have this checkbox that I have to tick with the policy team and with the legal team before I
can launch my product, right? That does not work. That does not work if
you’re trying to answer very difficult questions like this. So right now, in my team,
we have moral philosophers. We have political
scientists, we have artists, and I think you mentioned some disciplines that we don’t have in my team yet, but I’m really excited about this. And in talking to friends,
you know, in academia, I’m seeing, you know,
departments of philosophy and Department of Computer
Science, building joint programs. and I’m extremely excited about this, and I think that’s gonna
be really important. And I’d like to quote someone
that I’ve been talking to, Sandra Wachter, who’s at Oxford, and she’s actually been studying things like explainability. She comes from law, so she comes from a different perspective. And she said this one
thing to me at the end of last year that still resonates, right,
as we were having these kinds of conversations, she said, you know, many of these questions of fairness, we’ve been debating them since Aristotle, it’s been a long time. And we haven’t resolved them yet, and she says something like,
“it’s really important that,” “we don’t burden technology
with difficult issues” “that we humans have not
been able to answer.” – Okay, well, I may come
back with some of the aspects that you brought up. But let me go to Andrew. Andrew We talked about
bias and algorithmic bias. What are some ways in which you can think about algorithmic bias
as a social problem, and not as a technical problem? – So I, well, I guess, first, I should say that when
I think about ethics, it’s usually about like, something bad that somebody else has done. But but maybe we should think about, bad things that, we’ve done. So I want to talk about an
ethical failing of statistics as a field or machine
learning for that matter. And I think that we… it’s, that we sell stuff, statistics or quantitative analysis as a way of converting
uncertainty into certainty. And in some settings, that makes sense. Like if you run a casino in Las Vegas, you really can convert
uncertainty into certainty, because there you
have a mathematical model that’s very accurate. But in other cases, that’s not the case. I think what we teach
is that identification plus statistical
significance equals truth. And we spend a lot of time
focusing on little problems, or, like, the implication
is that we’ve figured all this stuff out: if it’s statistically
significant it’s true. If you have causal identification,
you can believe it. And then there are a few technical things, we have to worry about
extending it to other problems. But I don’t, I don’t really
know that that’s the case. And we also teach that if you
have a randomized experiment, all you need to do is increase n, and you will resolve all your problems, because after all, life
is bias and variance. And if you have a randomized
experiment, you have no bias, and the variance goes like
one over square root of n. But that’s not really
right, for two reasons. First, you can have bias
in a randomized experiment, it’s, the people in the
experiment can be aware of what’s going on in the experiment. And also, effects can vary by person, so there can be bias based
on who’s in the experiment or who’s still in the experiment. Secondly, you don’t really increase n, because you’re only
increasing n of the sort that you’re getting more, measurements of, you’re not
learning about other scenarios that you might care about. Now, the problem to me is that there’s a distinction
between evidence and truth. And I think a lot of people
feel that if they know the truth, well, let’s
say, if you know the truth, it can motivate you
to go beyond the evidence. And I would say right now in science, there’s the expectation that a good study, a high-quality study,
should yield statistically significant results, or the
equivalent near certainty using some other method. The problem is not with a p-value or
statistical significance; the problem is with the
expectation of near certainty. Now, when you have that expectation, first, there’s a pressure
to cheat in some way to produce the certainty
that’s expected of you. And of course, there’s a selection bias that the cheaters can prosper. And conversely, the
expectation of certainty can lead uncertain results
to be taken as zero. And again, if you look
at statistics books, including my own, we go
through lots of examples where Oh, we had a
question, we gather data and here’s the result and
95% interval excludes zero, there’s the false positives
have gone away, whatever it is. So the expectation is, that’s how science is supposed to proceed, right. And that’s not true in
other areas of endeavor. If you do sports, I need
hardly to remind people in this area of the country
that you can lose sometimes. And that’s okay, sports
would be pretty boring if you won all the
time, wouldn’t it, right? And so the expectation
is that you do better. But in science, often,
the expectation is that you win every time. Or, more subtly, it goes like this: somebody does a study, they think they have 80% power, maybe a third of the time they succeed. And so they feel very
virtuous, that they’ve, they’ve taken the hit,
they’ve, they’ve taken the hit two out of three times. But actually, their
studies have 5.1% power and almost everything
they’re seeing is noise, but they feel like they’ve
worked so hard for the successes, they don’t want to let go of them. So beyond that,
lots of overconfidence, followed by embarrassing scandals, can rightly reduce our overall
respect for data science and data engineering. In summary, I think the
algorithmic bias that concerns me is not a bias in an algorithm so much as the social bias
resulting from the demand for, and expectation of, certainty. – I like what you said, that
if you believe you know, the truth, then you can
ignore the evidence. That’s the way I guess many world leaders, operate these days and we see the results. Okay, we talked a lot
about algorithmic bias, but there is also data bias. I mean, data bias versus algorithmic bias: how do they interact? And how do we avoid them, or compensate for them? Kristian, do you want to? – Yeah, so actually,
I think this ties back to something Alex was
saying a little bit earlier about some of the complex
views around technology, sorry, not technology, around
the terminology we’re using. So I think for many of us in this room, when we hear the word algorithm
in the context of, say, predictive modeling,
or something like that, maybe what we’re thinking
of is the set of computer instructions that
tell us how to estimate the parameters of the model
or something like that. By and large, I don’t think
that there are any cases where someone has sort of
fiddled with that aspect of an algorithm to result
in some sort of bias. I don’t think
there’s some sort of malicious actor out there, or I
haven’t seen examples of malicious actors out there, who are, you know, hard-coding
into something like “if person equals
minority, then treat badly,” or something like that. That’s not how this is really working. And what that sort of leaves, in terms of the bias, is
the other aspect of it, sort of the models that
come out of, you know, taking your algorithm, learning parameters from data, and then producing a model. And so coming back to the terminology, I think, in sort of
the common discussions you’ll be hearing in the news media, often what people mean
when they say algorithm is the model; they mean sort of the fitted parameters, the way that you, say, count up the points on your fingers that someone gets against them, like two points for
being young, three points for having previous arrests, that sort of will make predictions about whether they will be
re-arrested, for example. And so there is that sort of divide there. But, you know, my opinion
is that most of the bias that we’re seeing comes from the data. It’s derived from the data;
it’s that the algorithms are learning from the data these patterns that reflect bias. So I think it’s
worth diving into a little bit what we mean when we
say bias in this case, because I think, you know,
there are many statisticians in this room; I think
we’re three out of four, at least on this side of the panel here. And so probably what that means
to many of us is something like selection bias or sampling bias. And I think that is a
legitimate use of the term in this case, when we’re
talking about biased data. So one example I usually go into, to illustrate how that might
lead to a biased algorithm in the sense that we
hear about when we’re talking about this in sort of
popular discussions, is using police data
to make predictions about who will be a criminal
or who will be arrested in the future. The idea being, for example, if you were to have police
data on, say, you know, drug crimes, there’s this reality
that in the United States, drug crimes have been
disproportionately enforced in communities of color. And so if you use
that data to learn patterns, to then make
predictions about, say, where you’ll likely find drug crime or who will likely be arrested
for drug crime in the future, you sort of perpetuate those problems. You’ll be over-predicting, you’ll be disproportionately predicting in the same communities
where the data itself is reflecting this
historical over-enforcement. So that’s one aspect, right? That’s sampling bias: the
data is more likely to contain drug crimes committed
by people in communities of color than by other people. That’s sort of definitely
what’s going on there. I think another example of
bias we need to think about is
measurement bias, sort of
this type of representation, where there are sort of systematically different representations of
one group versus another group, or one individual versus
another individual. And the example I like to go into here is teaching evaluations. So, you know,
what we hope they measure is something like
teaching efficacy, right? But there have been studies
recently that show that women instructors, at least in these studies,
and men instructors were equally efficacious
when measured using somewhat more objective measures: how the students performed at the end of the year on
exams that were consistent across the different sections, or how long it took
them to return the exam, something like that, where you can
really just get an objective measure of their performance. However, when you look at
the student evaluations of the teaching, what you’ll
see is that women instructors were systematically ranked
lower than the men instructors, even for the same level of performance in terms of the objective measures. So then, if you were to make a model based on that data, right, we have this sort of
systematic bias against the women instructors in the data. It’s not
that some instructors
are missing from the data set
in a way that’s not random. It’s not sampling bias;
it’s this measurement bias, where the data itself is
reflecting the biases of the people who are evaluating the teachers. And if you were to fit that model, right, if your model’s any good at all, even if you don’t include
a variable like gender, what will come back is that, you know, conditional on whatever
covariates you’re able to collect about the instructors, you’re
likely to find that the model will
predict that the women instructors will be less efficacious, if you are using the teaching evaluations as the measure you’re trying to reproduce. So that’s another example of bias. And it’s something totally
different than the sampling or selection bias in the first case. The last example, I think,
that we need to think about as well is basically where the
data is measuring reality perfectly, but the
reality has come about as a product of some sort of inequality. And so this example I’m gonna
give is a little bit silly, but it caught some attention in the media. And it has to do with what happened (it doesn’t happen anymore, thank goodness) when you Google image searched the word CEO. And you can probably
guess what happened: what came back was a
page full of men, right?
Like, just sort of what
probably many of us envision when we think of the word CEO. And this reflects the reality that CEOs are more likely to be men than women. Of course, it was an exaggerated
version of that reality; the ratio was more
imbalanced than it appeared in the real world. But it did reflect some sort of underlying inequality in the world. So there’s that issue.
I mean, in this case, it was a little bit more
egregious in the sense that the only woman that
appeared on the first page was CEO Barbie. Google has since gone back
to fix that, thank goodness. And so, you know, that’s a different representational harm that
we need to talk about. But, you know, that’s
sort of illustrating this other example I’m
talking about, where there is this sort of
underlying true inequality in the world, the data is
measuring it accurately, and then do we want the models
to reflect that inequality? So this gets back to, you know, what have been some algorithmic solutions to these issues. By and large, I think some of my favorite solutions have been less
hey, the data is wrong, so much as saying what the
right thing to do is to say, limit the extent to which this algorithm can impact this community
versus this community. So that’s to say, to limit the impact, to limit the rate of false positives and say black versus white
to be the exact same. And our very own, Alex
here has proven that under various circumstances where
the distributions truly are different, in the two groups, you could end up with sort of
counterintuitive situations where you’re doing very well and one predictive measure, that’s very, that intuitively seems very fair. But then you’ll end up with these others, when you look at it through
different lens of fairness some groups say something,
say like those false positive rates should be equal among groups, that that you can’t have
both at the same time. And so I’ll close really
quickly, you know, that that vein has sort
of been building values into the objective function
that’s, that’s used to learn the model that sort of what,
what’s going on, right? You’re saying, okay, we we
value having the false positive rates be the same in this
group versus this group? That’s not really a statistical decision. That’s normative decision. And so you know, I often hear like, why are we building values into models? You know, it’s statistics that science, it should it should be neutral. But it wouldn’t eat just
because you choose some sort of, say, some estimation procedure,
that off the shelf software, doesn’t mean that you’re
not building values into it, you may be building values
that you’re not aware of, but you are building in values, perhaps the values that
an error in this group is the same as an error,
this group, right, but you actually are building in values. So just because it’s
mathematically convenient, or software exists to
fit a model doesn’t mean that there aren’t values
encoded in that, itself. So I think, you know, a lot
of what we’re getting here, is we’re getting away from the idea that the data itself is neutral, and that the decisions about how you model that data are neutral. – Andrew, you wanted to add? – Well, I just love that
point, and of course, in social science, we
talk about utilities. And the term utility is funny, because it kind of sounds neutral. But obviously, social scientists
have thought a lot about
different people having
different utilities. And in computer science,
we use loss functions; it’s the same issue, so I agree. I also thought the example
of the CEO is interesting, because it reminds me of
a well-known math problem of
probability
matching versus prediction. So if you have a deck of
cards, and it has, let’s say, 70 red
cards and 30 blue cards, then
if you’re supposed to predict each time
whether a card is red or blue, you should say red, red,
red, red, red, et cetera. But of course, that’ll
make the aggregate prediction wrong. And that indicates an issue:
when we look at our
loss function, or our value function, if you want to be more positive about it, we might want to put a
value on the aggregate. And so it’s a math problem,
which is good, right? I mean, that means that
some of these problems can be addressed at a
more technical level. If we step back and think
like social scientists and ask the question carefully, then there really is a potential for
a lot of our technical
ideas to make contributions, as long as we’re not too
limited in what we look at. – So why are so many predictions of automated decision systems wrong? I mean, we have it in
election campaigns, we have it in recidivism, urban homelessness, loan default, and so on and so forth. Is it only bias? Alex, do you want to say something about that? – Sure. No, but that’s
not the end of my answer. – Good.
– So I think Kristian pointed
to some really important points
that need to be emphasized in any curriculum, or
really any collaboration that tries to speak to policy applications or consequential decisions, which is this idea that
our standard framing of these sorts of prediction
problems, which says, predict Y as accurately as possible according to loss L, is very limited. So I want to draw on two elements here. One is that, first of all,
we don’t expect predictions to be accurate in these contexts, simply because, imagine trying
to predict loan default, or who’s going to commit
a crime in the future. You can dream up
whatever data you want;
none of that is actually
going to be sufficient to get anywhere close to perfect accuracy on these prediction tasks. These are just inherently
complex phenomena. In technical terms, the Bayes error, given whatever set of
features you want to envision, is going to be extremely substantial. But this provides us with an opportunity, and I think this is an
opportunity that Leo Breiman would have characterized in
part with the Rashomon effect, which is this idea that
there are many models, and considerably different
models that actually differ significantly in their predictions, that all achieve essentially
the same accuracy, so that on most aggregate measures they will perform very similarly. But as far as their implications and individual classifications go, they are going to be very different, right? So we may have one model that’s as accurate as another, but where the different
models commit errors or how they commit those
errors is going to be very different in a way that’s not
captured by the loss function. And I think therein
really lies the opportunity. This is where, as Kristian mentioned, we can actually inject other elements, such as societal values,
into the sort of modeling that we’re doing, because two
mathematically equivalent or technically equivalent
objects may nevertheless be perceived differently by users; they may be responded to differently
due to different psychologies; they may be perceived
differently under the law. And the other thing
I’d
like to mention here is that as we think about, sorry... blanked on this, give me just a second. Sorry about that, awkward. But hey, so the other thing
I want to bring in here
is this idea that accuracy
is actually not the thing that we’re necessarily striving for. And this is illustrated
by the following fact: we’ve known through research
in the decision sciences,
going back to the pioneering
work of Paul Meehl in 1954 and the follow-on work of Robyn Dawes, again from the 80s, that even
simple statistical models will outperform humans on simple prediction tasks.
So if you ask doctors, will this patient be readmitted
to hospital in the next six months, you ask the doctor that treated that
patient, or you ask a simple logistic regression model. Well, there have been studies
that showed that doctors have AUCs that are
somewhere between 0.5 and 0.6 on this task, whereas you can perform much
better with a statistical model. And yet, if you ask people, even given that sort of
evidence on accuracy, do you want just a model making consequential decisions, or do you want a doctor or
person involved in this decision, overwhelmingly you will
hear that we want humans involved in the loop. And why is that? Well, what this points to is, again, this disconnect between
the simple prediction tasks on which our machine learning
models or statistical models perform extremely well and the actual objectives that we have. So in some sense, there is a disconnect between the decisions that we’re trying
to make, the reasonable basis for that decision and how
much of that is quantifiable, and the sort of tasks that our machine learning models are solving. And I think if we keep that in mind, we can actually make reasonable progress by, for instance, trying to incorporate some more of these values
that aren’t naturally captured with the sort of tool sets that we might deploy off the shelf. – Have a short comment on that? – I’ll be very brief. Is my mic on? Yeah, so I think, you know,
we have seen all these, like,
really great advances
using machine learning and statistical, you know,
predictive modeling in areas...
Alex pointed to one where
humans aren’t doing so well, and maybe models are doing better. But in particular, some
of the most impressive ones have been in cases where humans are really good at it, right? Like, is this a cat in the image, or is this a dog in the image? We can do that really well. But there are some cases where the sort of irreducible error is large. And I think a lot of
the cases where people are concerned about the
fairness and the ethics fall into that category. So predicting how someone’s
going to behave in the future is one of those categories. I think it’s just
fundamentally very difficult and fundamentally very random. And the other thing we
need to take into account is, it’s not just, you know,
whether there’s a cat or dog in the picture; that’s not
going to change based on what I do about it. But, you know, I could
actually intervene in some way to help a person. If I say, you know, they’re going to have some
bad outcome medically, there are different
interventions you could make. If we’re talking about, again, sort of the context I usually think about, which is recidivism prediction, so
if we’re thinking about whether someone’s going to be re-arrested, again, there are interventions you can take. So the prediction itself
isn’t the end of the story; it’s what can we do about it? And I think sort of the
final thing, since I did say I’d be brief, is I also think that in light of the fact that in many cases we
are dealing with models that aren’t super accurate, right, you
know, they are making a fair number of errors,
I think it is reasonable to start thinking about situations where technical solutions
or predictive solutions aren’t what we need to be doing. So having sort of the
technologists take a step back, defer to the policy folks, the
law folks, things like that, and see what other
solutions are available. – Joaquin has a comment, but
we’ll then go to questions from the audience. So if you could line up
behind the mics, please. So Joaquin, you wanted to say something? – Yeah, I wanted to
comment on two things.
One, I really like
what you said, Alex, about how
there exist multiple implementations of an algorithm that will
achieve the same accuracy. Or even, maybe, if you care
about procedural consistency, if, you know,
the definition of fairness you care about is equal
treatment, as it were, the outcome differences
of different
implementations that
have the same accuracy and that
implement equal treatment can vary a lot. And this is something that we actually have seen in practice. And so one of the things that, you know, we end up doing is actually monitoring
those outcome differences, and then, all things being equal, picking the implementation
that has the smallest sort of disparity there. And then another
thing that I think is very important to mention, and I think it’s pretty clear
in what you guys said earlier, is that there are other decisions, you know, such as
how your algorithm is being used in a broader context, that can also end up impacting people in very different ways, although at the
outset it seemed like they
didn’t make any difference; they didn’t explicitly, you
know, have different policies for different groups. And so that’s another thing
that I think is important
beyond the algorithm:
you know, if the data you’re collecting
is a result of some policy or some guidelines or something like that, you also need to explore
the space of guidelines that seemed like they were all reasonable, and then test things, you know, and sort of go measure, you know, what policies and what
guidelines seem like they have the smallest impact differences. – Okay, please, question? – [Alfred] Alfred Spector from Two Sigma. As this is the first ACM-IMS computer science and
statistics joint summit on data science, it would seem
that getting the definition of that term right actually is important, and not just in a pedantic way. The earliest definitions, I
think, that we heard here today were focused primarily on the technology and techniques of computer
science and statistics, and applied mathematics,
operations research, and
focused on that. This panel is thinking about a bunch
of issues relating, perhaps, to ownership and valuation of data, objective functions that we
need to set as we use data for certain kinds of purposes in politics or whatever, maybe behavioral
implications on individuals relating to humanist
issues such as free will (I don’t think that came
up, but it could have), and other topics as well. Do you think we should
explicitly agree as a group that our definition of
data science is holistic and includes these social
science and humanitarian issues within it? – Who wants to tackle that? – Sure, I’ll make one comment in response. – Alex.
– Which is that I don’t think that we want to be kind of imperialistic about it. I don’t think we ever want
to say that the humanities are now a data-scientific problem. That’s really not where we’re going. But I think that we need to
acknowledge as a community that insofar as we are
kind of encroaching
on territory, or making
decisions, or kind of constructing technological objects
that are having impact or consequences in
domains that are typically the purview of these other fields, we actually need
to meaningfully engage with those fields and
understand the problems, and potentially the problems
that we ourselves are causing, and the extent to which
we can propose meaningful technological solutions as technologists. – Well, yeah, we have two more questions. So I would rather go around
and then maybe come back. Please. – Hi, my name is Yong, I work for the International
Telecommunication Union. Thank you for the
panel, really great panel. I know there’s been a bit of
discussion about kind of
centers of
excellence, specifically bringing
people from
multiple disciplines together to hopefully come up with a more holistic and common language or
framework for being able to apply data science and
machine learning towards actionable projects at scale. I think my question is more in line with
how that can be
communicated across other sectors, or to non-digital-native
stakeholders who are not
necessarily up to
date with the terminology that we’re using right now, specifically in a way where we can empower
organizations or stakeholders who are data rich, such as, I guess, international organizations
like the Red Cross, and enable data-scarce, non-digitally-native stakeholders, like someone who’s a
smallholder farmer, for example. Just curious about it. – Another simple question. (laughing) Who wants to take it? Yeah, Andrew. – I think it would be
great if we can convey uncertainty and variation, and I think one of the good
things about machine learning in particular is there seems
to be a real tradition of that. So even when people are bragging, they’ll say, we
can classify this with 70% accuracy or 80% accuracy, which
means you can be wrong the other 20-30%. So I think that’s great. I think that machine learning is already going in the right direction on that. – And you think that the
Red Cross, and people from these kinds of organizations,
will be able to catch that? – I hope. I mean, I just like that. I think it’s better than
saying we’ve proved, we’ve discovered blah, blah, blah, right? Instead of that,
it’s like, we can predict with this accuracy. It’s such a
much more modest statement. I like it. – Okay, please. – This is M Hamada hajia khalifa, from University of Maryland. First, I want to say
actually, I checked that CEO search, and there are a few women; actually, a few women appear there. So just the first thing
is, search at Google, and there are a few women who appear. So that was just a side comment. And the other thing that
I want to say is that, I mean, here when we talk about ethics, I think there was a
question that was raised before about the meaning of data science. So one way that I
can define it, essentially, is that data science is just understanding the data in the past. So if you consider this
definition of data science, then sometimes we should
not blame algorithms for doing something wrong, essentially. So if something happened in the past, I mean, the algorithm
maybe correctly finds those things, essentially. Like, for example, consider people of color in, for example, police investigations
or something like this. I think
it might be better,
I mean, to consider that and
not blame the algorithm. However, I mean, the thing that
we are doing in game theory is to try to define some other functions or some other objectives
for the utility functions, and then consider a
combination of those such that in the future this does not happen. That might be a better way, maybe a more accurate way, of essentially considering ethics. – Any brief comment– – We’re supposed to stop, I think. – Sorry? – I think we’re supposed to stop– – I was given one minute.
– Oh, I’m sorry. – Yeah, so any brief comment, and then we’ll wrap it up. I mean, we’re wrapping it
up with any brief comment by any panelists. – I disagree with the
premise that data science is about understanding the past. I think that’s a very
constraining definition. And I think that it is
about understanding the past and that we can predict the future. – Great, okay. I want to thank all the
panelists for the discussion. (clapping) – [Announcer] Can you hear me? So lunch is upstairs on the second floor, off the escalators in the atrium, I’m told, and we will resume
promptly at one o’clock. Thank you. (mumbles) (mumbles) – [Suchi] Good afternoon, everybody. Please take a minute to settle in. It is my great honor and
pleasure to introduce Jeff. So, some important bits about Jeff. As is customary when we
do introductions of professors,
we say they’re a professor at so-and-so. So Jeff is an engineer. And he’s an engineer at Google. I’m looking at his bio, and it looks like he did his PhD at the University of Washington. It sounds like
MIT, Carnegie Mellon, Stanford, or, what was the other school they were competing
over, you know, ranking? Sounds like Harvard. All right, Harvard. It sounds like they all
screwed up and wouldn’t take him. And he moved to Google in sort of the ’99 era, you know, when people were making bad decisions; he made a few bad ones, too. And since then, it looks
like he’s been very busy. He’s been working pretty hard. And he is now a senior fellow at Google. He also heads Google’s research, and Google’s health research
in particular as well. And he’s just done a couple
of projects you might have heard of, so for instance, the implementation of TensorFlow, MapReduce, Bigtable, Spanner. And to me the most exciting and
impressive thing about Jeff is that he’s multilingual. He doesn’t just speak English; he also speaks code and puns. And so the thing I’ve always
wanted to know is, how does Jeff spend his time during the week? And what fraction of his
time does he spend
speaking in
English, code, and puns? And I’m hoping Jeff will talk
about that during his talk. Thanks, Jeff. (audience clapping) – [Jeff] Thank you very much. Can you hear me? Oh, and I need a clicker. Great, okay. So I’m going to talk to you today about some of the work... We’re going to be turned
on by the mic people there. Yeah, there we go. Okay, and I’m going to be
presenting work done by lots and lots of people across Google. This is not purely my work. It’s many, many people contributing lots of different things. So first, I’m gonna touch on the state of machine learning research in the world. There’s this amazing
explosion of interest in doing machine learning research. So arXiv is a paper
preprint hosting service. And this is a graph
showing you how many papers are posted, tagged with the
machine learning related subcategories, in various years, starting about 2009. And the interesting thing is, around 2009, kind of our nice Moore’s
Law computational growth that we’ve been so accustomed to for the last 40 or 50 years started to slow down,
and really hasn’t been growing at the same rate. That exponential growth
rate of computational power is shown by that red exponential. But it has now flattened out. But instead, we’re now
generating machine learning research ideas, instead of
more computational power, and at a faster rate than the Moore’s Law
exponential growth rate
has been in the past. So that’s a lot of
people entering this field, looking to do interesting things across lots of different disciplines. And one of the really exciting
areas of machine learning is a sort of sub-area of
it called deep learning, which is really the modern reincarnation of artificial neural networks, which have been around
for about 30 or 40 years. In fact, tonight, the Turing
Award will be presented to Yoshua Bengio, Geoffrey
Hinton, and Yann LeCun for their contributions over
a very long period of time, and making fundamental
ideas that we’re now seeing really work well sort of come together. And what’s really new is, you know, we have a bit of new network architectures and new kinds of techniques. But a lot of the fundamentals
that we’re using really are algorithms that were developed 30 or 40 years ago, that
we’re just now starting to really see work, in the
last eight or 10 years, as we’ve gotten enough computational power to make them do really impressive things. And one of the key
benefits of deep learning is that unlike a lot of other kinds of machine learning methods, where you sort of hand-engineer
a bunch of features and then learn the final
weights with which to combine
these sort of
handcrafted features, instead, in deep learning,
the neural network
learns both the features
and the output predictions
of the model simultaneously.
So it can learn from
very raw forms of data. And just some examples: you
can think of neural networks
as able to learn really,
really complicated functions.
You know,
this is not y equals x squared. This is, you know, you put in raw pixels, and it can tell you, that’s a leopard. Or you can put in audio waveforms, and it can give you a
transcript of what is being said, purely by training on examples of audio and what was said in that audio clip. So give it lots and lots of examples, and it can learn to
generalize to a new audio clip of someone saying
something completely new it’s never heard before:
“How cold is it outside?” “Hello, how are you?” Put in this input, and it can spit out text in another language (speaking in foreign language),
purely by training on
English-French
sentence pairs that are aligned, meaning the same thing. And you can do interesting things, not just classify pixels,
but you can actually train
these models to do sort
of simple linguistic tasks
from images. You know, can a
model write a simple sentence describing that scene, given lots of training
data of that form? “A cheetah lying on top of a car.” This
is actually one of my vacation photos, which was pretty cool. So just to give you a
sense of the progress in the last sort of eight or 10 years, Stanford runs an image
classification challenge every year called the ImageNet challenge. And 2011 was the
last time before people started using deep neural
networks for computer vision. And the winner that year got 26% error; the winner of the previous
year got like 28%, so it hadn’t really been decreasing much. And we know human error
on this task is about 5%. It’s basically, you get a color image and you have to give one
of 1000 categories to it. And it’s actually reasonably
difficult because there’s, you know, 40 different breeds of dogs you have to distinguish, and so on. So in 2016, the winner got 3% error, right? So we’ve gone from computers
not being able to see very well to now computers can actually see. And this has huge
implications in the world. So I’m going to structure
the rest of the talk around this nice Lyft of that… the U.S National Academy of
Engineering published in 2008, which is basically they got together a bunch of people from lots
of different disciplines and thought about what should
we be trying to address across all our all of our disciplines in the next hundred years, expect they waited eight years, so we now have 92 years
to work on these things. And they came up with
this pretty nice list of 14 different things. And, you know, I think
if you read that list, you’re like, if we solve these problems, the world will be a better place, it won’t be quite so hot, people will, you know, be able to learn
more, more effectively, healthcare will be way
better, lots of good things. And the interesting thing is, I think if you look at
machine learning today, it’s gonna be key to solving
a lot of these problems. And so I’m going to structure
the rest of the talk around how I think
machine learning is gonna help us solve some of
these grand challenges. Within Google Research,
we’re actually working on all the ones listed in red. But I’m going to focus on
the ones listed in boldface. Okay, so restore and
improve urban infrastructure is one of them. And we’re on the cusp of
something pretty amazing, which is autonomous vehicle, right. And if you think about if we really have, autonomous vehicles on the road working, let’s get to really
transform how we think about building cities, we won’t
need parking garages, you’ll just kind of cement
exactly the right size vehicle to come pick you up, you know, one seat vehicle for you. Or maybe you’re at Home Depot. So you need a pickup truck,
autonomous vehicle to come and get your lumber or whatever. And the thing that’s
enabling this is the fact that we can now fuse raw
sensor data, you know, LIDAR data, camera data, radar data, and other kinds of data, and use machine learning to build a model of the world around us, so that we can say, oh,
well, that’s another car, that’s a pedestrian, that’s a light post, it’s unlikely to move, and
then help make predictions about what actions we want
to take to safely navigate the world around us. And just to give you a sense, Waymo, which is Alphabet’s self-driving
car subsidiary, has been running tests in Phoenix, Arizona, over the last year with no
safety drivers in the front seat, and real passengers in the backseat. So this is not some distant far off thing. But we’re actually driving
around the roads of Phoenix, admittedly an easier
environment than most, because it’s really hot, there
aren’t that many pedestrians, the other drivers are
kind of slow, and it never rains. But it is a real urban environment that we’re driving around
in with no safety drivers. So that’s pretty exciting. You know, now that vision works, it’s going to be easier
to make robots do things. And this is just an example. One of the things you obviously
want a robot to be able to do is pick things up
that it’s never seen before. And in 2015, the success
rate on a particular task of picking up an unseen object, one the robot has never experienced
before, was about 65%. And one of the things
we’ve been doing is looking at how we can use multiple
robots: collect experience from multiple robots, have them
practice picking things up, and learn from their experience. And using that we were
able to get to a 78% grasping success rate. And then more recently, we’ve been doing more sophisticated algorithms,
refining our approach a bit. And we’re now at a 96% grasp success rate for picking up an unseen object. So we’ve gone from, you know,
basically a third of the time you can’t pick something up, to, you know, you can reasonably depend
on being able to at least pick something up and then string together sequences of actions to
actually build robots that can operate in messy environments. Another thing we’ve been
doing is how do you actually get robots to learn new
skills, rather than hand coding control algorithms, which has been the traditional
approach in robotics for many, many years. This is one of our AI
residents, demonstrating things to the robot. They also write machine learning
code, but at the moment, he’s giving us video data to emulate. And you can see the simulated
robot is trying to emulate what it’s seeing in the pixels. And on the right hand side,
it’s pretty interesting. You know, if you want to
teach a robot to pour, that’s a really complicated
hand coded control algorithm. And it would probably not work very well. But what we’ve done here is given the machine learning system
and the robot about 15 short video clips of people pouring into different kinds of
vessels from different angles. And then the robot gets
15 trials to try to train itself to pour, given the video data and its own sort of reward experience. And so with that, it’s able
to learn to pour, I would say at maybe like
a four year old level, not an eight year old level. But it’s pretty good. If you think about
what’s required to teach a robot to pour, that’s promising. And there’s all kinds of training data, like people cracking eggs on YouTube, and you can imagine training
on lots of kinds of data. Okay, another one was
advanced health informatics. And this is an area where I
think there’s just tremendous opportunity to make healthcare decisions be better in collaboration
with healthcare professionals. And one of the things that
I think is really exciting is can we use machine learning to bring, expert decision making and
spread it around to many, many parts of the world that
don’t have a high concentration of experts, and I’ll use diabetic
retinopathy as one example of a particular task. diabetic retinopathy
is the fastest growing cause of preventable
blindness in the world. Basically, if you’re at risk for diabetes, you should be screened
for diabetic retinopathy on a yearly basis. And so there’s 400 million
people at risk for this, who really should be screened every year. And it’s very specialized: you basically need an ophthalmology degree; someone with a basic MD can’t do it. And in India, for example,
there’s a shortage of more than 120,000 eye doctors. And the real tragedy is this
disease is very preventable if it’s caught in time. But if you don’t catch it in time, you’ll suffer full or partial vision loss before treatment is administered. And in India, 45% of patients suffer full or partial vision loss
before they’re diagnosed. So it turns out, now that
computer vision works to detect German shepherds and Dobermans and all kinds of other things, it also works for
assessing a retinal image into one of five categories, you know, from no diabetic retinopathy to severe. And it turns out you can
just adapt the basic neural network architecture we’re using for general image models and
train it on retinal images. And if you do that, you get
something that is basically on par with, or perhaps slightly
better than, ophthalmologists at diagnosing and assessing
diabetic retinopathy. So this was a paper
published at the end of 2016, which is pretty cool. But what you actually
want is even better care. And so it turns out if
you get the data labeled by retinal specialists, who
have much more training in diagnosing retinal diseases, and you use what’s called
an adjudicated protocol, so you essentially get three
retinal specialists in a room and you ask them, what do you
collectively think about this, rather than getting
three independent opinions, that turns out to allow you to get to a much higher level of accuracy. You want to be kind of all the way up in the upper left corner here. And so we’re basically on par
with retinal specialists now rather than general ophthalmologists. That’s kind of the gold standard of care. That’s what you want everywhere. And one of the things that the community has been working hard on is how
can we explain the decisions these kinds of algorithms
and systems are making to the users of the system, in this case, the ophthalmologist. And so
one thing an ophthalmologist might want to say is “show
me where.” So given this image, the model has output
moderate diabetic retinopathy, you know, a level three output, but we can actually use this
to show where in the image the system is focusing in
order to make that assessment. And that’s a really helpful thing. In general, interpretability
of these models is a really important thing
as we sort of pair humans with machine learning systems in order to make better decisions. Okay, the 14th of the
categories was engineer the tools of scientific discovery, which I think was just like a grab bag they threw in, ’cause lots of stuff fits in there. It’s a little less
focused than the others. But it’s clear that if machine learning is gonna be at the core
of a lot of the ways in which we do science, then we’re going to want good
tools for machine learning. And so one of the things
we’ve been doing in our group is putting together software that enables us to do machine learning research, and to try out things quickly, express machine learning ideas. And this is kind of the
second generation of system that we’ve been putting together
for deep learning research. And we decided we would open source this second generation
system, which is called TensorFlow, and we open sourced it at the end of 2015. And it’s had a pretty good
adoption curve, which is good. This is showing GitHub stars. So GitHub
is a place where lots of open source software is hosted, and people can star
repositories they’re interested in sort of following
or getting updates on. And that’s TensorFlow in the
green line there compared to a bunch of other open source
machine learning packages that were all created at different times. So we’ve had pretty
good adoption, you know, by lots of metrics. 50 million downloads for like a fairly obscure machine learning
kind of programming environment software
is pretty interesting. One of the things I’m
really excited about, and the reason we open sourced it, is the 1,900 contributors.
About a third of those are people at Google, and the rest are people from
lots of other big companies, lots of small companies,
people, independent students, making contributions to improve
this core set of systems that we’re
working with. That’s cool. And people have done interesting things with it. So, like, there’s a
company in the Netherlands that is building fitness sensors for cows, so that you can monitor your dairy herd and understand cow number
seven is not feeling so well today. And they use TensorFlow to
analyze the sensor data. A team between Penn State University and the International Institute
of Tropical Agriculture in Tanzania, built a
deep learning based model using TensorFlow that can run
on-device in the middle of a cassava field to help
you assess what disease your cassava plant has, and
how you should treat it. And this is a good example
of where machine learning really wants to run in lots
of places in the world, in large data centers,
but also on a phone in the middle of a cassava field
with no network connectivity. So it’s really important that we build the right hardware and software
tools and enable models to run on that and also
much larger systems. Okay, one of the things
I’m excited about is, how can we sort of automate
the process of creating successful machine learning
solutions for problems. And one of the things
that our research group has been working on is a
technique called AutoML, essentially learning to
learn to solve problems. And I’ll describe a few
pieces of research here, and then sort of show how it works. So currently, we’re you want to solve a machine learning problem, you get someone with
machine learning expertise, which is hard to find,
generally, and you have some data about a problem you want to solve. And then you have some
computational resources, GPU card on your desktop, or maybe you’re using cloud, so you use a bunch of
GPUs or GPUs or whatever. And then the machine learning
expert kind of spurs this, all together a few times and hopefully comes up with a solution. So what if we could turn
this into more like data plus computation gives you I
actually high quality solution for problems you care about? That would be amazing, right? Because the problem today
is that there are literally 10s of millions of organizations that have machine learning
problems in the world. Most of them don’t even realize they have a machine learning problem, let alone have machine
learning experts on staff. And there’s only 10s or
maybe hundreds of thousands of people in the world who have the skills to sort of know what to do to train a machine learning model. That number is growing, by the way. If you look at enrollments
in machine learning courses at every university, they’re skyrocketing, sort of like the
initial research paper graph, but it’s still the case
that it’s going to take us a long time to get enough
people trained to solve all the problems we should be solving. So what can we do? So one technique is called neural architecture search. So when you sit down as
a machine learning expert to solve a problem with deep
learning, you make a lot of decisions about what kind
of model you’re going to train. Is it like a 13 layer model? Is it going to use three by
three convolutional filters at layer seven or five by five? What learning rate should we use? What structure? How should the layers
be connected together? Is it just a linear thing? Are there skip connections? Lots of decisions like that. So what we’re gonna do is we’re gonna have a model-generating model, and it’s gonna generate
a model description. And then we’re gonna try
that model on the problem we actually care about. And we’re gonna do this, we’re
gonna generate 10 models, train them all, and then
use the accuracy of those generated models to then steer the model-generating
model away from solutions that didn’t work very
well and towards solutions that work better. Okay, so basically,
we’re gonna iterate a lot of times: we’re gonna generate models, train
them, generate models, train them, and eventually
we’re gonna get higher and higher accuracy models. And that’s great. They
look a little weird sometimes; this is not what a human might sit down and design. But this turns out to be
a highly accurate model for a particular problem, so that’s okay. And if you look at this, this graph shows you, for
image recognition, actually, the ImageNet challenge problem. The black dots on the
dotted line at the bottom are each a really
interesting model that was created by a team of machine learning
and computer vision experts, the best in the world, and advanced
the state of the art at the time it was
published. You know, these were all sort of developed in the last four or five
years at various times. That’s kind of the frontier
of computational cost on the x axis and accuracy on the y axis. Generally, you get more accuracy if you spend more compute. And so in 2017, we used
a set of techniques and got results looking like
the blue dotted line there. And in 2019, we’ve been able
to significantly improve that. And the interesting thing
is like this approach is better than all the
machine learning experts hand designing different kinds of models. And this is true for object detection too. So again, those bottom lines are various hand engineered architectures, and the top line there is an AutoML approach
designing architectures. Same thing in language translation. And again, the metrics vary,
but you want to generally be up and to the left. And same thing in tabular data: we’re starting to get really
good results on, let’s say, you just have some relational database, and you want to predict one
thing from all the other things. It turns out there’s
a site called Kaggle, which hosts a bunch of
these kinds of competitions, to see how well things do. The AutoML system that we’ve
put together is in the blue there, and it’s basically,
for the three problems on the right, better than 95 to 98% of all the Kaggle competitors. On one problem, it’s sort of around the 45th to 50th percentile or something. And they recently had
a one-day contest on a completely new
problem, with 76 teams, 75 teams, I guess, plus our AutoML software, and the AutoML system placed second. It was in first until
the last four minutes, and then Erkut and Mark
entered one more solution. But it’s pretty good, and it
turns out if you ensemble the Erkut and Mark solution
and the Google AutoML solution, it beats both of them, so that’s cool. So across a lot of domains, we can now create systems
that are working much better than the best human crafted solutions. So the other thing that’s
interesting is we need more computational power. Obviously, if we’re
gonna start training lots and lots of models for
each problem we care about in an automated way, we would like computation of this form to be as cheap as possible. And the really interesting
thing about these approaches is that deep learning really has a different kind of computational style than the traditional twisty
C++ code you run on CPUs. And the two properties it has are, first, that all the algorithms
that I’ve shown you, all the machine learning models I’ve shown you, are very
tolerant of reduced precision. If you do computation
to kind of one decimal digit of precision, that’s fine. And second, all the things they
do are different compositions of a handful of very specific operations, essentially linear algebra
kinds of operations: matrix multiplies, vector
dot products and so on. So if you can design
computers that are really, really good at reduced
precision linear algebra and nothing else, that’s perfectly fine. And you’ll be able to
specialize those computers so that they take advantage
of just doing those things and not everything else. So we’ve been doing this for a while, the first generation of
this system was designed for inference: once you
already have a trained model, how do you apply it and
get predictions from it for different problems? You know, if you have an image recognition model, you feed in an image, and it
gives you the predictions by running the model. And this is kind of the
first thing that we deployed in our data centers. We started designing it in 2013 or so, and it’s been in production
use since 2015, because designing chips takes a while. And it’s used on every search query. It’s used for all of our
neural machine translation, it’s used for speech recognition,
image recognition, and it was used in the AlphaGo
matches that our DeepMind colleagues played against
Lee Sedol in Korea and Ke Jie in China. And then we’ve been focused since then on designing chips not just for inference, but for inference and training. So the second generation
and the third generation do both inference and training. And they’re designed to
be connected together into large configurations called pods. And so the latest generation TPUv3 pod is 1024 of these chips. One of those is more than
100 petaflops of compute, which is, you know,
pretty, pretty reasonable. It’s reduced precision, so it’s not directly comparable
to top supercomputer sort of flops, but the top supercomputer I think is like 150 petaflops or something. And this is one system;
we have lots of these systems. And with this system, you
can train an ImageNet model from scratch in two minutes,
less than two minutes. Whereas that used to take several weeks. And you can imagine, this
just generally speeds up your ability to do research. If you can get an answer to the question you had in two minutes,
instead of three weeks, you just do a different kind of science. Okay, so what do we want? I think we’re training models
in kind of the wrong way. Like, we generally are training
a model for a particular task from scratch, or sometimes
we’re doing transfer learning or multitask learning of a
handful of related things, but not every
problem we want to solve. So I think we want a really large model that can do many, many things, but is sparsely activated, so that maybe it’s on 1000 computers, but
we only touch 50 of them to do one particular thing. And we touch a different 50 of
them to do a different thing. And we want to solve literally
hundreds of thousands or millions of tasks with a single model. And we want to dynamically learn pathways through this large model that makes sense for different tasks, possibly
sharing pieces of computation across different tasks, when it makes sense, but
having independent pieces where that makes sense as well. And the architecture should adapt, kind of like with neural
architecture search, and adding new tasks should leverage
existing skills and representations. So a cartoon diagram to conclude: you know, let’s say we have this system that can already do a
lot of different things. And we add a new thing. Oh, and the system is made up of components, each of which is some piece of
a machine learning model with state and operations. In a neural network, you know, operations might be forward and backward. So let’s add a new task. And now let’s imagine
running some sort of pathway search, like reinforcement learning, to find something that
gets us into a good state for this new task. So great, so now we have
something that has leveraged some of the representations
learned for other tasks. Now, let’s say we want
to improve that task, let’s add a component and
extend the model a bit, for this task. And now we have hopefully
a higher accuracy system. And that new component can also
be used across other tasks. And each component might be running some sort of little architecture search to adapt to the kind of data
that’s being sent to it. Okay, I’m gonna close with thoughtful
use of AI in society. I think, if you look at
the breadth of things that these systems are touching, it’s really important to think about how we want machine learning
and AI to be applied. And one of the things
that I’m proud of is that we as a company have put
together a set of principles, and we’ve published them
publicly, about how we think about new
applications of machine learning: should we use machine
learning for this problem or that problem? And there’s a lot of real issues here, like in particular, the second one, avoid creating or reinforcing unfair bias. If you think about what
machine learning systems do, they learn from data. And if you’re learning
from the world as it is, rather than the world
as you’d like it to be, you can perpetuate biases,
or even accelerate them, by being able to do things
in a more automated way. So just to give you a sense, what
we’re trying to do is apply the best principles we know
how to, but also advance the state of the art in, for example, how we eliminate
bias from these models. And just to give you an example, we’ve been publishing
lots of papers on sort of fairness and bias in machine learning. So this is an active area of research. We want to use whatever the
best known techniques are, but also push beyond those. So with that, I hope I’ve partially convinced you that deep neural nets are doing really interesting things across a broad range of fields. And they’re gonna affect
not just computer science, but like nearly every
discipline. You know, think about the fact
that computers can see: that has enormous
ramifications in and of itself, across many disciplines. So thank you. (audience clapping) – [Man] One question. – [Jeff] One question,
great, I will come back up. Here I am. – [Bin] Can I ask the question? Hi, very nice talk. So one question is, you hinted
at maybe energy consumption. So for the current self driving cars, how much energy, you know, do we
spend on top of the usual cars we drive, per day, like? – So the energy consumption
of autonomous vehicles is actually not all that
different from a traditional vehicle. It really largely depends on the drive train mechanism that is used. And it’s not that different,
whether it’s autonomous or not, I think there will
be energy efficiencies from the fact that it is autonomous. And if you were imagining
controlling a large fleet of these, you can imagine routing things more efficiently so that, you know, you need fewer cars on the road, perhaps and that cars appear in the
right place at the right time with the right size car. But that’s sort of
farther off kind of thing, once we have a lot of
these actually on the road. Does that make sense? – [Bin] Yeah, under the deep learning how do you prevent adversity
attacks in your paradigm? – Yeah, that’s another
area of important research like safety and adversarial
attacks on these is… I didn’t talk about it in this talk. But there’s a lot of threat
of research in adversarial techniques to prevent sort of images, that look like a stop sign, from being interpreted as a 45 mile an hour
speed limit sign (laughs). Thank you. (audience clapping) – [Kristian] Hi, it’s
my pleasure to introduce Daphne Koller. She is the
CEO and founder of Insitro, which is a machine learning based startup working in drug development. She’s also the co-founder of Coursera, which is a massive open
online course system, which I think many of us have heard of. She was the Rajeev Motwani
Professor of Computer Science at Stanford, and she’s now adjunct there. And she was also the
chief computing officer at Calico, which is an Alphabet company working in healthcare. So those are sort of
the posts she’s held. She also has a lot of
other feathers in her cap that I’d like to mention,
but I only have two minutes, so I can’t go through all of them. I did try to scrub
the bio that she provided, and also her Wikipedia page, and it is in itself a feather in her cap that she has a Wikipedia page. But some of the more notable
honors she’s received are that she was a MacArthur
Foundation fellow. She won the ACM prize in computing. She’s a member of the National
Academy of Engineering, and a fellow of the American
Academy of Arts and Sciences. And if you hadn’t heard of
her from all of these things, you may have seen her in the media, because she was named
one of Time magazine’s 100 Most Influential People in 2012 and one of Newsweek’s 10
most important people in 2010. So needless to say, I really look forward to hearing what she has to say. Thanks. (audience clapping) – Thank you, and thank
you for inviting me here. So for most of my career, with the exception of the
detour through Coursera, I’ve been working at the
boundary of disciplines, machine learning and the life sciences. And I use the word boundary
rather than intersection judiciously, because much
as I would have wished it to be otherwise, there hasn’t really been that much of an intersection between those two
disciplines until recently. And what I hope to convince you today is that that is changing, and that that’s actually going
to be a new era in science that I’m personally very
excited to be part of. So I’m going to start with
the problem we’re setting out to solve in this next chapter of my
life, which is the problem of therapeutics, or discovering the drugs that make people better. And here’s an area where there’s both good news and bad news. The good news is that when
you look at the history of therapeutics, in the last 50 or so years, there’s been a tremendous
amount of progress, with new waves of innovation that have been making a
difference in treating new categories of disease. So starting from vaccines is one example. biologic therapeutics,
which came about in the 90s, that have allowed us to
address certain types of cancer effectively, certain kinds of autoimmune diseases effectively. So of the current 10 top selling drugs, eight are actually biologic therapeutics, which is quite remarkable. Cancer immunotherapies, of course, have now gotten to the
point where some people with metastatic disease, not
as many as we would like, but still, are now able
to have a complete cure from their disease, which was unheard of even a few years ago. Targeted therapies for
genetically driven diseases are one of my favorites,
because it’s one of the first real advances that have been driven by the Human Genome Project. So cystic fibrosis is a
great example of that, where by understanding
specific genetic lesions that occur in the gene
CFTR, and targeting drugs specifically to those lesions, we’re now able to give an
effectively normal life to about 90% of cystic fibrosis patients, whereas a few years ago,
this would have been a death sentence. And then, of course, there’s a new
generation of therapeutic modalities that are emerging: gene therapy, cell therapies, and many more. That’s great, and that’s the good news. The bad news is this interesting graph, which, for those of us who
grew up riding the tidal wave of Moore’s law, is a
kind of ironic graph. It’s come to be called Eroom’s Law, which is, you know, this
is the inverse of Moore. Because this is a graph
on an exponential scale, where the y axis is the
number of drugs approved per billion dollars of U.S. R&D spending, and the x axis is time. And you can see that
consistently for over 50 years, there have been fewer and
fewer drugs approved per dollar amount over time, and that doesn’t seem to be shifting, despite all of this amazing progress that I’ve shown on the previous slide. So why is that? It is not the case that a drug that used to cost 100 million
dollars to develop start to finish now costs two and
a half billion dollars; by the way, the amount now quoted is closer to three. So what is going on here? What’s going on is that
we have a lot more stuff that we try in early stages of the drug development pipeline, but that over time falls by the
wayside, because of failures that happen at different stages. And each of those stages is really expensive and complicated, drug development is really hard. And so by the time you get to
the final regulatory approval of a single compound,
it carries on its back the costs of 10s of thousands of failures that have cost many millions of dollars. So when you think about
what that really means, it’s that this is ultimately
a problem of prediction. If we could predict earlier on which scientific
hypotheses are likely to work, which drugs are likely to succeed, we would be able to
eliminate a good portion of those failures so
that each successful drug might carry less weight on its back, be cheaper to develop. Now, machine learning
has gotten really good at making predictions. We heard about that from Jeff. So the question is, can
we use machine learning to address multiple prediction
problems that come up in the drug development process, so that we’re able to try
fewer things that fail and have a higher success rate? And so that’s really the problem that we’re looking to solve. Now, there’s many different
problems that one can focus on, in this space, I’m going
to focus on the one, that we’re starting out
with, but there are others, that one can tackle over time. But the fundamental
problem in drug design is, if I make this intervention, this therapeutic intervention in a person, what is it likely to do? What phenotypic effect
is it likely to have? So that’s the problem
that we set out to solve, and you ask yourself, well, obviously, this is already being solved
in drug design today. So what do people typically do? What they typically do
is use model systems, the mouse being the one that’s
most ubiquitously used. And so you put a drug in a mouse, you see what it does to the
mouse, and you say, “Oh, good, this is what I infer from that about what it is likely to do to a person.” The fundamental problem with that, and I know this is gonna
come as a shock to you, is that mice are not people. And furthermore, most of the
diseases that today represent a significant unmet need
are ones that mice do not get. And so people have gotten
really creative in using mice to build model systems for
disease, in ways that are, let’s say, creative but
might not be effective. So I’ll give you one
of my favorite examples, and I could talk for a lot longer on this: the mouse model for depression is called the forced swim test. You take the mouse, you put it in a pool, you let it swim until it drowns. If it swims longer, it’s less depressed. (laughing) Okay, now, then you
wonder why there have been so few psychiatric drugs
approved over the last few years. By the way, the ones
for neuro degeneration are not much better, but we
can talk about that later. So the fundamental thing
that we would like to do, is to use humans as a model
for humans rather than Mice. Now, of course, you don’t
get to experiment in humans. That’s not ethical, it’s not practical. But fortunately, there
are other data sources that we can bring to
bear in order to produce the kind of training data that we need to drive machine learning models. So one of those is human genetics. Each of us is actually
an experiment of nature, in which nature has perturbed the genetics of a particular person, and we can see the effects on phenotype and what it does to disease. And sure enough,
there has been quite a bit of evidence showing that drugs that have genetic validation, that is, where we see a
phenotypic effect in humans, using genetic data, that
is commensurate with what we would like to see from the drug that targets that gene, those drugs tend to be more successful. So this is one graph, one way
of showing that; there are others. This is the percent of
drugs in the pipeline at different stages,
ranging from preclinical all the way to approved, that
have genetic validation. And what you see is that, as you
move further in the pipeline, the ones that have genetic
validation are less likely to attrit, so you have a
higher percentage of drugs with genetic validation
among the approved drugs than you do at the earlier stages, suggesting that those
are ones that are likely to be more successful. That’s great. And sure enough, we now have
more and more genetic data that is available to us on people. So this is actually a Moore’s Law graph, not an Eroom’s Law graph. So the x axis again is an
exponential scale. Sorry, the y axis is an exponential
scale, the x axis is time. This is the number of
human genomes sequenced since the very first human genome in 2001. And you can see that that
graph is not only growing exponentially, it’s
growing exponentially twice as fast as Moore’s Law. So if you believe that this
historical trend will continue, then by 2025, we will have
approximately two billion human genomes sequenced.
That’s a lot of data. So that’s great, but genotypes
on their own are not enough; you need genotypes associated
with phenotypes, or traits. And this is a place where
progress is also being made; it’s not quite as fast. But there are resources like this. One of my favorites is the UK Biobank data set in the UK, and it measures, for 500,000
people, very detailed phenotypes along with the genotype. So things like biomarkers,
both blood and urinary, behavioral data, physical
traits and so on, as well as imaging of whole body brains, and so on and so forth. And now you can really
start to look at these data and associate them with genotypes. By the way, this is the one
of the best of the resources, but there’s other ones that are emerging, including one here in
the U.S called All of Us, which just kicking off, so this is great. And now you can say,
Okay, we have genetics, we have phenotype. So we can now do an association. This is usually done using
fairly simple statistical methodologies like linear regression. And you can now do these associations. And over the past 10 years or so, since this first kicked
off in around 2008, there has been a growing
number of genetic associations that have been uncovered, ranging from things that
are molecular in nature, like LDL cholesterol, to
things that are diseases, like Crohn’s disease or Alzheimer’s, even to things that are
not even diseases, like educational attainment, which turns
out to have a pretty strong genetic basis, so this is great. And you think, “Well, okay, we’re done, we have the genotype-phenotype mapping.” The problem is that this
is kind of challenging, because when you think about
the genes that are mapped to phenotypes, we don’t
really know what they mean. And they’re often quite
small in their effects. So the typical effect size... So first of all, look at these, and you can see that for
many of these diseases, there are hundreds of variants,
100, 200 variants, or loci,
genetic loci, that have been associated. Each of those has multiple
genes in that region. And when you start looking at
the actual genetic changes, very few of them are ones
that scream out at you, “I know what this means,
it’s in a coding region, it’s a loss of function.” A lot of these are regions of the genome that are the dark matter, where we don’t even know what that means. And the effect size
that they have on the phenotype is often quite small; you know, an odds ratio of 1.03 is typical. So how do you interpret
those and use them, as a way of discovering drugs? And sure enough, it turns
out to be a challenge. So if you look at this
example of this graph, again, where you think about how,
what is the extent to which, you get benefit from having
genetic substantiation for a drug, and this is a
different way of visualizing the graph that I showed before,
you can see that overall, you’re twice as likely, if
you look at the bottom… at the rightmost graph, you’re twice as likely to go
from preclinical to approval if you have some genetic support. But really, the biggest gains are… are coming from what are
called Mendelian traits, which are in the OMIM data set,
which are very high-penetrance, coding-level changes
where we know exactly what they mean. When you
start looking at the support from genome-wide association, that benefit is considerably lower. So the question that we
need to ask ourselves, is can we leverage genetics much better? So one question that you
might ask yourself is, well, maybe all those other
variants genuinely don’t matter, and there’s nothing to leverage. So beautiful paper from
psychotherapists group at Harvard, that came out at the
end of last year shows, that doesn’t seem to be the case. So he defined what’s called
apologetic risk score. I mean, he didn’t define
it, it was there before. But in this paper, he constructed
a polygenic risk score that doesn’t just look even
at the top 200 or so variants looks at two million
variants across the genome, and puts them all
together in a linear score called a polygenic risks score, that gives you this beautiful
Gaussian distribution of risk. And it turns out that as you
move to the right hand tail of this distribution to people
who are at greater risk, you can see that there’s
8% of the population, this is for cardiovascular
disease, that has a threefold risk of disease. If you
have enough small variants that contribute negatively
to your genetics, then you have a threefold
risk of disease, fourfold, fivefold. And as you move to the right,
it can be even as large as eight- or tenfold. So there is a lot of signal. And by the way, eight- or tenfold
is as much as you get from a Mendelian high-penetrance variant. So there is signal in this distribution; it’s just that we don’t
know how to interpret it. So how do we plan to go about that? Well, fortunately, biologists
have provided us with yet another valuable
tool, which is the ability to generate very large
amounts of biological data that are disease relevant, and I’m gonna talk about that next. So case study, just to
illustrate what we mean by this. This is an image of a stained
cell culture of neurons. The blue, the green are nuclei,
sorry, the blue are nuclei, the green are synapses, and
this is what it looks like. And now we’re going to look
at this in the context of a particular genetic variant
that we do not understand. This is a region in
chromosome 16, called 16p11.2, and for whatever reason, that region tends to have
a lot of genetic changes in the population. There are
people in whom it’s deleted, there are people in whom it’s duplicated. So this is the
wildtype, and it turns out, sorry, that in the deletion,
the autism penetrance is 75%. That is, for people in whom it’s deleted, 75% of them have autism. In the people in whom it’s duplicated, 40% have schizophrenia. There are around 25 genes in that region. And we have no idea which of
them causes these phenotypes, if it’s the same gene for both, if it’s more than one gene.
None of that is known; only those numbers are known. So in this 2017 paper, people started to use
a really wonderful tool for getting at the biology of this, which is to take cells,
such as hair follicle or skin cells, from people who have wildtype, or one or the other of these variants, reprogram them into stem cells, which you can now do because
of the great work of Yamanaka, for which he got the Nobel Prize, and then differentiate them into neurons. So we can do that for
any single one of you. We can take skin cells,
reprogram them into stem cells, and generate neurons, or cardiomyocytes, or hepatocytes, or whatever. Well, it turns out that if you do that for people with a deletion, you get an incredible excess
of synaptic arborization. And for people with the duplication, you get a significant depletion of that synaptic arborization. And you can see that by eye,
and you can also quantify that to a certain extent
by looking at these images. And you can see that the soma area changes, that the dendrite length
changes. This is great. And it gives you the first inkling of the mechanistic process
that generated the disease. Now, the thing is that to do
this work takes three years. And it was done in about
three patients of each type. So it’s an incredible
amount of work to do this. But what if we do this differently? So biologists have now
in the last few years created these remarkable tools. The first one is the one
that I just told you, which is the ability to
create cellular models that are disease relevant.
Up until five years ago, most biological experiments
were done in cancer cells. Now, not only are these cancer cells, so they’re not really normal, they’ve been passaged so
many times that by the time you get to look at them, they have eight copies of one chromosome and zero copies of another, because these cells are genetically unstable. Today, we can do this for
cells that really resemble normal human biology. We can perturb them
genetically using CRISPR, to do very fine grained
changes that might mimic the genetics of disease so that we can now start
interpreting those variants. And we can phenotype
those cells using imaging just like the ones I showed you before, to really create give us an
inkling of what’s going on, at the cellular level for patients that have different
variants of the disease. And, importantly, putting in
another piece of progress. Because of high throughput automation and Microfluidics, we
no longer have to spend three postdocs pipetting for three years, in order to get data,
like the ones that we saw, we can really generate incredible
amounts of data at scale, to allow us to have this
level of understanding. And that, and so you can
really build at this point, a Data Factory, if you will, that allows you to
generate this kind of data in a remarkably scalable way. So that, of course gives us way more data than people can interpret. And that’s where machine
learning comes in, is to interpret data that looks like this. So that brings me to machine learning. and I’m gonna actually
skip past these slides, because Jeff did such a great
job of laying the ground for me about how much
progress machine learning has made, from 2005, where we could barely tell
what was in this image, to 2012, where we kind of were
guessing sort of okay, to 2017, where we’re able to
capture these at beyond human-level performance. And again, thanks to Jeff, I don’t need to take you
through the fact that the key benefit here is
that people no longer need to construct features to tell the machine
what’s important to look for in these images or not; you just give them the raw pixels as data. Now, that was important
when you’re looking at cats and dogs and bears. It’s even more important
when you are looking at synapses and neurons. Because whereas people are really good at looking at cats and dogs and bears, they’re really not so good at looking at images of neurons and synapses. And so their ability to
construct meaningful features in that domain is considerably lower than in the case of natural images, where even there
machines outperform humans. So certainly, when you
start to look at synapses, you would imagine that machines will considerably outperform humans. So one thing that Jeff didn’t
touch on, and is important to what we’re doing, is the understanding that this potentially gives us about the underlying space
that we’re looking at. So when you construct these
neural network models, and you look at the next-to-last layer of the neural network, it gives you an embedding of what is a very high-dimensional space into a
much lower-dimensional space. So if you think, just as an
illustration, about the space of all images that are,
let’s say for simplicity, 1000 by 1000, and let’s say for
simplicity grayscale, these images live in a
million-dimensional space. Most of that space is filled
with white noise images that don’t mean anything. The
space of real natural images sits in a much lower-dimensional
manifold within that space. By training the neural network
to a particular objective, you tell it, how these images
in that low dimensional manifold relate to each other, and how images that you would
like to be labeled similarly, need to be next to each
other in that space. So when you’re labeling, when you’re forcing the
network to label images with natural categories,
it’s going to be forced to put these two images
of trucks, which visually are very distinct in terms of
color and pose and texture, together; you’re forcing the network
to learn that they need to be next to each other. Whereas other things
that are somewhat related, but not quite the same,
like tractors and cars, are going to be relatively
adjacent, and cats and dogs are going to be sitting
somewhere else in that manifold. That manifold gives you,
if you look at it, a deep understanding of the… of the space, and
therefore can tell you something about the
space we’re looking at, and I’ll come back to that in a minute. Okay, so now we’re going
to take machine learning and we’re going to see
how you might apply this, back in the domain that we talked about. And so machine learning
applied to images of cells is not a new thing. In fact, one of the papers
that I’m actually still the proudest of that I did
just before I left Stanford, in 2011, was the application
of machine learning to digital pathology. This was back in 2011, pre
the deep learning revolution. So we were using traditional
machine learning. But fortunately, I had a
pathologist as a PhD student who was able to construct
meaningful features and was also open minded and said, “We’re going to construct
many, many, many features that pathologists have
never ever looked at.” And that turned out to
be highly significant. So when we looked at these tumor
images under the microscope with these hundreds of
features that pathologists had never looked at, we were able, via machine
learning, simple machine learning, to get performance that was better than the average pathologist. So what you see here is what’s called the Kaplan-Meier survival
curve: x axis is time, y axis is the patients who were still surviving at that point in time. That is a curve that unfortunately
goes down and to the right. And what you see is the stratification between high-risk patients at the bottom and low-risk patients at the top that the machine learning
was able to come up with, and you see a pretty good stratification that was at the time better
than your average pathologist. What was equally important,
though, is that the features that the machine learning
was using were exactly the ones that pathologists
had never ever looked at, demonstrating that
looking under the lamppost is usually not the right strategy. So fast forwarding now to 2018, now you have machine learning algorithms that are doing this type of problem not only better than
the average pathologist, but better than the best pathologist. And you can see on the left
is the Kaplan-Meier Curve for the pathologist and on the
right, the Kaplan-Meier curve for the machine learning:
better stratification. And what I thought was
actually even more interesting is the ability to do something pathologists just don’t do at all, which
is, by looking at these images, tell not only what the prognosis was, but also what is the
driver genetic mutation that gave rise to the tumor, which has huge clinical significance. So I think this is a demonstration that when you look at these images, machine learning is able to do
things that a person cannot. That’s, again, the good news. The bad news that I’d
just like to highlight, because I think it’s a good
warning lesson to all of us, is that machine learning
is, when applied to biology, something that one has to
be really careful about. And I think that’s true in general, but it’s even more true in biology. And I’m going to use this as an example because it was a published
piece of work that focused exactly on this. There
are plenty of people who don’t realize that their
machine learning is doing this, and don’t bother to check. Well, but this is… So this is a paper that
looks at identifying fractures from x ray images
like the one on the left. And what you see
in the graph on the right is the ROC curve. Look at
this, this is pretty good; you know, it could be better,
but it’s doing pretty well. Well, these folks were
smart enough to actually look at what the network was learning. And these are the clusters that you get in the low-dimensional
manifold, visualized by t-SNE. And you can see that the fracture versus non-fracture cases are
actually kind of intermixed in the clusters that you learn. What’s not nearly as
intermixed is the scanner that took the scan. So it beautifully
stratified the clusters on the scanner that took the scan. And it turns out that
when you correct for that, the performance becomes effectively random. So what the machine learning was learning is what scanner took the
scan, and using that to infer whether the patient had a fracture or not. Now, we can speculate why
it is that certain scanners are used more predominantly
for fracture cases versus not; that’s beside the point. The point is, deep learning
is really good at homing in on subtle signal; it’s also really good at homing in on subtle artifact. And artifacts are really
rife in biological data, because biological systems
are very sensitive. And so you can really create
artifactual results that have zero biological value. And that’s one of the things
where industrialized biology really can make a difference. So coming down to what
it is that we’re actually doing to put this all together: we’re building what we think
of as the Insitro human platform. We can take data from
patients and controls. There’s not gonna be a lot
of these; if you’re lucky, you get cells from a few hundred patients, maybe 1000 patients, that’s
about as many as you can get. You put them in what is
a 100,000-dimensional space that you can look at when
you think about the images, or single-cell RNA-seq, or metabolites, or whatever (100,000 is
actually a lowball estimate), and you
don’t really see anything, because this is a very
sparse number of points in a very high-dimensional space. But now is where you can
use some of the tools from genetic engineering that we’ve talked about
to considerably expand the space of genetic
backgrounds that you look at, so that now, all of a sudden, you don’t just have the
patients and controls in your space, you also
have lots and lots of other types of genetic backgrounds
that you can now measure and look at the cellular phenotypes for. Now, all of a sudden, structure like that low-dimensional manifold emerges, using the right application of machine learning tools. And what you can now do with that is to look at this diagram and notice, whereas
you couldn’t before, that there are distinct clusters in
the space that correspond, hopefully, to coherent
patient populations. And each of those patient
populations might benefit from a different therapeutic intervention that will allow you to
revert the phenotype from sick to healthy. So by doing this, you’re
now also able to move away from the current paradigm
of doing medicine, which is to take an entire
ensemble of patients who are characterized in their disease by very coarse-grained phenotypic summaries and say all these should benefit from the exact same medicine. We know that’s not the case. The thing that’s made oncology
so successful in the last few years is the fact
that we now understand that breast cancer is not one thing, and that
certain types of breast cancer are actually much more
similar to ovarian cancer or pancreatic cancer than
they are to other subtypes of breast cancer. And by giving HER2-positive
breast cancer patients a very different treatment than you give someone who has a BRCA mutation, you actually end up getting
the kinds of success rates that we’ve been getting in
oncology, by understanding that there’s so much
variability and heterogeneity in the patient population. In the CFTR example, the cystic fibrosis one that I gave you earlier, by sequencing the CFTR gene, you know exactly what is the genetic
variant that caused the disease, and there are
three subpopulations, each of which has a different problem. That’s what made this success possible. So if we’re able to do
that for the diseases that we don’t currently have
this characterization for, like autism, which is called
a spectrum for a reason, then maybe we’re able to now
understand those subpopulations and construct the appropriate intervention for each one of those. So in order to do this, we
need the right combination of biological data factory
and machine learning. We need to construct biological
data at unprecedented scale, we need to be able to interpret that data using machine learning, and we need to feed that insight from the machine learning directly back to the bio data factory in a very tight closed-loop integration. Yes, almost done. And in order to do that,
we also need to have the right group of people, we need to have people who are bilingual and really able to talk the
languages of both biology and machine learning at the same time. So I’m going to end, see (laughs), with a philosophical note that brings us back to the beginning. If you think back about
scientific progress, you can see that in
certain times of history, science has proceeded in epochs, where in a certain short period,
a particular discipline has made an incredible amount of progress in a short amount of
time, due to a new insight or a new technology. In the late 1800s, that
discipline was chemistry, where we moved away from alchemy
and turning lead into gold into an understanding of the
periodic table. In the 1900s, in the early 1900s, that
science was physics, where we split the atom and
understood the nature of the connection
between matter and energy, between space and time. In the 1950s, it was digital computing, where you got a device
to do computation that up until that point only
a person had been able to do. In the 1990s, there was a
really interesting split. On the one strand, there was
data science, which emerged directly from computing, but I
think it’s a discipline that stands on its own, because it also incorporates elements of statistics and others. On the other strand, there was
biology, or quantitative biology, which drew on all those
developments in chemistry and physics, and has
given us the sequencing of the human genome, and the ability to measure
transcription profiles, and the ability to
look at cells with super-resolution microscopy
so that you can actually pinpoint individual
molecules within the cell. And these two strands
basically proceeded in parallel with very little interaction between them. I think the next epoch of
science, and in this I’m actually echoing in some ways Steve Jobs, who made this prediction
multiple years ago, the
next big epoch of science in this century, is going
to be the integration of biology and technology. And you can call this
biological data science or biological engineering or, I don’t know what it’s going to end up being called. But it’s the ability to measure biology at an unprecedented scale,
to interpret what we see, using machine learning techniques and to feed those insights
back to engineer biological systems to do something that
they don’t normally want to do. And I think that discipline
is going to transform the 21st century. Thank you very much. (audience clapping) (mumbles) – Sir, please pass the microphone. – So I have… Well, thanks for the talk. Very interesting one. I’d like to ask you two questions. I think the first one is
easier than the second one. So let’s start with the first one. When you compare the performance of the machine learning algorithms for the detection of different diseases versus the specialist, I never see in the comparison how many resources are being
used by the algorithm to develop these predictions
versus the prediction that is done by the
specialist, by the human. – Sorry, I’m having a really hard time. Maybe you can speak
closer to the microphone. I’m sorry. – [Audience member] Now is better? – Yes, thank you. – [Audience member] So that. (mumbles) – No, I think this is yeah– – [Audience member]
Okay, the first question is about how to compare, the predictions made by the algorithm, in your case, versus the predictions made by the specialist, not
only on the actual performance of prediction, but on the efficiency of creating that prediction? For instance? – The question is about–
– yeah, I get it. – How many resources are needed to do– – Yeah, yeah, so I think
there are two questions: one is on the computing resources and one is on the accuracy. So this is very hard
to do apples to apples, because we’re not doing diagnosis here. We’re doing drug development and so, the ultimate product,
if you will, is a drug. And so the question is, fundamentally, does the drug work better
than the standard of care? That’s something that we
will only be able to tell when we actually have a
drug and put it in humans and it undergoes the standard process of randomized clinical
trials, as one should. So the comparison, this sort of, you know, neck-and-neck comparison, if you will, between predictions made
by a human and predictions made by a computer, which makes a ton of sense
in the context of diagnosis, doesn’t really make that
much sense in the context of drug design. As to the compute power
that’s used, we don’t know yet. It partly depends on how
well Jeff and his team are gonna be doing in
creating new versions of GPUs and new machine learning
algorithms that we could leverage. But when you think about
the fact that drug design from beginning to end is
approximately a 10-year process, if you’re lucky, then if it takes you a
month to do the computing versus two months to do the computing, honestly, it doesn’t matter very much. Even the costs don’t matter relative to the $2.5 billion worth of costs that it’s currently costing us to run development, which
is one of the reasons why I think this is so important, because any improvement, even if it’s 10%, and I hope
we can do better than that, is going to be hugely impactful. – [Audience member] Okay and the– – [Kristian] We won’t have
time for your next question. Thank you very much. (audience clapping) (mumbles) (wind blowing on microphone) – [David] For the last session. (mumbles) – [Michael] May I have a seat? – [David] Now Have a seat, have a seat. You’re gonna introduce yourselves, Mike. – [Michael] Oh, my God,
those lights Jesus Christ. – [Jeannette] It’s like being on stage, (laughing) – [Michael] It’s Klieg
lights, I can’t see anything, is there an audience out there? – [David] I need to keep track of time, because we have to be– – [Jeannette] Oh, we have– – [David] No, not that, I will
use yours, okay it’s fine. Good afternoon, welcome
to the final session. So for the last session,
it’s a smaller panel, which I’m happy to moderate. Jeannette and Michael will
introduce themselves in a second, we thought it was appropriate
to conclude the day with a session that was
more forward looking. So it’s about the
future of data science. And we have a hard stop here, we have to exit this room
before three o’clock, so. – [Jeannette] That’s why we
have been rushing you all day. (laughing) – [David] Yes, so without further ado, I’ll ask the two panelists
to introduce themselves. And then off we go. So, Jeannette. – I’m Jeannette Wing, I’m the director of the Data Science Institute
at Columbia University and a professor of computer science. – Mike Jordan, across the bay, Berkeley, computer science and statistics. – Great. So
this is going to be quite unstructured. I’m hoping that our
distinguished panelists will talk about many things. But I’m particularly interested in hearing their views
on how data science… how research
universities should structure themselves in the era of
big data and data science. That’s one question, I know. And another is about the relationship between data science and AI. So, I don’t know, who wants to start? – So I’d like to take the first question. In March of last year, I ran a summit, which collected 65 participants from about 30 different universities. All of these participants
were leaders in some way of some data science entity on campus. And just to acknowledge the
funding sources, it was NSF, the Moore Foundation, and
the Sloan Foundation. It was a wonderful meeting, because the bottom line was, it’s all over the map. And many universities are still struggling, or I should say grappling,
with where do we put this data science thing? So there were a few models that emerged. And there’s no one right answer. One is where, for instance,
at Columbia University, the Data Science Institute
is actually university level and university wide. It doesn’t belong to any one
school or any one department; it’s its own thing. Another model, for instance,
at Yale University and I believe at Carnegie
Mellon, was, if you will, a renaming of the statistics department. So now it’s statistics and data science. Take it as you wish. By
the way, at Carnegie Mellon, there already had been and still is a machine learning department. So what’s with that? – [Michael] Still will run
another number 40 Year. – Another model is the Berkeley model, where they created a thing called
a division, I believe it is. – Let’s call it a college for this audience. (laughing) – Well, I was told not to actually use… I’ll use the word division ’cause that’s officially the word. And as of now it’s data
and information science or something like that. – It’ll include computing at some point. (laughing) – And MIT, for instance, before
the new College of Computing, before the huge gift, there was an Institute for
Data, Systems, and Society. And I remember going to the inauguration of IDSS a couple years
ago, and I asked Munther, who was leading it, I said,
“Who do you actually report to?” And he says, “I report to five deans.” So there’s yet another model (laughs). And then there are, of course,
more traditional models, where the data science
entity will live within a school or a college on campus. So literally, it’s all over the map. Many schools are starting
to think about creating a school of data science or a school of data science and computing. So it’s really a challenge for any
university administration at this point. – And the right way is? (laughing) – I like the Columbia
way, but I’m gonna say that, of course.
– Mike? – I think of this more as what’s
happening in the real world and less about academia. I think that’s where the
first thrust really was, was in industry. And I think that it’s misnamed. I don’t think this is
about science so much. I mean, we had big data sets in science, but that’s just
statistics. What’s new, I think, is something
more akin to engineering. And that term, engineering, I think we take too narrowly.
Yeah, we think cold, affectless machinery, and
loss of control by humans. But this is a new kind of engineering,
systems that really work in the real world for real people, and reflect people’s desires,
and preferences, and so on. That’s what it’s about. It’s about a human oriented, data oriented engineering field. So I kind of take the historical reference as maybe think about chemical engineering. So think about, I didn’t do
the history, I’m gonna do it. But think about the 20 years or 30s, there were ingredients,
there was chemistry, there was fluid flow, there
was this and that lying around that people were trying
to put it together. They didn’t do a particularly great job at building factories at
scale in the very beginning, but eventually figured out how to do it, and new principles were needed. And eventually a name given to that thing, it was called chemical engineering. And now there’s a department
of that everywhere. That took a few decades. I don’t want to give a
name to this too soon but I think it’s more about engineering. It’s about the Uber’s of
the world, the Amazons and the services and the markets that are being created about that. It’s about market principles
on top of flows of data. It’s about the answer,
certainly in the real world that comes out. And so I think that academics, we kind of have to absorb
that those those millennials, changes, and eventually
created academic discipline and structure and all that. But that’s putting the
cart before the horse to think about what is
just define this thing. And let’s, think about the
academic version of it. No, let’s reflect what are the
real world consequences of, thinking about an engineering
field that is affecting, you know, 500 million people daily, and is, you know, flows of
commerce, medicine, you know, finance, you name it,
and make those systems actually work and not work just for today, but work overtime spans, and
not create messes around them. So I think that I hope you
see my latter comments. We’re not there yet. – But let me, may I.
– Please. – So David said we can riff,
so let me challenge Mike. One purpose of today’s meeting was to convene the computer science and
the statistics communities to come together and
talk about data science. But the other purpose
was to really focus on what we’ve been calling
foundations of data science. Whether you call it
science or engineering, I think that’s not the point. So do you see that by bringing
together various disciplines, and I also want to include
signal processing through EE and optimization through OR and so on, do you see that it does make sense to talk about foundations of data science? – Yeah, I think you’re
going back to academics and I’m trying to pull
us out of academics, ’cause I don’t think that’s where the real phenomenon is happening. So you know, if you try to have
a service in the real world, think about a recommendation system. So you know, spend some time
thinking about a recommendation system that will recommend
books or movies to people, does it at scale, handles the long tail, and is economically viable. It’s not just a piece of AI software that people hopefully use; it really is a financially solid thing. But it’s mostly been done in the computer for virtual objects, like movies, where you can copy the bits and all. Think about doing recommendations broadly, in all kinds of real world situations: recommending streets for people to go down, you know, recommending
restaurants for people to be at, and recommending the financial services that people will avail
themselves of, what stock to buy, what medical thing to
use, and so on and so forth. You can start to see
there’s gonna be conflicts, there’s gonna be scarcity, there’s gonna be game theoretic issues. And if you don’t face those on day one and think of an engineering field that includes all those issues, you’re not building a system
that’s really gonna work. And we see that, where people
are building lots of systems that kind of work for a little while. And then there’s all these
problems that come out, because they didn’t think
through all of the incentives, all the consequences,
all the externalities around those systems. So, yes, to build those kinds of systems, you have to know what academic,
you know, past is needed, what some of the foundations might be. But it’s kind of obvious:
it’s statistical inference, it’s prediction, it’s scalability,
it’s economic principles, microeconomics, game theory and so on. But it’s about looking at the problems
you’re trying to solve, and less about the historical fields that have gotten certain
names associated with them. – Can I... Be that
as it may, you know, there’s a risk that it’ll work out badly if we don’t go
about it in a planned fashion. And particularly on my mind
as a statistician is that, in any major university, there are statisticians
all over the place. They call themselves
psychometricians or biostatisticians or
econometricians or statisticians, and it’s a problem, because
they don’t talk to each other. So there’s a real, you know, we’re missing an opportunity
to have cross-fertilization– – [Michael] Again, you guys
are going back to academics. And I know that academics
haven’t talked to each other for 3,000 years. I mean, that, you know... (laughing) But no, I mean, at
Berkeley, we have a model where we talk to each other. I don’t know what’s
happening in other universities. (laughing) But we talk to each other
across all those disciplines. And sometimes the same ideas
get reinvented and all that. Especially our grad students,
they talk to each other, it’s just done. I mean, I think these academic issues are actually secondary. You know, so yes, we have
to figure out how to teach the undergrads, and that’s one
thing we focused on at Berkeley. The grad students will figure it out. I’ve got no worry about them. And how to teach
undergrads, that’s an issue which we could talk about. But, you know, the issues
are the real world ones; the academic ones are secondary. – Should we have a PhD in data science? – No.
– We will. – In 20 years, just like
chemical engineering eventually got a PhD, I doubt
they had one on day one. – But we will, like within
the next three years– – And we might have– – Well, NYU already has one. – Which is like how we had machine
learning at Carnegie Mellon. No one else copied that, 30, 40 years later; it wasn’t a great idea. – It was or was not a great idea? – Was not a great idea. (laughing) – Okay, so what
about this new division that Berkeley is creating? – So data science, temporarily,
until it eventually evolves into its big form, is an umbrella term. It’s the first umbrella term
I’ve seen in my whole... 25, 30 years. So think about someone
who does databases, right? They work with data, right? So if you ask them,
are you a data scientist? They’re mostly gonna say,
“yes, yes, absolutely.” If you had asked them before, are
you a machine learning person? they would have said “no.”
If you had asked a statistician, they would have said “no.” Well,
statistician, you work with data, aren’t you a data person? No, I’m a different kind of data person. But with data science, every statistician thinks they’re a data
scientist, and rightly so. And a lot of other people do too, including the database and
distributed systems people. So that’s why it
was a useful umbrella term on the academic side:
because it allowed all those people to feel
like they did belong together, and they would start talking together. But that had already happened in industry. You couldn’t go into Amazon
20 years ago, or Google, and find the database people
completely separate from the statistics people;
that just didn’t happen. They had to solve a problem together. Yeah, so no, I think that
it’s, just like engineering or, you know,
the chemical field in the past, not a department-sized thing. It’s not an institute;
that’s too small for it. It’s an umbrella term, so it should be one of the new engineering disciplines emerging. So you know, think about
the colleges we have now in universities: biology,
humanities, arts and science, physical sciences. They
all existed 100 years ago. What new college have we had
in the last hundred years? Basically nothing, right? And people wanted,
well, a College of Computing; that’s too narrow. It should include inference, minimally, and ideally economic and social principles. So now we’re starting to perceive that that’s the college, that’s
what should get a college. And just to say, at Berkeley, we
know it takes six years to create a college, because
the Regents have to rule on it, so we couldn’t call it a college yet. So we have to call it a division, ’cause you can do that today. But it will be a college,
because that’s the right level of granularity for the whole enterprise. And I don’t think it should
be called data science, ’cause I don’t think in 10
years it will be data science; some other nice new terminology will arise. – So what about AI? Switch topics: AI. First, what is AI? Where does it fit in all of it? (laughing) – So let me jump in first this time. So I wrote a blog post
last year. Xiao-Li Meng is in the audience, and he’s
putting it into his first issue of the Harvard Data Science Review, if you haven’t already
advertised that to this crowd. – [Xiao-Li] I did. – You did?
– He did. – He shamelessly–
– Several times, probably. Yeah, so it’s trying to
do a little bit of both a chronology and also a kind
of typology. What is AI? I use this phrase:
it’s an intellectual wildcard. Right now, people use
it when they don’t know what they’re talking about. And yes, I have Mark Zuckerberg in mind. (laughing) But there are many, many things
that are not classical human-imitating AI of the
Turing or McCarthy style. There’s something that, you know,
augments human intelligence with computers. You know,
a search engine does that. You know, it takes in
massive amounts of data, it processes that, it
makes us more intelligent. That’s what’s been happening
in the last 40 years. And there’s something else,
like Internet of Things, data flows, and the medical
system, commerce and all that, sort of big systems level;
you put a thing around that. That’s another kind of thing. I would not call all those AI. I think that’s just not,
you know, it’s, I mean, first of all, none of it’s
about intelligence per se, meaning thought.
It’s not about thought, it’s about systems that
work and make predictions and make inferences and
decisions, and so on and so forth. So it’s not about thought,
cognition, or intelligence per se; it’s about systems. – So I would like to address
both of these topics with my own perspective. I completely agree with
Michael that, right now, data science, whatever
you want to call it, what’s going on and what
people call data science, is an umbrella activity. It’s one of the reasons
that I think the way in which Columbia structured
the Data Science Institute makes a lot of sense, at least for now. And whether it will evolve
into a college or school, that’s for the future to determine. But right now, in terms of
the way that data science is an umbrella activity, it involves multiple
disciplines, for sure, foundational ones like
math and statistics and computer science
and EE and so on, but also every other domain on campus. And so the Data Science
Institute at Columbia, for instance, interacts with, collaborates with, and brings in faculty
from every single school: law, business, journalism, medicine, social work, you name it, it’s there. We have many faculty who are part of the data science effort on campus. I think the same is true at Berkeley. And the same will probably
eventually be true at most other campuses. So it’s an umbrella activity;
whatever you want to name it, it’s happening, because everyone has data. And everyone wants to do
something with that data, in terms of extracting the
knowledge and information and the value that that data contains. So that’s the umbrella
thing, and on that word, umbrella, I’ll agree with you, Mike. On the second question, of AI, you know, this is just me
liking to partition the world, but I see AI, data science, and machine learning all overlapping. They’re not all the same thing. I think the AI community
would say machine learning is part of AI. And then data science and AI
and machine learning overlap. Data science involves more
than just the analysis of data; it involves everything from
generating data, collecting data, processing the data, storing
the data, managing the data, visualizing the data, interpreting
the results, and so on. The analysis is
what has made data science so cool now, because of machine learning, and machine learning is part of AI. So that’s my partitioning of the world. And of course, the
database people are gonna think that they are part of
this entire spectrum of the data life cycle; we can’t do without them. And by the way, you
know, when I think about, in computer science,
the big data and so on, this predates deep learning; we go back to Jim Gray’s– – [Michael] No, no, deep
learning’s from the 1980s. All the ideas were there. David Rumelhart, I don’t know if his name is known today, but it was all there in the 1980s– – So Geoff Hinton and I were
colleagues at Carnegie Mellon together when he was first– – My advisor was David Rumelhart, and he did backprop in 1988– – So the point, really, is that I think if one thinks about the data life cycle, then one can see that there is plenty of room
for people who are doing data collection, data generation, even people who are inventing sensors and so on, all the way to the analysis people, all the way to
the people who are doing the visualization, and so on. Of course, the data management,
data storage, database, information retrieval
people see themselves as part of that data life cycle. – I find it just... totally remarkable that on my bookshelf, in my office, I have a four-volume Handbook
of Artificial Intelligence, which some of the people in
the audience probably have too. I think it dates from like
the 1990s, early 1990s. It’s a few thousand pages of material, and there is nothing in
there about probability, about probabilistic
inference or about data or statistical inference or data science or anything you like. It’s a label that has completely changed in what it appears to be– – That’s right, it’s a brand. And so I will push back: I do not think machine
learning is part of AI; that’s just not
historically correct. Most of the current
thing, machine learning, is kind of optimization, control
theory, signal processing, you know, statistics, etc., etc. And that stuff was pushed out of AI for most of its history, and it developed completely separately. It has its own, you know, conferences, its own history, its own culture. And so putting the word AI
on all of that is, historically, just wrong. My statistician colleagues, they all call themselves
machine learning; they have to, you know, to get the grant money. (laughing) But they’re not gonna call themselves AI. That’s just, you know,
that’s not right. – But lots of other people will. – A lot of other people will, in industry and in the newspapers, you know, and people will gradually go along with it. But their aspirations are different. Their aspirations are not that of McCarthy, of trying to put thought into a computer, which is a great aspiration, absolutely nothing wrong with it. Their aspirations are
very, very different. It’s to try to figure out
what’s out in the world, help do science, help build
the engineering systems, help, you know, make decisions,
and so on and so forth. And decisions not of the
kind that humans make, but decisions that have to
be made in the real world, because the real world is the way it is. – We have about 10 minutes. Would you like to take some questions? – Sure.
– Yeah. – If you’d like to go to the mic, if you have any questions. – Somebody be more
provocative than me. (laughing) – Andrew? (mumbles) – Yeah, I’m in trouble.
– Are you there? – [Andrew] Yeah, actually, I’m not thrilled with your
crowd-pleasing academia bashing. (laughing) – [Andrew] Let me just say– – Come closer to the microphone, we can’t hear very well. – [Andrew] Okay, I had one bone to pick with your crowd-pleasing
academia bashing, in that I think communication is difficult outside of academia also. I think there are a lot
of people who do statistics, or the equivalent, within big companies who have difficulty communicating too. So I think the issue
that David mentioned about having statisticians
spread out all over the place is a real issue anywhere. But the real thing, the thing
I wanted to ask about, came up, actually, in the two previous keynote
talks, which were so amazing. I wonder whether a lot
of applications have kind of a last mile problem, in
that there’s this incredible inferential framework, which
allows computer scientists to solve problems that
we couldn’t solve before, and I guess it allows
statisticians to do it too, because we can run the computer programs that computer scientists write
for us to do deep learning. And it’s just amazing. But then I wondered, like,
at the end of all this, like, there might be a
development of a new drug. And then it gets tested using
this very crude approach where all of the depth gets forgotten. And you just have, you know, two groups of 200 patients each, and they check the p-value or whatever, not using all the
structure that came before. And so I wondered what
you all think about that, like the integration. Like, if the problem is
recommendations on the computer, it’s kind of fine, because
there is no last mile; it, like, goes directly into your computer. But if the problem is something
like drug development, then, you know, is
there a way to kind of connect that to
probabilistic understanding? – So I don’t think you’re pushing back. I don’t... I agree completely with
everything you said. Maybe another slightly
different way to say it is that the focus of the machine learning crowd right now is on pattern recognition. It’s Duda and Hart on steroids. If you’re a
statistically minded person, there are two sides of the
field: pattern recognition and decision making. And decision making is vast, even making one
real world decision. It has consequences. It’s not about just
gathering data and getting an output out of a neural net. So think about going into
your doctor’s office. The doctor measures your genome,
your height, your weight, your blood pressure and so on. A
40,000-dimensional vector goes into a neural net.
If the number that comes out is over 0.7, you’re supposed to have
your heart operation, and it comes out to be 0.701, right? Is that the end of it? Is that the decision? Are you
gonna have no dialogue? You’re gonna say, what’s your error bar, first of all. But you’d also say, what’s the provenance of the data? Where did that data come from? Is it about people like me? How old is it? What machines were used? You’re gonna ask to do what-if experiments: What if I were to do more exercise? What if this or that? Which
is going out beyond the data at hand and thinking about
other kinds of data. And that’s just one decision at one time. Real world decisions are
about collections of decisions all being made jointly,
about economic consequences, all those things interacting, so you can see the vast scope of it. Currently, machine learning is
all focused on neural nets and deep learning and pattern recognition, taking unlabeled data, and that’s all it’s
about right now, mostly. Now, if you go to the
research conferences, no. But you know, that heritage
of decisions plus inferences plus data, that is the 300-year-old
statistical heritage, and that’s what we should be
inheriting and pushing forward, and making computer science that really works in the real world. So I agree with everything you said. – Question. (mumbles) We can’t hear you. (mumbles) – There we go. – [Woman student] Thank
you for the discussion. I’m a student from UCSB. You
mentioned that recommendations are not just about movie recommendations. Soon, we will be recommending
medical treatments, and we’ll have more
important recommendations. So what are your ideas on
democratizing those models that we will deploy in the world? Do we get one training sample
per person on the planet? Or do we get one training
sample per dollar paid? Or per CPU owned? Or how do you think we should solve this– – Easily said. I’ve got a simple
answer you’re not gonna like, but it’s markets, instead
of just a computer science company telling us what to do. So imagine I go to Shanghai,
I walk out on the street, and I want to eat, I’m hungry, right, and I don’t speak Mandarin. And I don’t want to get
a list of advertisements. I don’t want to get a
list of recommendations from some company that
thinks they know about me because they followed my
browsing history, right? Rather, I want to be connected
to all the restaurants around me by a mechanism. And it has to have a little
bit of recommendation: it knows that I like Sichuan cuisine and knows I have a certain price point. It knows some things about me, but not everything; it doesn’t
know my mood in that moment. And then I want the
restaurants to see me on the other side of a market. And the restaurants look at
me and other people like me and they start making offers; we have a matching protocol. And so we have a recommendation
system, two sides, a matching protocol, all one big happy market. Now, not all of us
get sent to the same restaurant because it’s the most popular
restaurant or somebody rigged it or something.
We’re gonna participate, and it’ll be more democratic just because we all have preferences
that the companies don’t know. So I think the
IT companies don’t get this. I think they’re just seduced
by the advertising model, and the advertising model is that we know everything about you, so we can tell the advertisers,
so they can advertise to you. If you get off that mentality, you say, no, I want to
connect you to the restaurants, because you guys didn’t know
about each other in a way, and I’m just gonna help out and sit back. So yeah, democratization is a great word. But to me, it’s mostly just about starting to think about markets instead of services that we provide to people because we know everything about them. – There’s a very… I think a lot of companies have been talking about what people are calling data marketplaces
and data exchanges. – Yeah.
– And this is really getting at the question of what is the value of data. And of course, that’s a big question. There are lots of
nuances to that question. But the idea of the data marketplace, I think, especially where
you’re talking about doing some matching on both
sides, seems very promising. – Yeah, the market is one– – It’s changing the paradigm– – It should be embraced;
it’s not been embraced. But markets,
the 3,000 years of markets, have made us rich and happy and wealthy. We would all still be up in the trees if we hadn’t started to come
together and trade and have markets. So
we can create new ones in, you know,
domains like, you know, finance or music or whatever, where we connect producers and
consumers in new ways with all of our data, instead
of just trying to think of it as: we get data, we don’t give you any value,
we advertise based on it. Or maybe we’ll give you a
little bit of a data market that will give you tiny amounts of money, ’cause your data is... I mean, I agree with the principle
there, but it’s not... I think there’s much
more to exploit: people who have never been
connected before can be connected, and now a
market can be created. – Who was next? I wasn’t paying attention. – [Audience member] I
have a comment to make, and a quick question. The comment is about the different models you mentioned earlier, in terms
of a data science program. I would like to mention that at Penn State, we have an undergrad data science program that started four years ago, which includes three colleges currently: computer science in Engineering, statistics in Science, and the College of Information
Science and Technology. And so the program
draws courses and faculty from across the colleges. Now we are expanding
to other colleges to include application domain areas. So just a brief comment. The question I want to ask the panel, instead of, you know, kind of thinking about what data science means currently, whether it is from the
real world, industry-driven, or from the academic discipline viewpoint, my question is, what
do you see as the relationship between data science and causality? – And causality, I’d say– – Causality has been around for 100 years in statistics, so it’s in. (laughing) – And to be a little
less facetious, causality is about knowing that if I turn this knob, something will happen
in the real world, right? And if I’m going to affect the real world, as we’ve been arguing we
should, do public policy, do systems that work in the real world, I want to know what things to twist to make the real world react. And that’s causal inference. It’s been around for a long
time, and always will be. – I want to make another
comment about data science; whether it’s in academia or industry
doesn’t really matter. One of the reasons I think
we need to take this question of what data science is seriously, and have some control over what it means, is that I think in industry
the title “data scientist” can mean a lot of different things. It can mean anyone from
someone who’s doing data cleaning and data wrangling all the
way to someone with a PhD in machine learning. And industry itself is
grappling with that problem. Because when you’re advertising
for your data scientist, what are the skills that
you really want for that? And this is
where industry and academia can actually have a conversation, because we in academia
are producing people with data science skills, through courses, through degree programs, and so on. And we should be matching
the needs of industry and talking with industry and saying, well, this is what it means
to be a data scientist. – One last question. – [Man audience] Great, I think I’m next. So I love that you framed much
of what we call data science today as engineering, and in
fact, really, problem solving. At the same time, there’s a difference between problem solving and engineering. Engineering has a robust basis
in understanding science. We license engineers, and we expect them to have some fundamentals. And I see way too many
data-driven problem solvers who ignore that, who go out, who take data and have no idea of the distributions, who take survey data and
don’t know about the sampling, and go and do things. So do you have thoughts
on what we will require the future engineer data scientist to believe, to master, in
the way of basic training, to be able to carry out
this profession responsibly? – Well, in some ways, it’s obvious: it’s got to be everything you just said, specifically inference ideas; it needs computational
modularity and scalability and modeling ideas. So let me just allude to
something we did at Berkeley. Our freshman-level class has become kind of famous now. And we designed this to be
a Python class, effectively; it will teach you Python programming. But wherever you had a kind of classical “what are you using it for?”, where the answer was
maybe a Fibonacci series, or another example would
be, we’ll teach you how to write a little game, we take that out and put in a
statistical inference problem. So if you go into industry, on day one you’re gonna probably be asked
to do an A/B test, right? Here’s our website before;
we’re gonna tweak the website. Here’s the website after. Is it better? You’re gonna have 1,000
people on this website, 1,000 people here; you have
two columns of 1,000 numbers. So I did a little experiment. I went to my computer science
colleagues at Berkeley and stood up and said,
we’re going to teach a class where we take out the Fibonacci
series and we put in an A/B test as exercise number one in the class, and we’ll teach just enough
Python to do such things. Can you guys give me an
idea of how you would solve that A/B test? 1,000 numbers here, 1,000 numbers here. Are they really different? And I couldn’t get
any of my faculty members to say they knew how to do it. Someone sort of raised their hand and said, “well, maybe a t-test.” I said, wait a minute,
there’s no Gaussian here; these are two columns of numbers. No one knew about a permutation test, which is a beautiful,
simple solution to this. It gives you the inference,
you know, beautifully done. And all you need computer science-wise is to permute lists of numbers
again and again and again. So students can learn to do that. And they can love doing it with Python. And then they can try it on real data. And they can actually
make real inferences. And that’s like after three
weeks in your freshman year. And then you repeat that
cycle about 10 times during the semester on other topics. And so they come out at
the end of that knowing how to program Python,
but also knowing how to solve real world modern data problems, and then actually having
done work with real data and coming out super excited. So we have 1,500 students a
semester now taking the class, and they love it. – Jeanette gets the last word. – I want to speak a bit more, just
because Mike talked about the Berkeley data course. I just wanted to convey
the rigor of the master’s program that we have at
Columbia, where we require three computer science courses
and three statistics courses. We also have a capstone course, which is where we
have an industry affiliate bring in real world data, and students work in teams
solving real world problems. And that is also when we raise
questions of data ethics. So we try, in the seven requirements plus electives, to at least set a bar for what it means for
someone coming out with a data science degree. And this
is a very rigorous program. Highly quantitative. – Super, would you like
to make closing remarks? We’re out of time; time flies. It’s actually three o’clock. – Oh, well. No, we’re heading to... – Go ahead, say something grand. – I get to say something
grand when we close? You can say something. – Okay, now’s the time. Will you take the podium, or are you gonna do it from where you are? – I’ll take the podium. This is Jeanette wearing multiple hats. Actually, I’m really
substituting for someone who couldn’t come at the very last minute. So first of all, I want
to thank all of you. We have to close immediately. But I understand that not all of you heard my introductory remarks. I’m not going to repeat all of them. But there’s a couple that
I do want to emphasize. First of all, ACM
and IMS have agreed that we want to continue this event. It will be focused on
foundations of data science. And we want to bring
in other organizations to represent the other
fields that contribute to the foundations of data science. And so be on the
lookout for this event to happen again, with other fields and other organizations participating. So thank you very much
for staying the whole day. And I hope you enjoyed it. Thank you. (audience clapping)
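
The permutation test mentioned in the A/B test exchange above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Berkeley course material: it assumes the quantity compared is the difference in means between the two columns, and the function name `permutation_test`, its parameters, and the synthetic "before" and "after" numbers are all made up for the example.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed_difference, p_value). The function name and
    parameters are illustrative assumptions, not from the transcript.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b

    # Under the null hypothesis the group labels are interchangeable,
    # so pool the values and reshuffle the labels again and again.
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1

    # Adding 1 to both counts includes the observed labeling itself,
    # which keeps the estimated p-value strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    # Synthetic, skewed (non-Gaussian) data standing in for the two
    # columns of 1,000 numbers; the values are invented for illustration.
    data_rng = random.Random(42)
    before = [data_rng.expovariate(1 / 30) for _ in range(1000)]
    after = [data_rng.expovariate(1 / 33) for _ in range(1000)]
    diff, p = permutation_test(after, before, n_permutations=1000)
    print(f"observed difference in means: {diff:.2f}, p-value: {p:.3f}")
```

Nothing here assumes a Gaussian: the only ingredients are pooling the two columns, shuffling, and counting how often a reshuffled difference is at least as large as the observed one. A small p-value suggests the two columns are unlikely to differ by chance alone.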
