Mark Newman 3 – Using Networks To Make Predictions


good evening everybody can everybody
hear me yes my name is Geoffrey West and I’m on the faculty at the Santa Fe
Institute and I had the great honor until about a year or so ago of being
president and I also have the honor of introducing the speaker for the last of
the Ulam lectures you are all presumably familiar with the fact that every year
we put on this series of Ulam lectures where we invite a distinguished speaker
to talk on a subject that we think is of general interest but also talk about
it in some depth over three evenings and this is the last of those three and I
presume that most of you have been here the other two but I will repeat some
things just to make sure everybody’s so to speak on the same page our
distinguished lecturer is indeed distinguished it’s Mark Newman sitting
down in the front you will hear him again in a few minutes Mark is very well
known to the Santa Fe Institute community he’s well known actually
within this community because he has given a public lecture in the regular
public lecture series but he’s internationally known for the work about
which he’s been talking namely this kind of revolution in thinking about many
problems in terms of network structures and network dynamics Mark
got his doctorate from Oxford and after Oxford spent a couple
of years in a postdoctoral position at Cornell before moving to the Santa
Fe Institute where he was a research professor and when it came time to move
on he moved to the University of Michigan in the physics department at
the University of Michigan associated also with their complex
systems institute and there he’s had a rapid rise
I don’t know the exact title I know it’s the Dirac professor but he’s a
university professor basically at the University of Michigan he’s had an
incredibly successful career with tremendous impact especially in this
area to do with networks and the important thing about that is that it’s
very much in the Santa Fe Institute tradition it’s not some narrowly
defined field but it is something that has implications that you’ve heard about
in the last couple of nights but you’ll hear much more of I think tonight in
terms of its impact on fields stretching across the whole spectrum actually of
science and technology and Mark will just touch on a few but as I say
it’s very much in the Santa Fe Institute tradition because the Santa Fe
Institute was set up on and indeed practices the idea that many of
the most interesting fascinating and urgent problems facing science and
society today are ones that require very broad thinking across multiple disciplines
and are very often to do with systems that we call complex everything from
questions in biology questions of ecology questions of social organization
questions of economics and markets all of these have this highly complex
character of many agents interacting with one another giving rise to
extraordinarily complex collective behavior with phenomena that encompass many
scales both in space and time things happen on very short times but also
happen on very long times they happen on short distances and long distances all
interacting together and these are some of the most challenging problems and the
development of thinking about these things in terms of networks whether it
is within the cell within an organism within an urban organization within
markets within social interactions this has been an extraordinary tool in coming
to terms with not only the understanding in
qualitative and quantitative terms but taking that understanding into a
predictive framework and Mark has been certainly one of the great leaders in
this internationally recognized as such and we’re absolutely delighted that he
agreed to give these lectures and come back to Santa Fe and give us a big
overview of some of the exciting things that are going on in that field and to
which he’s made major contributions so with that I’ll turn it over to Mark
thank you very much Geoffrey for that nice introduction hello everyone so just
in case you didn’t make it to the previous two lectures let me start off
by giving you a brief rundown of what we’ve talked about so far so the topic
is networks as Geoffrey mentioned this is the last of these three Ulam lectures in
the previous lectures we talked about first of all what is a network it’s a
bunch of dots joined together by lines we call the dots nodes and the lines edges I
will use those words frequently the reason we’re interested in networks is
because they can be used to describe all sorts of things in the world
as Geoffrey mentioned things like technological networks like the Internet
information networks like the World Wide Web biological networks like food webs
social networks which link people together in various ways and then last
night we talked about once we know about these networks what can we do with them
what can we understand better and I gave you a bunch of examples of things that
we can say for instance we can talk about the idea of centrality which
answers questions like which are the most important nodes in a network which
is the most important website which is the most important person in a social
network transitivity the idea that if you have two friends then they’re quite
likely to be friends of each other as well by virtue of the fact that they
have you as a common friend and we can use this for instance to predict who
will be friends in future people who don’t know each other yet but have a
common friend are likely to become friends we talked about homophily the
idea that people tend to be friends with others who are similar to them in some
way and modularity the idea that networks divide into these groups and we
can use this for instance to predict how networks will split and then reform so
all of these are very important central ideas in the study of networks and very
useful in many practical circumstances but they’re also limited in that these
are all questions about sort of static properties of networks think of the
network as just I’ve taken a snapshot of it here it is this is the network it’s
fixed and now I’m going to calculate something about it however many of the
things that we’re interested in about networks concern dynamics things
changing on networks in fact there are two ways really that we can look at
dynamics on networks one is cases where the network itself is changing so
nodes can be appearing and disappearing edges can be appearing and disappearing
and the other is that the network can be fixed or not changing very much but
there can be something dynamical going on on the network for example a rumor
spreading through a social network or something like that would be an example of
a process taking place on a network so I wanted to talk about these kinds of
ideas of how things change in dynamics today I have to warn you that this is an
area which is not nearly as well understood as some of these static ideas
that I talked about yesterday there are many open questions concerning
dynamics on and of networks but there are also some areas that we feel we
understand reasonably well so there are a few specific cases where we’re beginning
to make progress in understanding them and it’s
those I’d like to talk about today so the first example I want to give you
this is going to end up taking us in a variety of different directions but I
want to start in a specific place and talk about citation networks what is a
citation network a citation network is a network of documents so
an example would be legal cases in this country when a judge decides a legal
case they write an opinion to explain their thinking and how they decided what
the verdict should be in the case and often in fact virtually always
when they write their opinion they will cite i.e. refer to previous opinions written by
other judges to establish precedent this case was kind of like this other case
and so I took my thinking from that case so now you have a network where the
nodes in the network are legal cases legal opinions and the edges
connecting them together mean this opinion cites refers to this other opinion
another example of a citation network probably the most famous example is the
network of scientific papers when we scientists discover something which we
think is exciting and other people would like to know about it we write papers
about it to tell people to explain the idea but usually in fact basically
always we’re basing our ideas we’re building off things that other
scientists have done in the past so we cite i.e. refer to previous papers
that those other people have done in order to give credit to the people who came
up with the ideas that we’re building on so here’s a paper of mine for example if
you turn to the end of your typical scientific paper there is a bibliography
which says you know these are all the other papers that I based this work on
and you can look at one of these papers for instance this one here it gives a
reference what journal and page I should go look in and you can go to the
library pull that volume off the shelves and find the paper that it’s
referring to so there’s a connection here from one paper to
another paper the nodes in the network are the papers and the edges joining them
together mean this one cites this other paper networks of scientific papers are
actually an interesting case they’re one of the things that have been studied for
the longest this famous paper by Derek de Solla Price was published in the
journal Science in 1965 that’s oh here’s a picture of a citation network I’m
gonna skip over that here I want this here’s Price so this is one of the kinds
of networks that we’ve studied for the longest I mentioned before that social
networks are probably the oldest example of quantitative scientific study of
networks they’ve been studied since the 1920s or so 1930s but this one is getting close this
has been since the 1960s people have been doing this about 50 years people have
been studying these citation networks it really started off with this guy here
his name is Eugene Garfield he’s still alive I’ve met him a couple of times
nice guy what he was doing back in the sixties is he was creating what we would
nowadays call a database of scientific papers that have
been published and who they cited which ones cited which other ones I mean I
call it a database I don’t think he had a computer I think it was probably just
a really big Rolodex but anyway he was assembling this database it
eventually became something called the Science Citation Index it still exists
to this day and is a very important tool that scientists use every day and he had this
friend Derek de Solla Price Price was an interesting character he was a
physicist by training he had a PhD in physics but he ended up being more
interested in the history of science and he became a professor of the history of
science at Yale University but he had very wide interests I mean he wrote
about all sorts of things not just the history of science and amongst other
things he was interested in the patterns of citations between scientific papers
so he knew Garfield and he was able to get a hold of a copy of this database
and he did some analysis on the database and discovered a bunch of interesting
laws that seemed to govern the way scientific publication goes and probably
the most famous of those laws is what he did is he looked at how many times
papers get cited how many other people referred to this scientific paper of
course there are some papers that are referred to a lot you know if it’s a
paper by Einstein then it’s referred to by many other people but there are
other papers which hardly get referred to at all
so what he did was he looked at the distribution of how many times papers
get cited and what he found was that there are a lot of papers that only get
cited a small number of times and there are a few papers out in the tail here
that get cited a large number of times okay so this if you’ve been in the
previous lectures then this should be a kind of familiar pattern actually this
is nothing other than what we called the degree distribution of the network
previously remember we talked about a degree of a node in the network as being
the number of connections it has well the edges in this case are one paper
citing another paper so the number of connections that a paper has is just
the number of papers that cite it so this number of citations you get is
nothing more than your degree in the network and what we’re talking about
here is the distribution of degrees we talked about this for instance in the
case of the internet here but actually this is a very common pattern you see
this long-tailed picture where there’s a lot of papers that get a
few references and there’s a few papers that get referenced very many times this
particular form as Price showed follows a mathematical rule called a power law
or a Pareto distribution very good so this power law is actually kind of a
surprising thing it’s not what we normally expect to see and I’ll explain
what I mean by that so take a look at this plot on the left here this plot on
the left shows you know in principle this is kind of a similar plot to this
one down here but it’s for a different thing this one is for the heights of
human beings actually male human beings so this is heights of male human beings
from a recent survey expressed in centimeters and what you see on this
plot is that there are almost no people who have height below 150 centimeters
and almost no people who have height above 200 centimeters and virtually
everybody actually is sort of in the middle here around about 170 centimeters
which is about 5 foot 8 inches so what we’re saying is there’s a very narrow
range of possible heights that people have this is quite different from what
we see with the citations with the citations there are some papers that
get cited once and there are other papers that get cited a thousand times right
we don’t see that with the heights of people right there are people who are 6
foot tall but there aren’t people who are 6,000 foot tall right we don’t see
the same range so there really is a huge difference between what we’re seeing in
the citations and what we’re seeing in most other things like heights of people
I just gave one other example over here this is speeds of cars measured on the
highway so you see cars going at 40 miles an hour and you see cars going 100
miles an hour but you don’t see cars going at a hundred thousand miles an
hour on the highway right so what we’re seeing here with the citations really is
a radically different thing from what we’re normally used to seeing in other
things that we look at so let’s take a look at some of the actual data on
citations of papers these are recent data they’re not the 1960s data these are
quite new they’re from the 1990s this is a data set compiled by Sid
Redner it contains seven hundred and eighty thousand papers about physics I
mean there are really a lot of papers about physics
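As a rough illustration of the degree distribution described above, the sketch below counts how many times each paper is cited (its in-degree) and then tallies how many papers share each citation count. The paper labels and the edge list are hypothetical toy data, not figures from the talk, and the function names are chosen here only for illustration.

```python
from collections import Counter


def citation_counts(edges, papers):
    """Return {paper: number of papers that cite it}, i.e. the in-degree."""
    counts = {p: 0 for p in papers}
    for citing, cited in edges:
        counts[cited] += 1
    return counts


def degree_distribution(counts):
    """Return {k: number of papers that have exactly k citations}."""
    return dict(sorted(Counter(counts.values()).items()))


# Hypothetical toy network; each edge is (citing paper, cited paper).
papers = ["A", "B", "C", "D", "E", "F"]
edges = [("B", "A"), ("C", "A"), ("D", "A"), ("E", "A"), ("D", "B"), ("F", "C")]

counts = citation_counts(edges, papers)
dist = degree_distribution(counts)
print(dist)  # {0: 3, 1: 2, 4: 1}
```

Even in this tiny example the shape is the one described in the talk: most papers have few or no citations and a single paper sits out in the tail.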
first thing to notice is that 368 thousand of them have never been cited
at all that’s 47% so nearly a half of the
physics papers have never been cited by anybody nobody has ever referred to them not
even the person who wrote them has ever cited them so if you write a physics
paper and you cite it even once then you’re already in the top half of the
class then how many papers got one citation well now it
goes down pretty quickly seventy thousand two is 44,000 then 32 25 21
thousand it goes down pretty quickly but there’s this long tail if you go all
the way down to the bottom you find that there’s one paper in this particular
dataset that got eight thousand nine hundred and four citations that’s the
long tail that’s the hub that’s the node which has very many connections in the
network very good so this may seem like a slightly esoteric example you know
who really cares about citation networks of scientific papers except
maybe some scientists but it turns out that this kind of pattern this power law
or Pareto distribution is something that we see in a bunch of other complex
systems that we’re already interested in here’s a whole set of examples of things
that also follow power laws so we’ve seen citations but also for instance
word frequencies in any language there are some words most words in fact which
only get used occasionally but there are some words like "and" and "the" that get
used all the time very frequently so they’re out in the tail web hits if you
look at how many people look at a web page most web pages only a few people
look at them but there are a few web pages like Yahoo or Google or CNN or
something that lots and lots of people look at okay
so they’re out there the hubs out in the tail people’s wealth most people
have a moderate amount of money but there’s a small number of people you
know Bill Gates or whoever who have a really stupendous amount of money
they’re the ones out in the tail again also name frequencies for instance so
most family names surnames are not very common most of them have just a few
people with those names but there are a small number of names like Smith that
lots of people have populations of cities most cities are pretty small a few
cities LA New York very big so there’s a whole bunch of things that we have here
that follow the same power law behavior it’s a distinctive and slightly
surprising thing and we see it in these various different systems so I’m
going to take a momentary detour just to discuss these power laws for a moment
because this is really a fascinating area in complex systems these power
laws imply various things about the systems they describe I just want to
give you one example of something that you can do that you can calculate this
is something called the 80/20 rule so let’s take the example of people’s
wealth how much money people have so you know not everybody has
the same amount of money some people have less money some people have more
suppose I were to take the 10% of the population that have the most money and
then ask what fraction of the total money in the whole country does that top
10% have well it’s not going to be 10% it’s going to be more than that because
these are not just average people from the population these are the richest
people so they’re going to have more than 10% of the money so we can make a
plot like this one here horizontal axis is the fraction of the population so you
know what fraction of the richest people am I talking about and the
vertical axis is how much of the money they have so for instance I could
say I want to know about the top 20% of the population that’s point two here
follow the line up and across and it tells me that about 86 percent of the
money is in the hands of the top 20 percent of the population the various
curves here depend on the exact details of the
power-law behavior but that result there 20 percent have 86 percent of the money
is actually about right for the United States it’s true that about the richest
20 percent of the country have about 86 percent of the wealth and the remaining
14 percent of the wealth goes to everybody else the bottom eighty percent
so this curve is sort of a measure of the way the wealth is distributed across
people you can do similar calculations for other things that follow power laws
for instance it turns out that ten percent of the cities the biggest ten
percent of cities have sixty percent of the people in them and the other forty
percent of the people live in all the other cities put together
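The top-fraction calculation described above can be sketched in a few lines. The first function measures the share held by the richest fraction of any list of values. The second uses the standard result for an idealized pure power law p(x) ~ x^(-alpha) with alpha > 2, where the top fraction P holds a share W = P^((alpha-2)/(alpha-1)). The toy wealth list and the exponent 2.1 are assumptions made here for illustration (2.1 is simply a value that gives roughly the 86 percent figure quoted in the talk), not numbers taken from the speaker's slides.

```python
def top_share(values, fraction):
    """Fraction of the total held by the largest `fraction` of the values."""
    ranked = sorted(values, reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    return sum(ranked[:n_top]) / sum(ranked)


def pareto_top_share(fraction, alpha):
    """Share held by the top `fraction` under an exact power law
    p(x) ~ x**(-alpha) with alpha > 2 (an idealization, not real data)."""
    return fraction ** ((alpha - 2) / (alpha - 1))


# Hypothetical toy data: ten people's wealth in arbitrary units (total 100).
wealth = [1, 1, 1, 2, 2, 3, 4, 6, 20, 60]

print(top_share(wealth, 0.2))      # richest 2 of 10 hold 80 of 100 units
print(pareto_top_share(0.2, 2.1))  # roughly 0.86 for this assumed exponent
```

The closer alpha is to 2, the more skewed the curve becomes, which is why the talk says the various curves depend on the exact details of the power law.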
75% of people have surnames in the most popular 1% of surnames just the top 1%
and the remaining 99% of all surnames account for just the remaining 25%
of the population so this curve is sort of a measure of how skewed the
distribution is what a large fraction live in the biggest cities or have the
most common surnames and so forth so this in itself these power laws are
a big area of study in complex systems but I want to get back to the networks
we started talking about and ask a slightly different question which is
well if we have these power laws in all these things where does it come from
there are various theories about where power laws come from but one of the
leading theories is the theory of preferential attachment this is
something that Price himself worked on in the 1970s because he was interested
in these citation networks so let me describe it in the language of Price’s
citation networks his suggestion was so each paper published cites some number
of previously existing papers and his proposal was that they cite previously
existing papers in proportion to the number of people who cited those papers
before so this is a sort of rich get richer phenomenon it says if lots of
people have already referred to your paper then you’re going to get more
citations and if very few people have then you’re not going to get very many
in a way this seems like a reasonable thing because one of the main ways that
scientists find papers is that other people cited them you know somebody
referred to them you say oh I see this person is referring to this paper I’ll
go look at it and you go look at this paper and you read it and maybe you get
some idea from it if that idea ends up going into your work then maybe you cite
it too so a paper is more likely to get another citation if lots of people cited
it in the past on the other hand this sounds a little unnatural in the
sense that it has absolutely nothing to do with the content of the papers
doesn’t matter what your paper is about whether it’s a good paper whether it’s a
terrible paper right in this theory in Price’s preferential attachment theory
it’s all just a question of the rich get richer
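The rich-get-richer rule described above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not Price's own mathematical treatment: each new paper cites a fixed number of earlier papers, chosen with probability proportional to the citations they already have plus one (the plus one is an assumed baseline so that uncited papers can still be found), and repeated picks of the same paper are simply redrawn. The function name, the parameter values and the seed are all hypothetical.

```python
import random


def simulate_price(n_papers, cites_per_paper=3, seed=0):
    """Grow a citation network by preferential attachment.

    Returns a list where citations[i] is the number of citations
    received by paper i (papers are numbered in order of publication).
    """
    rng = random.Random(seed)
    citations = [0]  # paper 0 starts the field
    # Each paper appears in `pool` once as a baseline and once more per
    # citation received, so a uniform draw from `pool` picks a paper with
    # probability proportional to (citations + 1).
    pool = [0]
    for new in range(1, n_papers):
        k = min(cites_per_paper, new)  # cannot cite more papers than exist
        chosen = set()
        while len(chosen) < k:  # redraw duplicates (a simplification)
            chosen.add(rng.choice(pool))
        for old in sorted(chosen):
            citations[old] += 1
            pool.append(old)
        citations.append(0)
        pool.append(new)
    return citations


citations = simulate_price(20000)
ranked = sorted(citations, reverse=True)
print("median citations:", ranked[len(ranked) // 2])
print("most cited paper:", ranked[0])
```

Nothing in the code refers to what a paper says, yet the output has the long-tailed shape from the talk: the typical paper has very few citations while a handful have hundreds.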
it depends only on how many citations you’ve had in the past so that may seem a little
bit unrealistic but wait a bit we’ll see what happens so Price took
this seemingly simple theory and he gave this complete mathematical treatment
of it so you can calculate all sorts of things in particular you could calculate
how many citations papers are going to get so there was some math that went in
here and some formulas that come out at the end we don’t really care about
formulas what we really care about is the picture this picture shows the
distribution of the number of citations that papers get the number of times
they’ve been referred to and that’s the sort of purplish brown
circles in the background there and this dotted line is the prediction of Price’s
theory so it really does pretty well despite the fact that this is a theory
which totally ignores the actual content of the papers right it’s just assuming
that your paper gets cited because other people have cited it in the past so this
is a nice theory but it turns out in fact that you can do a lot more with
this than what Price did so here’s one example of something else we can
calculate from his theory so we do a bunch more math and we get out some
more formulas and what this tells us is the time evolution of the citation
network so right we’re back to the subject we started talking about at the
beginning of the lecture now we wanted to know about how the network evolves well
this network is evolving all the time all the time people are publishing new
papers and those papers cite previous papers and so forth so this is precisely
an example of a network that’s evolving over time and Price’s theory if it’s
correct and it seems to be pretty good tells us how this thing is evolving over
time so we can do some other calculations and calculate for instance
the following we can calculate if a paper is published at a particular time
how many citations do we expect it to have by now and when we calculate that
the result looks like this so horizontal axis time the beginning is you know the
beginning of the field whenever the field started and the end is now the
present day so look at the black curve first you’ll notice that curve goes down
to zero at the present day that’s because a paper
that was published yesterday doesn’t have any citations yet nobody’s cited it
because it’s only just been published so we expect the curve to go down to zero
there what’s surprising is this huge spike that you get as you go back in
time towards the beginning of the field the papers that are published
early in a field have enormously more citations they are referred to many
many more times than papers that are published late and remember that this is
nothing to do with their content because content doesn’t even come into this
theory so you might think that well in that case this isn’t a very good theory
but here again are the real data the purple stuff in the background is the
real data for actual papers and the agreement’s not perfect you know there’s
some wobble there but it’s pretty good in fact this line pretty well predicts
the number of citations that real papers get so this is real papers now and
you see that there is this effect this huge spike if you published a paper
right at the beginning of the field then you get enormously more citations than the
people who came along only a little bit later this is something we call a first
mover advantage in this case illustrated for citation networks but it turns out
that this can apply to all sorts of other things as well the basic thinking
behind the first mover advantage is the following that if you’re one of the
first people to publish a paper in a field then somebody comes along after
you and publishes another paper they’re gonna have to cite you because you’re
the only game in town there isn’t anybody else to cite so it may be that
your paper isn’t very good and maybe your paper isn’t very relevant it
doesn’t really matter you know only three other papers have been published
on this subject and I have to cite something so I’m gonna cite those ones
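The first-mover effect can be checked with the same kind of toy simulation: grow a network by preferential attachment, then average the citation counts over each tenth of the papers in order of publication. As before this is only a sketch under assumptions. The rule "probability proportional to citations plus one", the three citations per paper, the network size and the seed are illustrative choices, and the function names are hypothetical.

```python
import random


def grow_citations(n_papers, cites_per_paper=3, seed=1):
    """Preferential attachment: each new paper cites earlier papers with
    probability proportional to (citations so far + 1)."""
    rng = random.Random(seed)
    citations = [0] * n_papers
    pool = [0]  # paper i appears once, plus once per citation it has received
    for new in range(1, n_papers):
        targets = set()
        while len(targets) < min(cites_per_paper, new):
            targets.add(rng.choice(pool))  # duplicates are redrawn
        for old in sorted(targets):
            citations[old] += 1
            pool.append(old)
        pool.append(new)
    return citations


def mean_by_decile(citations):
    """Average citations for each tenth of the papers, earliest tenth first."""
    size = len(citations) // 10
    return [sum(citations[i * size:(i + 1) * size]) / size for i in range(10)]


means = mean_by_decile(grow_citations(10000))
for i, m in enumerate(means):
    print(f"papers {10 * i:>2}-{10 * i + 10}% by publication order: {m:.2f} citations on average")
```

Every paper in this simulation is identical apart from when it appears, so any gap between the earliest and latest tenths comes purely from timing and amplification, which is the point made in the talk.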
right so those early papers published in a field get a lot of citations just because
they’re the only game in town which means they get an early lead over other
later papers and then this preferential attachment effect kicks in and says the
guys who have the most citations they get the most extra ones how many you get
depends on how many you have so once you get that early lead this preferential
attachment thing amplifies it so that you always have a lead and eventually
you’re way out in front of everybody else and it really doesn’t matter what
you said in your paper or whether it was relevant you still get that early lead so none of this depends on the actual
content of the paper any questions and this is sort of supported to some
extent by the fact that the early citations in a field are often between
relatively unrelated papers so in some sense this is kind of a depressing
result it sort of tells us scientists that really what you need to do if
you want to be well cited is it’s much better to write a
mediocre paper in tomorrow’s hottest field than it is to write a really good
paper in today’s I don’t entirely believe that I certainly believe that
there is an effect of paper quality here in particular you know how do you get
that early lead the first one to get in the lead well part of that is by writing
a paper that people actually pay attention to and are interested in there
may be other things that work as well and being early is certainly extremely
good I actually have a friend who tells me he exploited this idea so
somebody told him about a new field that was coming along that they thought was really
exciting and he wasn’t that interested in it but he said well I’m gonna write a paper
in it anyway because then it’ll be one of the first papers and now he tells me
apparently lots of people have referred to it so maybe you can use this
and this matches the data very well as I say so you know this is kind of
a slightly esoteric example this whole citation you know who really cares about
how many citations a paper gets maybe a few scientists do but there’s no real
reason why you should but as I say you can apply this to other things as well
there was a really nice example of this effect there’s a guy called Duncan Watts
who used to be a postdoc at the Santa Fe Institute back in the 90s
he’s now principal scientist at Yahoo Research the Internet company and he
did a study of downloading music on the Internet
so what he did for this study was he set up a website where participants could
download music and the music was music by relatively unknown bands who were
happy to donate their music because they wanted the publicity so he set up this
website people could download this music but the website was very simple and
didn’t give you much information so all it gave you
for each song that you could download it gave you the name of the
band it gave you the name of the tune and crucially it gave you how many
people had downloaded that particular song before so given that these are
unknown bands and people can’t really have any way of choosing which are the good
songs to download and which are the bad songs it turned out that what a lot
of people did is they just went for the ones that had been downloaded a lot
before they say oh I see this song has been downloaded by a lot of other people
must be a good song I’m gonna listen to that one as well so this is
precisely this preferential attachment effect the things that have more
downloads get more downloads for whatever reason it’s precisely this
sort of amplification effect so what indeed they found was that you end up
getting this power law this distribution where most of the songs only got a small
number of downloads but there were a few hubs out there in the tail that really
got a lot okay so maybe that isn’t that surprising but now they did a really
clever thing they lied about the number of downloads
they changed the numbers so they didn’t actually reflect how many times those
songs had been downloaded before and they still found that the ones with the
biggest numbers got downloaded the most and ended up way out ahead of all the
others so this really directly tells you that there’s no
effect of the content of the songs or whether they’re good songs or bad songs
you can make any song the one that gets out in the lead and has this huge
advantage over all the others all it has to do is get a little early lead in the
downloading which you can give to it for instance by lying and
then that’ll be amplified by the preferential attachment effect and
it’ll end up getting you know hundreds of downloads and you can do
this for any song so maybe this is sort of
telling you that you know it doesn’t really matter if your product is
no good all you’ve got to arrange is for it to get a little early lead over the
competition and then this effect is gonna kick in
and you’ll get way out ahead of everybody else now to be fair one of the
ways that you can get that early lead is to have a product that’s actually good
and people like it but there are probably other ways as well you know you
can have good advertising right so the bottom line is
really all you need to do in order to have a successful song or something like
that is to get that early lead and you can do that however you want be it by
having a good song or any other way okay very good so this is one example of a
network that is changing over time the other the sort of flip side of this
dynamics is the question of networks that stay fairly constant but there’s
some dynamical process going on on the network so I want to spend a little time
talking about that other side of things as well and I want to talk about one
particular example of a process that goes on in networks it’s one that I’ve
worked on quite a lot which is the spread of disease over networks now I
talked about this a bit in the previous lecture as well and it’s obviously a
very important application of this network theory thing so one of the
things that we’ve worked on is understanding how the patterns of
connection between people affect the spread of the disease so you’re
interested in some particular disease and here’s this little picture you have
in your head you have some network of people that are connected together and one
of those people gets sick represented by this red node in the middle and then
that person can give the disease to their friends and they can give it to
their friends and to their friends and so forth so the disease
spreads over the network like this so what this is telling you is if you want
to understand how the disease spreads it’s absolutely crucial to understand
what the structure of the network is different network different spread of
disease this picture however is not really entirely accurate this is a rather
oversimplified picture what’s really happening is something more like this in
the real world you know if I get the flu I don’t necessarily give it to all of my
friends I might give it to some of them but there are some others who won’t
catch it from me it’s not a hundred percent certain that a person you know I
might not see some people this week or if I do you know maybe I’m not sitting
close enough to them to cough on them and give them the flu so we can think of
this as modifying the network now some of the edges are bold
here these black edges those represent the ones where I actually saw that
person and I was sitting close enough to them and so forth edges
sufficient to transmit the disease and the gray ones represent all the others
these are the ones where you know maybe I didn’t see the person that week so now
what happens is if the person in the middle gets sick they give the disease
to some of their neighbors but not others this person over here escaped
getting the disease and then they pass it on to other people and they pass it
on to other people you notice in the end in fact this person did end up getting
sick but not from the first person the disease had to go around another way so
this is a much more realistic picture of a spread of disease on a network and in
fact this is the basic picture that a lot of calculations of the spread of
disease on networks are based on but the nice thing about this picture is
that really this network here is still just a network if I just forget about
the gray edges here and just think about this as being the black edges well
that’s just a network itself the disease is still just spreading on it although it’s not
the same network that I started off with it’s a different network I’ve modified
it a bit to allow for the fact that not all of the edges transmit the disease
but it’s still just some network and so I still just have the disease spreading
on a network so actually I can account for this idea that there’s not a hundred
percent probability of transmission of the disease pretty easily just by
modifying my network appropriately this kind of modification where I you know
color in some of the edges and not others is called a percolation model and
that’s one of the fundamental models that’s used in the mathematical study of
the spread of disease very good so now the point is so we have this sort
of mathematical model that we can use to simulate the spread of disease over the
network and now the point is we want to understand how the structure of the
network is going to affect the spread of disease because different shapes of
networks are going to make the disease spread in different ways and we
want to understand the connection between the structure of the network and
the spread of disease so take for instance this network here that we saw in the previous
lecture this is a network that came from a study of the spread of tuberculosis
which is increasingly becoming a serious health risk because of antibiotic
resistant strains so what this network shows is the nodes
are people in the study and the edges represent who’s been in close physical
proximity with whom recently because that’s what you need in order to
transmit TB TB is transmitted by one person coughing on another or sneezing
on another so you have to be close together and what you see one of the
things you see when you look at this network is that there are again some
hubs there are these nodes in the network that have high degrees this is
again the long tail effect that we were just talking about there’s a small
number of nodes with a lot of connections and you can certainly
imagine that that could have a substantial effect on the spread of
disease for instance if this person in the middle here got sick they could give
the disease to a lot of other people that would be bad so now we say well
what property is it of the network that controls the way the disease
spreads if you had to make a first simple guess maybe the sort of thing you
would guess is well I guess that for instance the average degree of a node is
an important thing remember the degree is the number of connections you have
and if on average nodes have a lot of connections then the disease is going to
spread relatively quickly and if on average they only have a few it’s not
going to spread so quickly so actually that’s a pretty shrewd guess that’s
basically the right idea but it’s actually not exactly right it turns out
that it’s not the degree of a person that matters but the square of their
degree in other words take the degree and multiply it by itself I can
give you the sort of hand waving argument for why that is so consider
this person in the middle here again they’ve got you know maybe a hundred
friends that means that if they get sick there’s a hundred people they could
spread the disease on to so they’re a hundred times more effective at
spreading disease but this person here is also a hundred times more likely to
get the disease in the first place because they’ve got a hundred friends
right there are a hundred people they could catch it from as well as a hundred
people they could give it to so they’re a hundred times more likely to catch the
disease and a hundred times more likely to pass it on so they’re a hundred
times a hundred that’s ten thousand times more effective at spreading the disease than
the person who only has one contact so it’s the square of the degree that
matters and what that means is that the hubs in the network have an enormous
influence in fact in many cases they are the primary driving force that affects
the way the network behaves a person who has a thousand contacts is a million
times more effective at passing on this disease than the person who only has one
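A minimal Python sketch of the hand-waving argument above. It assumes, as the argument does, that both the chance of catching the disease and the number of people it can be passed on to scale with degree. The function name and the toy population (99 people with one contact each plus a single hub with 100 contacts) are made up for illustration and are not data from the talk.

```python
def relative_spreading_power(degree):
    """Hand-waving estimate: exposure (~degree) times onward reach (~degree)."""
    return degree * degree


for contacts in (1, 100, 1000):
    print(contacts, "contacts ->", relative_spreading_power(contacts))

# Toy population (an assumption for illustration only):
# 99 people with one contact each and a single hub with 100 contacts.
degrees = [1] * 99 + [100]

# Share of all contacts that belong to the hub.
hub_share_of_contacts = 100 / sum(degrees)

# Share of the total squared degree that belongs to the hub.
hub_share_of_spreading = relative_spreading_power(100) / sum(
    relative_spreading_power(k) for k in degrees
)

print(round(hub_share_of_contacts, 3), round(hub_share_of_spreading, 3))
```

In this toy population the single hub holds roughly half of the contacts but about 99 percent of the squared-degree total, which is the sense in which the hubs dominate.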
so the hubs are really very important let me show you a practical example of
that in action this is real data from the spread of SARS severe acute
respiratory syndrome which is a viral form of pneumonia that made the news a
lot in about 2002 2003 when there was a big outbreak of it in Southeast Asia
these are data from the outbreak they had in Singapore where they have very
good data about who caught the disease from whom this is from a study that was
done by Steve Borgatti what these data show is here’s where the outbreak
started with patient number one and unfortunately patient number one was a
hub in the network so they passed the disease on to many other people most of
those people are just ordinary regular people in the network nothing special
about them they don’t have that many contacts so most of them didn’t pass it
on to anybody else or maybe pass it on to one other person and then it would
have died out so the disease would have died out right there and then if it
hadn’t been for the fact that one of the people that person one passed it on to
was another hub that gives it another big boost again the disease would have
died out because most of the people number six passed it on to also didn’t
pass it on to anybody else but one of them was another hub another big boost
and so forth and so what you can see is these hubs are solely responsible for
keeping the disease alive right if it was just all the regular people with
only a few contacts the disease would just have died out but these hubs even
though they’re very small in number are extremely effective at passing the
disease along and so they’re essentially solely responsible for
keeping the infection alive if they were not there the disease would die out very
quickly well this then maybe gives us another
idea what if we could get rid of just those few hubs in the network if we
could just get rid of them then the disease wouldn’t be able to spread and
it would die out so this leads us to think about the
idea of vaccination of populations removing nodes from the network
so one of the ways to prevent the spread of a disease is to vaccinate
people if you have a vaccine right when you vaccinate people you are effectively
removing those people from the network I mean of course they’re still there in
the network but as far as the disease is concerned they’re not there anymore
because the disease can’t pass through them it can’t infect those people so if
we do your sort of standard vaccination campaign which is basically
just that you kind of vaccinate people at random you know anybody who comes in
and wants the vaccine you give it to them well if you vaccinate people at
random then most of the people you’re going to end up vaccinating are going to be
not the hubs right because the hubs are only a tiny fraction of the population
almost everybody in the population is just a regular person so if you
vaccinate people at random you’re going to end up vaccinating regular
people not the hubs so here’s an example of a small network I vaccinate
this person and this person so they’re effectively knocked out of the network
so now the network looks like this but it’s still basically connected together
right the disease can still spread through this network but now if we take
the alternative approach and we vaccinate the hubs in the network we
deliberately look for the ones that have many connections knock them out still
only the same number of people getting vaccinated but now we’ve completely
disconnected the network and now the disease can’t spread so this leads us to
this idea of targeted vaccinations if we can only find the right people to
vaccinate the right way of attacking this network then we can be very
effective at controlling the spread of disease with actually surprisingly
little effort we don’t need to vaccinate that many people in order to prevent the
disease from spreading this kind of targeted attack on a network can be
extremely effective if we take real data for a typical social network and then
simulate the spread of a disease you can do a computer simulation of the spread of a
disease we find that if you vaccinate at random you’d need to vaccinate
about 95% of everybody in order to knock out enough nodes that the disease can’t
spread anymore on the other hand if you simply target the nodes that have the
highest degrees the hubs in the network then you only need to remove about 20%
of them much more effective so you’d have to make much less effort and
take much less vaccine and you can still prevent the spread of the disease
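The comparison between random and targeted vaccination can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the network is a synthetic preferential-attachment graph rather than the real social network data mentioned in the talk, "the disease can still spread" is approximated by the size of the largest connected group of unvaccinated people, and all function names and parameters (2000 people, 2 links per newcomer, 20% vaccinated) are hypothetical. It is not expected to reproduce the 95% and 20% figures.

```python
import random


def preferential_attachment_graph(n, m, rng):
    """Grow a graph in which each new node links to m existing nodes chosen
    with probability proportional to their current degree."""
    adj = {i: set() for i in range(n)}
    stubs = []  # every node appears once per edge end, so choice() is degree-weighted
    for i in range(m + 1):  # small fully connected starting core
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
            stubs += [i, j]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            adj[new].add(t)
            adj[t].add(new)
            stubs += [new, t]
    return adj


def largest_component(adj, removed):
    """Size of the largest connected group once the removed nodes are ignored."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best


rng = random.Random(0)
graph = preferential_attachment_graph(2000, 2, rng)
n_vaccinated = len(graph) // 5  # vaccinate 20% of people in both strategies

random_choice = set(rng.sample(sorted(graph), n_vaccinated))
by_degree = sorted(graph, key=lambda v: len(graph[v]), reverse=True)
hub_choice = set(by_degree[:n_vaccinated])

random_left = largest_component(graph, random_choice)
targeted_left = largest_component(graph, hub_choice)
print("largest group left, random vaccination:  ", random_left)
print("largest group left, targeted vaccination:", targeted_left)
```

The same number of people is vaccinated in both runs; only the choice of who gets the vaccine differs.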
this effect is called herd immunity that’s the name that
epidemiologists give to the idea that you can immunize a small fraction
and that effectively gives immunity to everybody else however there
is a catch here and the catch is how do we find these people with the highest
degrees because we don’t have a map of the social network you know there’s no
map of the social network of Singapore where you can just go and look them up
and pick them out of a map oh look this person knows a lot of people this
person knows a lot of people well there are various ways one could do this
but one particularly nice suggestion that I like is that we can use the
network itself to find the people with many connections this idea was put
forward by Reuven Cohen and Shlomo Havlin they call it acquaintance
immunization and it works as follows so in your typical vaccination campaign
you just choose people at random and you vaccinate them and that doesn’t work
very well in their suggested way of doing this acquaintance immunization you
choose people at random but then you don’t vaccinate them instead you ask
them to name one of their friends and you go find that friend and you
vaccinate them instead so this sounds like a crazy idea but actually it’s not
actually it’s really smart because if you think about it suppose someone has a
thousand friends then they have a thousand opportunities to be nominated
by one of their friends as the person who should get vaccinated a person who only
has one friend has only one opportunity to be nominated so the people who have
lots of friends are much more likely to get vaccinated in this scheme than the
people who don’t so it’s doing exactly the right thing
it’s vaccinating the high degree people and leaving the low degree people alone
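A small Python sketch of why naming a friend favours the well-connected. It computes, exactly, the chance that each person ends up vaccinated in a single round of the scheme described above: pick a person uniformly at random, then have them name one of their friends uniformly at random. The hub-and-ring network and the function names are assumptions made up for illustration, not the networks used in the original simulations.

```python
from fractions import Fraction


def hub_and_ring(n_ring):
    """Toy network (hypothetical): node 0 is a hub who knows everybody, and
    the other n_ring people sit in a ring, each knowing two neighbours."""
    adj = {0: set(range(1, n_ring + 1))}
    for i in range(1, n_ring + 1):
        left = n_ring if i == 1 else i - 1
        right = 1 if i == n_ring else i + 1
        adj[i] = {0, left, right}
    return adj


def nomination_probability(adj):
    """Exact chance that each node is the one vaccinated in a single round."""
    n = len(adj)
    prob = {v: Fraction(0) for v in adj}
    for person, friends in adj.items():
        for friend in friends:
            # person is picked with chance 1/n, then names this friend
            # with chance 1/(number of friends they have)
            prob[friend] += Fraction(1, n) * Fraction(1, len(friends))
    return prob


network = hub_and_ring(50)
prob = nomination_probability(network)

print("hub:          ", float(prob[0]))
print("ring person:  ", float(prob[1]))
print("random choice:", 1 / len(network))
```

Under plain random vaccination every person would have the same chance; with the friend-naming step the hub's chance goes up and everyone else's goes down, without anyone needing a map of the network.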
so they did simulations of this on a computer you know simulate the spread of
disease on a computer and it turns out that this is actually an extremely
effective way of doing it with a lot less work a lot fewer vaccinations you can
prevent the disease from spreading as far as I know this has not yet been
tried out in practice with a real disease but maybe it should be maybe
this will be better than the kind of things we’re doing at the moment okay so so this so this is a particular
example of what I would call resilience or robustness of a network so the
crucial question here is how resilient or robust is this network to this attack
we’re attacking it by vaccinating some of the nodes to remove them from the
network and in this particular case we want the network to be not very robust
but there are other cases where we would like networks to be robust take the
internet for example we want the internet to work now there are nodes
that fail on the internet all the time this is a regular occurrence but we
would like to engineer it in such a way that when that happens the network
still functions pretty well and you can still you know get to your favorite web
site or whatever it is so there are some networks we’d like to be robust and
there are some networks like this one we’d like to be not robust but
actually kind of the same mathematics applies to both cases and this is a big
topic of study currently in network theory so an interesting question here
is what is it about networks that makes them robust or not robust can we tell by
looking at the structure of network whether it’s going to be robust there
are various different things that affect this robustness of networks but one
interesting one is something that we already saw if you were here at
yesterday’s lecture and that’s this idea of homophily remember this picture
from yesterday which is a picture of friendships between school students in a
US high school and what we saw is that the black and white nodes
are the black and white kids in the school and mostly the white kids are
friends with other white kids and the black kids are friends with other black
kids they tend to be friends with people of the same ethnic group as them so this
is the effect that we call homophily similar nodes connect together I
want to talk about another variation on this idea homophily by degree this is
okay this is kind of getting sophisticated now we’re going to combine
two of our network theory ideas together to make a combined idea this idea of
homophily nodes like to be connected to other nodes that are the same as them
and the idea of degree how many connections do you have homophily by
degree means that nodes with a lot of connections like to connect to other
nodes that have a lot of connections so the gregarious people are hanging out
with gregarious people and the hermits are
hanging out with other hermits okay here is an example of looking at that from a
paper by some folks about 10 years ago what this plot shows this is Internet
data and the horizontal axis is number of connections degree number of
connections that a node on the internet has vertical axis is how many
connections do your neighbors have so what this is showing is that if you have
very few connections you’re down here then your neighbors have a lot
whereas if you have a lot of connections you’re out here then your neighbors have
very few so this is actually sort of anti-homophily I guess heterophily would
be the correct word this is actually the party people hanging out with the
hermits and the hermits hanging out with the party people not the other way
around but you can get both of these things in fact here’s some simple
measurements so one thing that we came up with was just a simple number a
correlation coefficient that tells you which way it’s working
are the high degree nodes connected to other high degree nodes or are they connected
to the low degree nodes this number here is going to be positive if it’s party
people hanging out with other party people and it’s going to be negative
if it’s party people hanging out with hermits so here’s the internet
for example and you see a negative number that’s what we just saw the high
degree nodes are connected to the low degree nodes and vice-versa
these ones at the top are all social networks they all have positive numbers
so the party people are hanging out with the party people in that case there’s
an interesting split between these two in all the social networks it’s always
positive it’s always people with a lot of friends hanging out with other people with a
lot of friends hermits hanging out with other hermits in all the other networks
we’ve looked at the internet the World Wide Web biological networks and so forth
it’s the reverse the high degree nodes are connected to the low degree nodes why
do I care about this well it turns out that this has a big effect on the
structure of networks here’s what happens if it’s positive in other words
if it’s the party people hanging out with other party people you get a
network that looks like this this is not a real network I just generated this one
on my computer to give you an idea what it looks like you get this sort of dense
core all the party people are connected to other party
people so you get this dense core of people all connected together who have a
lot of connections and then you get this thinner sort of periphery around the
edge which is the people who don’t have so many connections on the other hand in
the negative case the party people are hanging out with the hermits so you get
these sort of star like structures see here’s a party person and there’s all
the hermits that they’re hanging out with ok so what this means now is that
the party people the high degree nodes are now strewn all over the network
they’re not concentrated in a group in the middle in the core like they were
they’re now spread out all over the network because they don’t want to be connected
together anymore so you get quite different structures for these two
networks and it turns out that this has a big effect on the resilience of the
network now suppose we want to vaccinate people against the spread of a disease
so we’re talking about a social network social networks are over here on the
left hand side they have the positive numbers the high degree nodes connect
to other high degree nodes so we want to vaccinate the people that have
the most connections so we vaccinate a bunch of people who have lots of
connections but all of those people are concentrated here in the middle of the
network in the core in this particular case so what we’d end up doing is just
vaccinating people here in the core and we’d effectively cut a little hole
out of the middle of the network but the rest of the network around it would
still be just fine so in this case this is actually not an effective strategy
for vaccinating another way of saying that is that this network is very robust
against that particular attack that particular way of vaccinating the community
this is bad news because this is social networks over here on the positive side
we want the social networks to not be robust but actually it turns out they
are robust over here on the other side we have things like the internet the
negative ones but now these ones turn out to be very fragile because their
high degree nodes are spread all over the place
knock those out and you’ve knocked out all the crucial centers all over the
network and the network falls apart very quickly so we actually have exactly the
opposite of what we’d like to have this kind of bad news picture really the
social networks it turns out are very robust against these kinds of attacks
and the internet is very fragile against them we’d like this one to be robust and
this one to be fragile but it’s the wrong way so I’ve just got a few minutes left
in the last few minutes what I’d like to do is just tell you about a few of the
things that people have done with these ideas of using networks to
understand the spread of disease this has now become a big industry this is an
important topic a lot of people interested in and people have done some
very sophisticated things towards understanding how the structure of
networks affects the spread of disease one example is actually a local example here
in New Mexico something called EpiSims which was created by people working at
Los Alamos what EpiSims does is it’s a simulation of an entire city in fact
it’s a simulation of the entire city of Portland Oregon it includes every person
every vehicle moving along the roads the entire road plan of the city every
building in the city it’s an incredibly detailed micro
simulation that incorporates all the things that are going on in the city and
then they use this simulated city to simulate the spread of a disease
so what they do is they start a disease spreading somewhere in the city and
you know they actually follow all the individual people moving around they go
to work or they go to school or they go and do whatever they do and who catches the
disease from whom and they’re able to predict in great detail how the disease
is going to spread through the city one of the nice things about this is that
they can now change anything they like about it
you know let’s say we have you know a curfew let’s say we vaccinate this half
of the population let’s say a bunch of people leave town you can just try all
of those things in your simulation and see what effect it has on the spread of
the disease so the idea behind these kinds of very detailed studies is that
it enables us to try a lot of different experiments about what are good strategies
for controlling the spread of disease before it actually happens you don’t
want to be experimenting with different strategies while an outbreak is actually
going on this allows you to just do it on the computer
try various different things and work out what’s most likely to work there’s
no claim of course that this is an exact representation of the world but they’ve
done a lot of work to make it as accurate as they possibly can this is
the spread of disease on a relatively small scale the scale of a single city
there have also been studies done where people look at much larger scales if
you’re looking at the spread of disease over the whole world then it turns out
that the crucial factor in the spread of disease is airline traffic it’s people
getting on airplanes who are sick and flying to other places and when they get
there they give the disease to somebody else airline flights are the primary
vector for carrying diseases around the world airlines are basically just big
metal tubes flying through the air full of germs carrying them to other places
so this is from work by Vittoria Colizza and colleagues at
Indiana University they have simulations of individual cities and then they have
the complete map of all the airline flights going on in the entire world
when those flights take place how many people on average fly on each of those
flights every day and then they do a big computer simulation they say suppose we
have an epidemic starting for instance like SARS did in Southeast Asia and they
can predict when that disease is going to get to some other country some other
city they can predict how bad it’s going to be when it gets there how long it’s
going to take and so forth so these kinds of simulations now allow us to
make really quite detailed predictions about the large-scale spread of disease
they actually did some predictions about the spread of flu they predicted how
long it would take one year’s flu to get to various different countries and then
after the fact after that year’s flu outbreak they went back and compared
their predictions to what to what actually happened in the real world and
it turned out they actually did pretty well of course they didn’t get
everything right but it was actually surprisingly accurate so I think there’s
good signs that these kinds of techniques
could be very useful you could do exactly the same things with this as you
did with the Portland study you could try out different techniques for
preventing the disease from spreading what happens if we vaccinate all of
these people and so forth you can try different things before it happens and
work out what’s going to be a good strategy for preventing the spread of
disease so I’m very nearly out of time really the last thing I want to do is
just point out that all of this stuff that I’ve talked about in the last three
lectures is in many ways just the beginnings of the field we haven’t been
working on this that long and there are many things that we
still don’t know we would certainly like to get better at measuring networks we’d
like to understand how they change over time the kinds of things I’ve been
talking about today understand how a changing network can change its
performance you know if we reroute the network if we change it in some way
how will it affect the behavior we’d like to get better at predicting Network
phenomena the kind of things that I talked about yesterday predict how a
society will react to something happening we’d like to be able to
prevent disease outbreaks by knowing what the best strategy is for
vaccination and so forth these are all good questions but there are many many
more that I haven’t had time to talk about of course the things that I’ve
been able to tell you about are only a small portion of this whole subject
there are now thousands of people working in this area of network theory
an awful lot of interesting work it’s a very active area but I hope that I’ve
given you a flavor for the sort of things that we’re interested in I’d like
to say thank you very much to all of you for coming here it’s been a
pleasure I’d be happy to answer questions if you’ve got them and thanks
for your attention yeah question down the front here there
are two parameters yeah so one is the average number of citations that the
paper makes I mean how many do you cite which is something that you can measure
in the data and the other one is is slightly more complicated so there’s a
catch here which I didn’t mention which is that if the number of citations you
get is proportional to the number you already have well when your paper’s first published
you don’t have any yet so you wouldn’t get any so that means no paper would ever
get any because they all have 0 when they start so there’s kind of a catch
there so there’s a little kludge there what he did was he said well we’ll make it
not actually proportional but it’s proportional to the number you have plus
some other number of free citations so there’s another parameter there which
basically what it represents is how fast does the paper get cited when it’s never
had any citations in the past and that that parameter in principle could be
dependent on the quality of the paper for instance it’s something which could
be higher for good papers than bad papers so there are those two parameters in
there and what I did in those calculations that I showed you is
I got the values of the parameters by fitting it to the overall
distributions of citations and then I used the same values to predict that
first-mover effect so although there are two parameters in that calculation it’s
effectively a 0 parameter fit because I’ve already fitted the parameters to
the overall distribution so sort of yes and no is the answer
question yes so a good question so yes I admit I have
looked to see how many people have cited my papers and yes absolutely it
seems to be true of my papers well so within physics it is correct that I
was one of the first people to get into this subject and I think that I’ve
certainly benefited from that so yeah I think perhaps I’m not
very good at this stuff at all I just happened to be there first right
yeah so good question the question is
do non-infectious diseases also have network effects and yeah I think
so two things one is for hereditary diseases then there is still
a network in play and it’s basically the genealogical network it’s the network of
who’s descended from whom and certainly there are network effects there the
other is that even with things like obesity and alcoholism
there are indications that those are socially transmitted so
there’s a series of very interesting studies by James Fowler and Nicholas
Christakis where they had social network data and they had data on things like
obesity smoking alcoholism stuff like that and they found that people were
much more likely to be extremely overweight if they had friends that were
extremely overweight and so forth so there is some evidence I’d have to say
that evidence is still somewhat controversial not everybody accepts the
results but certainly this is something people are interested in you know it
seems like a reasonable idea that obesity is something where you know if
if your friends all eat too much then maybe you eat too much as well I mean it
doesn’t seem an unreasonable idea that this could be a socially transmitted
condition so yes absolutely people are looking at that
yes absolutely yes so basically if the curve
is flatter meaning that there are more hubs out there then it tends to be more
fragile if there are hubs out there which really have a lot of
connections you know huge numbers of connections then removing just one of
those is very devastating for the network whereas if the curve is steeper
and the highest-degree hubs don’t have that many connections then removing them is
still important but it doesn’t have quite such a big effect so absolutely
you’re correct it certainly depends quite strongly on sort of what the slope
of that curve is yes another question down there what do you mean by context
can you give an example okay so yes so very good
so absolutely those things do affect things like the spread of disease there
are some there are many good non network reasons why things like poverty affect
the spread of disease people have poorer health care and so forth poorer
education but there can also be Network effects there because people who are in
different situations also have differently structured networks so one
of the things that you notice for instance is that poorer people tend to
have more geographically local networks and rich people tend to you know fly on
airplanes and things all over the place so their networks are more spread out and
this can have a big effect on the spread of disease actually in a mostly
local network diseases will spread mostly locally in a network with these
long range connections disease can spread very quickly to places far away
there’s actually an interesting contrast to be drawn here between modern-day
networks where the transmission of disease is dominated as I said by
airline flights people getting on airplanes and taking diseases to
different places and what happened centuries ago before we had long-range
travel if you look at something like the Black Death in the 14th century which
was a form of hemorrhagic fever there was a big outbreak of it in Europe
starting in 1347 it actually arrived in Europe first in Constantinople and then
spread along the Mediterranean by means of trade ships and then moved northwards
through Europe and I’ve seen maps of this where there were these
sort of contours every six months the disease is just advancing a certain
distance so it’s moving geographically and the reason for that is that
people in those days only knew the other people who lived geographically close to
them nowadays you see this long range transmission where you know a disease
can start in Singapore and within a week it can be in Toronto in those days it’s
very clear that it was sort of this geographical transmission where it just
moved it was a couple of hundred miles every six months so a completely
different kind of transmission of the disease indicating a completely
different underlying structure of the network over which the disease is being
transmitted yes question there so absolutely not first forth so what
I’m saying is that with things like that that obey this preferential attachment
law it appears that the important thing is to get that early lead and there are
various things that can get you an early lead one of them of course is just
writing a paper that’s really good and everybody takes notice but another one
for instance is getting your paper published in a high-profile journal
something that people take notice of and that’s of course a part of this peer
review process you know the referees can say no this paper is suitable for a
better journal or a less good journal and that’s one of the things that
certainly affects how much people notice so it I mean it’s certainly true that
all the various things to do with the quality and content of a paper play into
how well it will get referenced but they’re mostly playing into it via the
question of whether it gets this early lead because really once you’ve
established at the beginning is this a paper that’s going to get cited or not
get cited after that this preferential attachment thing seems to take over
and just amplifies whatever you started off with yes question back there right so that particular study that was done
by Colizza and collaborators was a study of SARS and so the SARS outbreak in
2003 started in southern China so they chose that as the sort of epicenter for
that particular simulation and then were trying to predict you know how the
disease would have spread from there and they compared their predictions against
what happened in the actual SARS outbreak
for instance when it came to North America it particularly affected Toronto
and Vancouver and so they wanted to sort of compare that they did that
deliberately so they can see how well their simulation method was working
compared to what the real SARS did however it’s absolutely the case that
if it were some other disease that we were thinking of then it might start in
some other part of the world they certainly haven’t done studies for every
other possible starting place but it’s exactly the sort of thing that you would
want to do if you want to understand how a disease starting in one place can
reach other places yes question I mean crazy creative network somehow
yeah so possibly I mean so there certainly have been many advances in
recent years where people have created networks between people that didn’t
exist before I mean the ones that get the most publicity are of course online
things but there’s no reason in principle why you can’t go about
creating networks like this of course social engineering of that kind is a
tremendously difficult topic it’s not something that we understand how to do
well sometimes people try to engineer a network and it fails miserably and other
times they don’t really know what they’re doing and yet it’s a wild
success we haven’t learned yet how to make these things predictably work
well but certainly you know this is the kind of thing that people have tried
in the past and sometimes it works so so in principle it’s possible but I think
that there’s a big gap in our knowledge there about exactly how you go about
doing that maybe one more question yes a question over there uh-huh yeah that’s that’s a good question so as
far as I know no they haven’t in particular actually I don’t think that
was something they considered when they did that airline simulation study the
way that worked was you know there were people in the city and some of them had
the disease and then they flew to the other city and gave it to other people
in the city but you’re absolutely right now you mention it I’ve had the same
experience I think that you know airplanes are pretty bad places for
catching the flu from other people and stuff like that
you know you’re stuck in this little box in the sky for several hours
with a bunch of other people and you know circulating air and so forth
so yeah you’re right I wouldn’t be that surprised if that was an important
vector for the spread of disease and I actually haven’t seen that considered in
these studies so maybe that’s an element that’s missing from them at the moment
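
The airline picture described in that answer (people in one city have the disease, some of them fly to the other city and infect people there) can be sketched as a toy two-city model. This is only an illustration under stated assumptions, not the method used in the study mentioned: the function name `simulate_two_cities`, the SIR-style update, and every parameter value (`beta`, `gamma`, `travel`, `population`, `seed_cases`) are made up for the example, and nothing is transmitted during the flight itself.

```python
def simulate_two_cities(days=200, beta=0.3, gamma=0.1, travel=0.001,
                        population=1_000_000, seed_cases=10):
    """Toy two-city epidemic; all parameter values are illustrative assumptions.

    Returns a list with one (infected_in_A, infected_in_B) pair per day.
    """
    # State per city: [susceptible, infected, recovered]
    cities = [
        [float(population - seed_cases), float(seed_cases), 0.0],  # city A: outbreak starts here
        [float(population), 0.0, 0.0],                             # city B: initially disease-free
    ]
    history = []
    for _ in range(days):
        # Step 1: local transmission and recovery inside each city.
        for c in cities:
            s, i, r = c
            n = s + i + r
            new_infections = beta * s * i / n
            new_recoveries = gamma * i
            c[0] = s - new_infections
            c[1] = i + new_infections - new_recoveries
            c[2] = r + new_recoveries
        # Step 2: air travel. The same small fraction of every group
        # flies from each city to the other; no infection happens in flight.
        for k in range(3):
            a_out = travel * cities[0][k]
            b_out = travel * cities[1][k]
            cities[0][k] += b_out - a_out
            cities[1][k] += a_out - b_out
        history.append((cities[0][1], cities[1][1]))
    return history


def peak_day(history, city):
    """Index of the day on which the given city (0 or 1) has the most infected."""
    values = [day[city] for day in history]
    return values.index(max(values))
```

With the flights switched off (`travel=0`) the second city never sees a case; with even a small travel fraction it is seeded almost immediately and its outbreak follows the first city's with a delay.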
okay I think you’ve done a brilliant job not only in giving a lecture but in
fielding the questions and responding so let’s give Mark a very hefty hand for
the cleaner
