Live stream: Optimization for Machine Learning


complexity, and of course we also have to think about how we actually make these improvements and how we resample our data intelligently. All of these are very interesting research areas to look into. Today I'll just talk about two of them: some work we did on semantic word representations, and some work on interpretability in the model complexity area.

Here's the outline: first I'll talk about these two projects, and then afterwards I'll say a little bit about my vision for future work. In the first project we look at dynamic word representations and their use in data mining applications; this was a summer project we did at Technicolor.
If you've ever worked in natural language processing, you know there are a lot of different tasks: text summarization, machine translation, sentiment analysis, and so on, and they can be broken down into smaller subtasks like part-of-speech classification. In all of these applications we have to find a way to take words as inputs and produce something as output. So how do you take a word and represent it as a number? It's not really clear. The easiest way is to take a giant vector whose size is the number of words in the English language and say: that vector is all zeros, with a 1 in the position of the selected word. But this creates a lot of problems. For example, suppose we wanted to use a neural network classifier. The first layer of that neural network then has something like 170,000 as its first dimension in terms of the number of parameters, and it reduces down to something like 20 or 30 at the end. Something like this is very hard to train, and even if you could train it, it's going to overfit quite a bit. So it's really not an ideal thing to do.

So what do people do? They use word vectors. Word vectors are basically ways of representing words in a low-dimensional space. Looking back at the one-hot encoding, it seems like a really strange way to represent words, because that vector has very few degrees of freedom: we're saying everything has to be all zeros with a single one, which doesn't use the 170,000-dimensional space very well. If instead we take these words and represent them in a much lower-dimensional space, in a way that geometrically captures some language structure, we might be able to do a lot better. That is what word vectors produce: words living in something like 50- or 100-dimensional space, where words with similar meaning cluster together. You also get interesting algebraic structure; for example, the vector for "man" minus "woman" has the same orientation as "uncle" minus "aunt" and "king" minus "queen". These sorts of things make you feel like the vectors really are geometrically capturing language structure.
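As a quick illustration, here is a minimal sketch of playing with these analogies using pretrained vectors through gensim (this assumes the gensim package and a download of the public "glove-wiki-gigaword-50" vectors; the exact neighbors and scores depend on which embedding you use):

```python
# Minimal sketch: analogy structure in pretrained word vectors (not the embeddings from this talk).
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")   # 50-dimensional GloVe vectors

# "man" is to "woman" as "king" is to ...?  Expect "queen" near the top.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Words with similar meanings cluster together (higher cosine similarity).
print(wv.similarity("cat", "dog"), wv.similarity("cat", "treaty"))
```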
So why is that useful? If we take these word vectors and feed them into a neural network, that neural network becomes much easier to train, and in fact the performance gets much better. You see quotes like "the use of word representations has become a secret sauce for the success of many NLP systems in recent years, across tasks": named entity recognition, sentiment analysis, and so on. Across all these tasks, just by replacing the word with its word vector, you can usually do better.

We're going to take this a step further and use word vectors for something slightly different. Words are not a static quantity: they change over time, with emerging meanings and dying meanings. Old words fall out of use, new words come into use, and many words, like names of brands or names of people, constantly change their associations as that brand or person changes. This is one of the big motivations of this work: if I'm McDonald's and I know that in 1960 my company was associated with family-friendly environments and nice dinners out, and in 2012 it's associated with obesity and hypertension, I really need to know this so I can keep rebranding. Same with politicians: the handsome gentleman at the bottom here is Barack Obama, long before he was president. Keeping track of these associations can be done through word vectors, and it can actually help with these tasks. So the goal of this first project is: can we track word vectors through time to show concept evolution?
To answer that, we have to take a step back and talk about how these word vectors are computed. Let's first look at the static case. For static vectors, we form what's called a co-occurrence matrix. You take a really long corpus, do some pre-processing (discard words that appear too often but aren't that informative, and discard rare words that barely appear at all), and then slide a window through the entire corpus. Every time two words appear in the same sliding window, you add a count. You end up with a second-order account of everything going on in the corpus, and the assumption is that this co-occurrence matrix acts as a similarity matrix. There's a famous quote in linguistics: "you shall know a word by the company it keeps." Words that appear near other words are somehow similar to them in meaning.

If I take this matrix just as it is and do some kind of factorization on it, for example take its top singular vectors, I can already do a lot. This result is from the 1990s: they took a fairly sizable corpus, did an SVD on it, and already they could cluster parts of the body, animals, and places, with a few mistakes (tooth ends up in the wrong cluster), just by that very simple procedure. So this tells us that the assumption that words appearing together are similar in meaning probably holds up quite well.
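A minimal sketch of that pipeline, sliding-window co-occurrence counts followed by a truncated SVD, might look like this (the toy corpus and the window size are just illustrative choices):

```python
import numpy as np
from collections import Counter
from itertools import combinations

def cooccurrence(tokens, window=4):
    """Count how often two words appear together inside the same sliding window."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    counts = Counter()
    for start in range(len(tokens)):
        for a, b in combinations(tokens[start:start + window], 2):
            if a != b:
                counts[(index[a], index[b])] += 1
                counts[(index[b], index[a])] += 1
    C = np.zeros((len(vocab), len(vocab)))
    for (i, j), c in counts.items():
        C[i, j] = c
    return vocab, C

# Toy corpus; in practice this is a very long, pre-processed text.
tokens = "the cat sat on the mat while the dog sat on the rug".split()
vocab, C = cooccurrence(tokens)

# Low-dimensional word embeddings from the top singular vectors of the count matrix.
U, S, _ = np.linalg.svd(C)
embeddings = U[:, :2] * S[:2]     # one row per word
print(dict(zip(vocab, np.round(embeddings, 2))))
```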
So how has the technology changed since then? One issue people noticed is that the English language is highly unbalanced: a few words appear almost all the time, and most words hardly appear at all. This makes the co-occurrence matrix unstable: some numbers are way too big and everything else gets squashed. So we do the typical tricks for this kind of scenario. We do some weighting, dividing by the frequencies, and then, since we don't want the big numbers to be too big, we take the log. Now it looks like a pointwise mutual information matrix. Then there's another issue people ran into when working with this matrix: if a pair of words basically never co-occurs, the weighted co-occurrence value is close to zero, and taking the log gives something very, very negative, which throws things off; precisely where you barely have enough samples, you get huge values. So they just threshold at zero. In reality, what you're really working with is the PPMI matrix, the positive pointwise mutual information matrix, and this is what we'll use to represent similarity between words.
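Concretely, the weighting just described (divide by frequencies, take a log, clip at zero) can be sketched as follows, starting from a co-occurrence count matrix `C` like the one above:

```python
import numpy as np

def ppmi(C, eps=1e-8):
    """Positive pointwise mutual information from a co-occurrence count matrix C."""
    total = C.sum()
    p_ij = C / total                                  # joint probability of the pair (i, j)
    p_i = C.sum(axis=1, keepdims=True) / total        # marginal probability of word i
    p_j = C.sum(axis=0, keepdims=True) / total        # marginal probability of word j
    pmi = np.log((p_ij + eps) / (p_i * p_j + eps))    # weight by frequency, then take the log
    return np.maximum(pmi, 0.0)                       # threshold at zero: barely-sampled pairs otherwise blow up

# Y = ppmi(C); the embeddings then come from a factorization of Y, e.g. its top singular vectors.
```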
So I take this matrix and do some kind of factorization, some way of writing it in a U Uᵀ form, and each row of that U matrix is then my word embedding. This technique is pretty much how most word embeddings are created, in some way or another. The embedding is the solution of a non-convex optimization problem, so how you actually go about solving it will affect the quality of your solution, but ultimately the different methods are all trying to get to this same object in different ways. In the past people looked at the raw count matrix; more recently at the PPMI matrix. Whether you're doing an SVD, a shallow neural network, or a least-squares problem, they're all doing this kind of matrix factorization, and they get different performance on different tasks because they take different routes to get there.

That's how the static case works. Now, what we're trying to do is concept evolution: we're trying to understand how words change meaning over time. The simplest way to do that, it seems, is to take the corpus, slice it into time frames, and for each time frame do exactly what we did for the static embedding: do some kind of factorization, look at the vectors, and see what happens. Unfortunately, this will not work.
So what's wrong, why won't this work? There's basically a rotational invariance when it comes to matrix factorization. You can think about it like this: if I tell you the distances between all the points in a configuration, that tells you the configuration, but it doesn't tell you the rotation or the translation. So if I take two different time frames and do this kind of matrix factorization independently, the meaning may not have changed, but the vector will change quite a bit. If I don't account for this rotational invariance, it looks like a word jumped from over here to over here, when in fact the meaning didn't change at all. Most works on dynamic word representations differ basically in how they go about enforcing this smoothing, dealing with this rotational invariance; that's the main mathematical part of it.

I'll go through a couple of examples. The first approach basically says: give me the word you want to look up, and I'll do the rotation on demand. They have one corpus that is very well sampled, and that's the base-time corpus. Now you give me a new year: instead of "cat" in 1990, I want "cat" in 2016. I look at the word "cat" and the words close to it, maybe "dog", "turtle", "pet", all these different things, and I take that configuration and rotate it back onto the base-time configuration. The hope is that, for the most part, the configuration is the same: a couple of words may have changed meaning, but most of them haven't. How do I do this rotation? I solve an optimization problem: minimize the distance between the two configurations over an orthogonal matrix. This is called a Procrustes problem, and it has some interesting history. Procrustes was a Greek highwayman, I guess, who would go around collecting people and convincing them to come to his home, where he had a bed of an odd shape. He would tie people down to the bed, and if they didn't fit he would cut off their arms and legs, or stretch them out. So he was essentially a serial killer, and mathematicians decided that was the best name for this problem. So basically this method works by solving a series of very small-scale Procrustes problems on demand, in real time.
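The orthogonal Procrustes problem has a closed-form solution through an SVD. Here is a minimal sketch (scipy.linalg.orthogonal_procrustes implements the same thing); the toy data just checks that a rotated copy of an embedding matrix gets aligned back:

```python
import numpy as np

def procrustes_align(A, B):
    """Orthogonal R minimizing ||A @ R - B||_F, so that A's embeddings live in B's frame."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))                    # e.g. 100 word vectors in 50 dimensions
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))    # a random rotation
B = A @ Q                                             # same configuration, different frame
R = procrustes_align(A, B)
print(np.linalg.norm(A @ R - B))                      # ~0: alignment recovers the rotation
```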
This is nice for a couple of reasons. First of all, every problem you need to solve is really small: it's just one word and its neighbors, which is not that big. But you have to deal with things on demand: every time I want a word, I need to do this computation again, so it can be a little slow. It also runs into problems if for some reason the base-time corpus wasn't sampled well enough: then you have noise there and you won't get a good meaning. But it's definitely one valid way of doing it.

The medium approach is, instead of rotating one word and its neighbors at a time, to rotate the entire matrix at once. Then I only have to do it one time, and it's not an on-demand computation. It's a slightly bigger problem to solve, since instead of a word and its neighbors we're looking at the entire vocabulary, but there's another issue: this still doesn't take care of the undersampling problem. Say I have three time slices: I take this slice and rotate it onto the next one, then take that one and rotate it onto the one after. What happens if the middle one was a badly sampled year? I've forced a noisy configuration to fit here, and then forced the next one to fit that, so you see a lot of issues in the alignment between t = 3 and t = 1, because you're forcing everything to fit a noisy configuration. Still, this is also very simple, and it actually works quite well.

The method we're going to push is the third one: we solve a much larger optimization problem, and instead of solving a series of Procrustes problems, we deal with all of the alignments at once. The main reason is that now we're not forcing these configurations to be exactly the same, only to be close, and that helps a lot with the issue of an undersampled year. Of course there's a trade-off: instead of solving smaller problems, we're now solving one big problem. The earlier ones were moderate in size, but this one involves all the time slices at once, so the hope is that we can come up with some way of doing this at large scale.
The solution to that is actually quite simple: we end up using block coordinate descent. This is the problem we want to solve, and we first reformulate it to break the symmetry. If you're familiar with block coordinate descent, the idea is to pick one time slice, fix all the other time slices, and minimize with respect to that one slice alone. Doing that on the data-fit term is fine, it's just a quadratic problem; doing it on the symmetric U Uᵀ term is really hard, because it's a quartic problem, and minimizing a quartic polynomial is much harder than minimizing a quadratic one. So we break the symmetry: we take one of the U's, rename it V, and add a regularizer that keeps U and V close. It's a very common trick. So you break the symmetry, add that regularizer, add the two smoothing terms across time, and then add two more regularizers for good conditioning. Now the problem for one time slice is actually fairly easy: it's just the solution of a least-squares problem. We do some grid search over the different parameters, τ, λ, and γ, and run for however many epochs it takes until the solutions look nice. That's how we compute our dynamic word embeddings.
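Written out, the joint objective is roughly of the following form; this is my reconstruction from the description above, so the exact terms and weights should be checked against the paper:

$$
\min_{\{U_t\},\{V_t\}} \sum_{t=1}^{T}\Big(\tfrac12\big\|Y_t - U_t V_t^\top\big\|_F^2 + \tfrac{\gamma}{2}\|U_t - V_t\|_F^2 + \tfrac{\lambda}{2}\big(\|U_t\|_F^2 + \|V_t\|_F^2\big)\Big) + \tfrac{\tau}{2}\sum_{t=2}^{T}\Big(\|U_t - U_{t-1}\|_F^2 + \|V_t - V_{t-1}\|_F^2\Big),
$$

where $Y_t$ is the PPMI matrix for time slice $t$, the $\gamma$ term ties the two factors back together after breaking the symmetry, the $\lambda$ terms give good conditioning, and the $\tau$ terms are the smoothing terms that handle the alignment across time. With every block fixed except one $U_t$ (or one $V_t$), the subproblem is an ordinary least-squares solve, which is what makes block coordinate descent convenient here.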
Now we apply this to a corpus. The corpus we picked is the New York Times, roughly 1990 to 2016: about a hundred thousand articles across 59 sections. We go through, do some pre-processing, removing stop words and rare words, and ultimately end up with a vocabulary of about 21,000 words.

To give some results: this is the concept evolution of the word "apple" from 1990 to 2016. What do we notice right away? We really thought the New York Times was a fairly neutral corpus that would just tell us the meaning of the word, but it's a little corpus-specific. In the New York Times in the 1990s, when you're talking about apple, it's recipes, home and family life, so most of what you see here are ingredients. Then over time the neighbors turn into things like tablet, iPod, iPhone, and then Google and Microsoft. Similarly with "amazon": it starts out with the rainforest, then moves into e-commerce, and nowadays it's more about content consumption. So you can actually see these associations shift over time.

One really fun thing to do is to see what happens with people. Obama, up through about 2006, is associated with his more academic self, the university professor, and later on more with his presidential life. President Trump similarly starts out with real-estate-type associations, which is how he usually made the news back then, and more recently he makes the news in many other ways. It would actually be quite interesting to extend this to 2019, but unfortunately we did this project in 2016, so you'd just have to see what has happened since.
One thing we were able to do in this project is change point detection. In data mining, one thing we care about is not just how a concept drifts along, but when the concept changes from one thing to another. What we noticed is that the vector norm is very closely correlated with the frequency of the word, but because of our smoothing terms it is much less noisy than raw frequency when used for change point detection. In the past you would use some count-based method: count how many times the word appears and decide that this is when the change point happened, but it's hard to tell because of all the spikes. Another issue is that when you look at concepts of very different magnitudes, one concept tends to completely overshadow the others and can make them effectively disappear. What we found is that the vector norms we compute are much more robust against this. So we can detect, for example, when a president appears: this is Bill Clinton first, with Hillary Clinton over here; we see the first Bush, the second Bush, and Obama and Trump rising at the end. Similarly, we can detect other concepts rising and falling a little more steadily.
So this looks like it might be a very useful tool for change point detection in data mining. A task we tried, and then decided we didn't want to build an even more complicated optimization problem around, is this one: we give you a word, say Obama in 2016, and ask for the closest word in every other year. From that we can essentially detect the current president for all the different years, and this is pretty robust against parameter tuning, the number of epochs, and so on; we can almost always recover the correct president. We did the same thing with New York City mayors, and we recovered almost all of them: in 2006 we didn't get any people's names at all, and in 2011 we got Cuomo, who is the governor of New York, so it's pretty close.

We were pretty surprised; we thought we had created something magical. After thinking about it for a while, we realized there may be one reason why this works so well: the New York Times is a pretty respectful newspaper. When they mention a president's name during his presidency, they always preface it with "President", so all those words are being pulled toward "president" very tightly over time, and the same goes for "New York City mayor". So we thought we needed to give it a real challenge: a question that doesn't have a title attached to it. We don't refer to tennis players by title, so we asked: who is the ATP number-one-ranked male player in each year? The bolded entries are the correct answers; the non-bolded ones are incorrect, but they're all tennis players. So it got close: it wasn't answering the question exactly, but it was giving a ballpark answer in each case. This was, I guess, the more honest question-answering test that we put it through.
So, to conclude this project: word vectors are a way of representing words in a geometrically intuitive way, and they're usually used as pre-processing for natural language processing tasks. The way to compute them is basically some variant of matrix factorization on a similarity matrix, which is usually some transformation of the co-occurrence matrix. Dynamic word embeddings can be used to track concept evolution, and we compute them basically by computing word vectors over time slices, but we have to do something to correct for rotational misalignment. The same applies not only to dynamic word vectors but also to cross-corpus word vectors: if we want to compare two words from two different sources, say the New York Times and MSN, we would have to do the same sort of thing. And we showed some results on our New York Times corpus: word trajectories, change point detection, and question answering. I'll pause here to ask if there are any questions on this.
[Audience question, partly inaudible, about whether the embeddings were evaluated quantitatively.] Yes, that's what we did, although the results I showed are qualitative. We do also have quantitative results; quantitative evaluation is a little trickier in NLP, because usually you propose a benchmark, make it public, and then validate on it, since there just aren't that many test sets out there. For example, one thing we did was look at the association of words to each other according to which section of the paper they appeared in; we think of those as two completely different ways of associating words and then see how well they match up. So we can do this for a number of questions.

[Audience question about using reinforcement learning, as in some of the Google work, for question answering.] I think if the task is actually question answering, reinforcement learning is exactly the way to go. We were using question answering more as a way of evaluating how good our embeddings are, so we wouldn't want it to cheat by being able to resample the question, but for question answering in general, that would be the way to go. Thank you; so now I'll move on to the second project.
This project is a bit of a change of gears: it's more about machine learning interpretability and understanding how solvers actually work. The topic is finite support identification and proximal methods. The goal now is to solve a sparse optimization problem, which we frame as minimizing $F(x) = f(x) + \lambda\|x\|_1$, where $f$ is a convex function and $L$-smooth, meaning the gradients can't differ too much: if $x$ and $y$ are close, then $\nabla f(x)$ and $\nabla f(y)$ are close. $\lambda$ is a regularization parameter provided to us by the application, and the last term, the $\ell_1$ norm, is the sum of the absolute values of the entries of $x$. Here $f$ usually plays the role of a loss function, and the $\ell_1$ term acts as a kind of regularization for model complexity.

Problems of this form appear in many applications: signal processing, image recovery, bioinformatics. In machine learning, this kind of problem is used to prevent a model from overfitting. If we're trying to separate the X's from the O's, we can do so by drawing a fairly straight line, or we can use a really curvy line that perfectly separates the two data sets; but we actually want the simple model, because that's the one that will work better on data we haven't yet seen. The curvy one would fit our training data, but that's not really what we're interested in. There are a lot of ways of reducing model complexity; the one we'll focus on here is adding this regularizer, which forces $x$ to be sparse. If you force most of the vector to be zero, you're forcing model simplicity, while letting the algorithm choose which entries are zero and which are nonzero.
An application where sparse optimization is useful is feature selection. In feature selection, you're not actually trying to solve the optimization problem to high accuracy and figure out what the final vector $x$ is; you just want the support, the set of indices $i$ where $x_i^\star$ is nonzero. Why would we want that? In gene expression, maybe I just want to know which genes are responsible for some phenotype, which genes cause breast cancer; I don't really care how strong the effect is, I just want to know which ones they are. Or drug efficacy: I want to know which drugs to look into to prevent some disease, and I want a small number of drugs, not a huge cocktail, but at this point I may not care about their exact levels. Or object detection: in an image like this, I only need a dot here and a dot there; I don't need to know what's going on in the rest of the image. So there are a lot of sparse optimization problems where I don't actually care about the full result; I just care about the sparsity pattern.
These problems are usually solved with what's called a proximal method. A proximal method relies on the proximal operator, a vector-to-vector map defined as the solution of a small optimization problem. For our purposes we just care about the $\ell_1$ norm, whose proximal operator acts as a shrinkage operator: take the vector and look at every element; if it is greater than $t$, subtract $t$; if it is smaller than $-t$, add $t$; everything in between gets squashed to exactly 0. So it pushes things that are close to zero to exactly zero. You can see it in this two-dimensional example: if an element is all the way out here, you pull it back along this direction; over here you pull it until it hits the axis and then it snaps to zero. So you can see why a method that uses this proximal operator will generally give you sparse solutions: once a coordinate gets to 0, it snaps to 0 and stays there. We're going to try to characterize this snapping behavior a little more.
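A minimal sketch of this shrinkage (soft-thresholding) operator, which is the proximal operator of $t\|\cdot\|_1$:

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t*||.||_1: shrink each entry toward zero by t, snapping small entries to exactly 0."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

print(soft_threshold(np.array([3.0, -0.2, 0.7, -2.5]), 1.0))
# entries inside [-1, 1] become exactly zero; the others move toward zero by 1
```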
Proximal methods are methods that repeatedly apply the proximal operator to something that depends on the past iterates, and there is a large family of them. The one we see most often is the proximal gradient method; if $h$ is the indicator of a constraint set rather than the $\ell_1$ norm, it becomes the projected gradient method, so they're basically the same method. This method is used very often in practice, and there are variations on it, for example accelerated versions like FISTA, and you can also look at things like the Douglas-Rachford method and ADMM. These are all types of proximal methods.
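A minimal sketch of the plainest member of the family, the proximal gradient method (ISTA), here for a least-squares loss $f(x) = \tfrac12\|Ax-b\|^2$ with a constant step size $1/L$:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_gradient(A, b, lam, n_iters=500):
    """Proximal gradient (ISTA) for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)             # gradient step on the smooth part ...
        x = soft_threshold(x - grad / L, lam / L)   # ... followed by the proximal (shrinkage) step
    return x
```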
One thing we've observed in practice is that when you run any of these methods on a sparse optimization problem, most of the time they identify the sparsity pattern very early. If you look at the error in the residual, it keeps going down essentially forever; if you look at the error in the support, it drops to zero at some finite time and stays there, and at that point you've discovered the correct support. This is great for us, because it says that if our task is just to identify the support, I can run a proximal method for, say, a hundred iterations and stop; as long as I know that at a hundred iterations I have the right support, I don't need to run it to completion. Often we only care about the red curve here.

Here's another example. In this task we're trying to separate fours from nines in handwritten digits. We don't do much pre-processing: we take the raw images and apply sparse logistic regression. Here I've plotted the objective error, and at each point we can also look at the selected features. The features can be displayed in two dimensions because the data are images, so you can actually see which pixels are being selected, and after a while you see one feature light up the most: it corresponds to the ridge that distinguishes a four from a nine, which is exactly the part of the image that matters, and we're able to discover it pretty early on. The titles of these images contain the train error and the test error, and we see they level off very early. In machine learning there's this problem of not really knowing when to stop, because these performance metrics tend to level off very soon, even when the objective error hasn't gone to zero. I think it's still an open problem, but one thing we do know is that we can tell you when the support has been identified, and that's a good indication of when to stop, because the support is identified pretty early on.
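To see this "locking on" behavior on a toy problem, one can track the support of the iterates alongside the objective (a minimal synthetic sketch reusing the ISTA update above; the exact iteration counts depend on the random instance and the step size):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(0)
n, d, lam = 200, 50, 0.1
x_true = np.zeros(d); x_true[:5] = rng.standard_normal(5)      # sparse ground truth
A = rng.standard_normal((n, d))
b = A @ x_true + 0.01 * rng.standard_normal(n)

step = 1.0 / np.linalg.norm(A, 2) ** 2
x = np.zeros(d)
for k in range(1, 2001):
    x = soft_threshold(x - step * (A.T @ (A @ x - b)), step * lam)
    if k % 400 == 0:
        obj = 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()
        # the nonzero set typically stops changing long before the objective stops decreasing
        print(k, round(float(obj), 4), np.flatnonzero(x))
```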
Now we'll try to characterize all of this mathematically. We say that a method with iterates $x^k \to x^\star$ identifies the support at iteration $\bar K$ if for all $k \ge \bar K$, $x_i^k = 0$ whenever $x_i^\star = 0$. That seems like a somewhat one-sided definition of the support, but it's the side we can prove. The question we ask in this part is: is it true that all proximal methods identify the support in finitely many iterations, and can we actually prove it? The answer is: often it's true, but sometimes it's false. If you look at the literature, basically every author takes one method, analyzes it to death, introducing differential geometry and all sorts of tools, and tells you something about that method. In fact, we can say something across the board: you don't have to look at each method separately; we can say something just by knowing that they all use the proximal operator. This helps a lot, because it gives us not a method-dependent but a data-dependent characterization of when the support can be identified.
The key thing we need to know is how to solve this problem, that is, what the optimality conditions are. Let's look at each index of $x$ separately. Take one $x_i$. If at the solution $x_i^\star$ is not 0, then the $\ell_1$ term is smooth there, so the gradient of the whole objective is zero at the optimum. Setting everything to zero, doing some algebra, and defining $z^\star$ as the negative gradient of $f$ at $x^\star$, we get that $z_i^\star = -\lambda$ if $x_i^\star$ is negative and $z_i^\star = \lambda$ if $x_i^\star$ is positive. What happens if $x_i^\star = 0$? The absolute value has a cusp there, which is being tilted one way or the other by the smooth term, but you can't tilt it too far: if you tilt it too far, the minimum is no longer at zero and it rolls off. How far can you tilt it? From $-\lambda$ to $\lambda$. These three conditions exactly quantify optimality: if $x$ satisfies all three, we know it's the solution of the problem. We can plot it, with $x_i^\star$ on one axis and $z_i^\star$, the negative gradient, on the other: if $x_i^\star$ is positive, $z_i^\star$ must be exactly $\lambda$; if it's negative, $z_i^\star$ must be exactly $-\lambda$; and if $x_i^\star$ is zero, $z_i^\star$ has to be somewhere in the middle, but we don't know exactly where. So if you give me an $x$, and I plot $x$ against the negative gradient and it looks like that, then I can tell you for sure this is the correct solution; no more work needed.
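Written out, these are the standard optimality conditions for the $\ell_1$-regularized problem: with $z^\star = -\nabla f(x^\star)$,

$$
z_i^\star = \lambda \ \text{ if } x_i^\star > 0,\qquad
z_i^\star = -\lambda \ \text{ if } x_i^\star < 0,\qquad
z_i^\star \in [-\lambda, \lambda] \ \text{ if } x_i^\star = 0 .
$$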
The key lemma here is what we call the wiggle room lemma: it tells you how much error you're allowed to have and still be able to identify the support correctly. We have a data-dependent quantity, $\delta_{\min}$, and we say that for any algorithm of this form, one that applies the proximal operator to some point $y^k$, the current support is identified at iteration $k$ if the residual is smaller than this data-dependent constant. That's a mouthful, so let's look at it a little more intuitively. We define $\delta_i$ as the gap between $z_i^\star$ and the edge of the band $[-\lambda, \lambda]$. If $x_i^\star$ is nonzero, this gap is zero, so that's not the interesting case; when $x_i^\star = 0$ there is some gap, we measure it, and the smallest of these gaps is $\delta_{\min}$. Now at some iteration $k$ we have a noisy $x$, which gives us a noisy input $y^k$ to the proximal operator. How much noise can we have and still satisfy the optimality picture? On the coordinates where $x_i^\star$ is nonzero, $z_i^\star$ sits exactly at $\pm\lambda$, so there's no slack at all there; but on the coordinates where $x_i^\star = 0$ we're allowed exactly $\delta_{\min}$ worth of noise. With just that little bit of noise, the condition is still met on every coordinate where $x_i^\star$ is zero, and at that point the support is identified. So basically the condition says: if the error in the iterate and the gradient is smaller than this $\delta_{\min}$, then we have identified the support.
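In symbols (this is my reconstruction of the statement from the verbal description, hedging on the exact norms and constants): define

$$
\delta_i = \lambda - |z_i^\star| \ \text{ for each } i \text{ with } x_i^\star = 0,
\qquad
\delta_{\min} = \min_{i:\,x_i^\star = 0} \delta_i .
$$

Then for an iteration of the form $x^{k+1} = \operatorname{prox}_{\eta\lambda\|\cdot\|_1}(y^k)$ with step size $\eta$, if

$$
\big\|\,y^k - \big(x^\star - \eta\,\nabla f(x^\star)\big)\,\big\|_\infty \;<\; \eta\,\delta_{\min},
$$

then $x^{k+1}_i = 0$ for every $i$ with $x_i^\star = 0$, i.e. the support is identified at that iteration.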
So how can we use this lemma to give a support identification complexity rate? This is the condition we need to hold. We take our method and insert $y^k$ according to the algorithm; for proximal gradient descent, $y^k$ is just the current iterate minus a gradient step. We plug that into the condition, rearrange terms, and the residual splits into one term that is the error in the variable and one term that is the error in the gradient. This already tells us that finite support recovery happens whenever $\delta_{\min}$ is positive, because $x^k$ eventually converges to $x^\star$ and the gradient eventually converges to the true gradient, so at some point the total error must fall below $\delta_{\min}$. To figure out how long that takes, we just go through the literature, find our favorite rate for how $x^k$ converges to $x^\star$, and use the smoothness assumption to bound the gradient term. For example, for proximal gradient descent, if $f$ is strongly convex we have linear convergence; inverting that rate to get $\bar K$, we see that the active-set complexity, the support identification complexity, scales like the log of $1/\delta_{\min}$ divided by the rate constant.
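Schematically, and hedging on the constants: if the method converges linearly, say

$$
\|x^k - x^\star\| \;\le\; \rho^{\,k}\,\|x^0 - x^\star\|, \qquad \rho = 1 - \mu/L \ \text{ under } \mu\text{-strong convexity},
$$

then the combined variable-plus-gradient error drops below $\delta_{\min}$ (up to constants involving $L$ and the step size) once

$$
\bar K \;\gtrsim\; \frac{\log\big(C / \delta_{\min}\big)}{\log(1/\rho)},
$$

so the support identification complexity scales like $\log(1/\delta_{\min})$.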
So here we're able to recreate one of the previously known rates using many fewer lines, just by exploiting this wiggle room lemma. That's how it works for proximal gradient descent, and we can apply the same trick to all the other proximal-style methods, which is really the strength of this technique: we can do it for accelerated proximal gradient, Douglas-Rachford, ADMM, and also proximal Newton, which probably hadn't previously been explored. And just to be fair to the previous works: they looked at each of these methods separately and gave many different rates on many different aspects of those methods. We're only looking at support identification, and because we're looking at a small piece of the puzzle, we're able to cover all of these methods in one broad stroke, with much simpler proofs.
That was nice, but we really want to scale up: most of us are probably not even using gradient descent anymore, we're using stochastic gradient descent. So this is work in the stochastic gradient setting. Now, instead of taking gradient steps, we take stochastic steps. What does that mean? The function we're minimizing is a sum of many convex, smooth functions; at each iteration we pick one of them uniformly at random, take a gradient step on it, and then apply the shrinkage. We know this method converges, but there is an additional requirement: the step size has to decay. In stochastic gradient descent the step size has to decay, otherwise it won't converge.
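A minimal sketch of proximal stochastic gradient descent for the same $\ell_1$-regularized least-squares problem (one random term per step, with a decaying step size; this is the variant whose sparsity error does not go to zero):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_sgd(A, b, lam, n_iters=50_000, c=1.0):
    """Proximal SGD for min_x (1/n) * sum_i 0.5*(a_i @ x - b_i)^2 + lam*||x||_1."""
    n, d = A.shape
    rng = np.random.default_rng(0)
    x = np.zeros(d)
    for k in range(1, n_iters + 1):
        i = rng.integers(n)                        # pick one term uniformly at random
        grad_i = (A[i] @ x - b[i]) * A[i]          # stochastic gradient of that single term
        step = c / k                               # decaying step size, needed for convergence
        x = soft_threshold(x - step * grad_i, step * lam)
    return x                                       # small entries hover near, but rarely exactly at, zero
```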
We try to do the same trick we did previously: write out the residual term, organize it into an error in the variable and an error in the gradient, and argue that if both terms go to zero we're fine. So does the gradient term converge in general? Unfortunately, no. Take for example $f(x) = \tfrac12(x-1)^2 + \tfrac12(x+1)^2$. The solution is $x^\star = 0$, but the gradient of the first term there is $-1$, the gradient of the second term is $+1$, and only the gradient of the whole thing is zero. So the individual stochastic gradients can't possibly converge, even once you've actually reached the true solution; this is a fundamental problem with stochastic gradient descent. The second term is also not well behaved in general: previously we said the variable always converges to the true solution, but that was with a constant step size. With a decaying step size, the relevant quantity is the variable error relative to the step size, and if the step decays faster than $x^k$ converges to $x^\star$, that term doesn't go to zero either. So for proximal stochastic gradient descent we're pretty stuck: it doesn't seem easy to prove any kind of support identification, and in practice that's completely consistent with what we observe; support identification does not happen for proximal stochastic gradient descent.

Here we ran a very small example. With constant step sizes nothing converges, so you must use a decaying step size, and then you can see that the variable converges but the sparsity error does not, and that's the same no matter what kind of decay rate you use. One question you might have is: if the variable error goes to zero, why doesn't the sparsity error go to zero too? If this is the true sparsity pattern, you might have something like this: a lot of very small entries that converge to zero but never actually reach zero. The first-order methods we looked at before reach exactly zero in finite time, but stochastic methods just keep getting closer and closer without ever actually reaching zero.
So are we lost, is there no hope? Luckily, there is: there's a newer stream of methods coming in, these variance-reduced stochastic methods. The way they work is that instead of only ever taking these partial gradients, every once in a while they take a full gradient, and because of that they're able to reduce the variance of each step. You might say this is cheating: if I'm allowed to take a full gradient, why don't I just do full gradient descent? But they arrange it so that the number of full gradients you compute is an order of magnitude smaller than what you would need with full gradient descent, and you still get good convergence. Anyway, these methods are appearing now.
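A minimal sketch of one such variance-reduced method, in the spirit of prox-SVRG (compute a full gradient at a reference point once per epoch and correct each stochastic step with it; the step-size choice and other details are simplified here):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_svrg(A, b, lam, n_epochs=30, step=None):
    """Sketch of prox-SVRG for min_x (1/n)*sum_i 0.5*(a_i @ x - b_i)^2 + lam*||x||_1."""
    n, d = A.shape
    step = step or 1.0 / (4.0 * np.max(np.sum(A * A, axis=1)))   # rough constant step size
    rng = np.random.default_rng(0)
    x = np.zeros(d)
    for _ in range(n_epochs):
        x_ref = x.copy()
        full_grad = A.T @ (A @ x_ref - b) / n                    # one full gradient per epoch
        for _ in range(n):
            i = rng.integers(n)
            # variance-reduced gradient: g_i(x) - g_i(x_ref) + full gradient at x_ref
            g = (A[i] @ x - b[i]) * A[i] - (A[i] @ x_ref - b[i]) * A[i] + full_grad
            x = soft_threshold(x - step * g, step * lam)         # constant step size works here
    return x
```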
There are a lot of advantages here. With this variance-reduced behavior you can use a constant step size, and you also get support identification, and you get it fairly quickly. So this shows you can scale up and still get support identification, and we're actually able to prove it, again using the wiggle room lemma, for the variance-reduced methods, but not for stochastic gradient descent. So our proof technique is consistent with what's observed in practice.
Okay, so that's the main point here. We're trying to solve a sparse optimization problem where we know the solution is mostly zero. We're looking at a family of methods, these proximal methods, which we observe in practice tend to lock onto the zeros early on: they identify the support early. We characterize this using the wiggle room lemma, which we prove with that geometric argument, and from that lemma we're able to give finite support identification complexity bounds for deterministic proximal methods and for stochastic variance-reduced proximal methods; for proximal stochastic gradient descent we instead offer some intuition as to why it cannot happen. So that's the second project. I'll pause here to ask if there are any questions.
[Audience:] My question goes back to the slide about digit classification, where you were looking at the minimum number of pixels needed to classify between a four and a nine. There has been some work on this, for example recurrent attention models; have you compared your results with attention models? They use reinforcement learning to do image classification while trying to find the least number of pixels required to classify a digit.

I'd guess that probably works a lot better. This is a very simple method, something you can run in two or three lines, and people use it mostly because it's easy to run. But if your goal really is to identify the sparse support set, something like that would work a lot better, especially if your images have semantics you're trying to capture.

[Audience:] When you say number of iterations, what does it really mean? What is a hundred iterations? What are the inputs and outputs, and why are we stopping there?

This chart is mostly there to show the relative difference between roughly three hundred iterations, when the support is identified, and around a thousand iterations, when your solver might actually tell you this is an acceptable place to stop. The input is the MNIST digits, around 60,000 images, maybe eleven thousand once you restrict to fours and nines, and the output is 1 if the digit is a four and 0 if it's a nine. If I used a different dataset or a different step size, I would get a different number of iterations before any of this happens, so that's definitely a good point to keep in mind. I think the relative gap between when the support is identified and when the residual goes to zero is maintained, but it's true that the absolute numbers are somewhat meaningless; they depend on the dataset and on the step size. Thank you for the questions.
Now I'll talk a little bit about my vision for future projects. In the first project we saw how a pre-processing trick, computing these word vectors, significantly helped machine learning across natural language processing. In general, real-world data is very diverse, and we already do a lot of pre-processing for a lot of different reasons, whether it's anonymization or encoding for robustness, but also in representation learning. What we actually end up feeding into our classifier, the distribution of that data, affects how well the classifier works, through properties like separability or incoherence, and we'd like to be able to characterize this better.

One thing we realized in the second project is that we have this data-dependent constant, $\delta_{\min}$, which unfortunately depends on the solution of the problem. It's data dependent in the sense that it doesn't depend on the method at all, but you can't ascertain it without actually solving the problem first, which is a little unsatisfying. So one thing I think really should be done is to understand how each of these constants depends on the data. This is a very open question: how do things like $\delta_{\min}$, or strong convexity, or all these other constants relate to things like the margin or the incoherence of the data? There has been quite a bit of work on this already; since around 2010, notions like restricted strong convexity and pseudo-self-concordance give data-dependent ways of characterizing how well the optimization will do, but I think there's a lot more work to do in this area; it's a very open question in general.
You also mentioned reinforcement learning earlier. Reinforcement learning, and online optimization problems in general, present an interesting twist to optimization: we're trying to optimize an objective $f$, but $f$ changes according to the data, and you only observe it one sample at a time. So the goal here is not just to say how fast a method converges, but how many samples we have to see before it converges. This is a very important problem, and it shows up in reinforcement learning, boosting, and game theory, areas where we know there are a lot of really nice techniques that can get a good solution without seeing everything, but where it's more intuition than theory for how to get there. Just as an example: in reinforcement learning we try to learn a policy or a Q-function which is parametrized by a sparse sum of basis functions. We'd like to know things like: what kind of basis functions should we use to characterize the whole state space? Can we offer convergence guarantees based on sparse observations? How many observations do we need for each kind of basis? I think this is still a very interesting open question.
And of course everyone is interested in deep learning, and we should be, because it's becoming such a big tool, and there are a lot of very interesting optimization questions there that are not yet solved. I don't know if you've seen the test-of-time talk at NIPS 2017, where Ali Rahimi gave a whole list of things about deep learning that we sort of know how to do but don't really understand. There are things on that list like batch normalization or dropout, and other things I've seen people use a lot, like minimizing entropy to encourage clustering behavior; these are not really standard optimization practices. While we understand quite well how two-norm and one-norm regularization reduce model complexity, we don't know how these tricks reduce model complexity, and these are definitely things we should look at more. Even the simple question of depth is not very well answered: we know deeper networks have higher model complexity, but in general people don't just use a really deep network; they need some trick to train it, otherwise it's impossible to train, so they add things like memory elements or attention networks, and those are not characterized in the fundamental theory of the expressibility of deep neural nets. That's something to look at.

Probably the most popular optimization problem in deep learning today is generative adversarial networks, the GANs. We know GANs can generate very interesting pictures from distributions, and they're framed as a saddle-point problem, or a two-player game, where one player offers images and the other player decides whether each image is a real image or a generated one, and that is supposed to make both players better. But what often happens is that the first player produces noise and the other player correctly classifies it as noise: this is a technically valid solution, but totally useless. So the open question is how to construct and train these GANs to avoid this mode collapse problem, and that is a very important optimization problem.
So, thank you for coming to this talk.

[Audience:] I'm curious to know your comments on model complexity versus real-time updates, the over-the-air updates that have recently become popular, where you have a model on a server and you want some real-time updates. Do you focus more on the complexity of the model, or on the real-time update, more than on accuracy?

Sorry, I'm actually not very familiar with that recent work; how does it work?

[Audience:] To put it more simply: would you prefer accuracy or latency in real-world cases?

I would have to ask the machine learner which one is more important, that is, whether the new data is actually important. Maybe the user is reacting to the way the model currently behaves, and then it's very important; or maybe it's just epsilon more samples, in which case it's not that important.

[Audience:] You mentioned anonymization; typically in healthcare there is this issue of anonymization. How would you deal with privacy and anonymization in this case? Because if you can extract the vectors and the directions, there's still information there that could let you recover an initial guess of the data.

One thing we did look at is sparse sketching matrices, though I'm not sure they effectively do what you're describing. A sparse sketching matrix is roughly the equivalent of multiplying the data matrix by a Gaussian random matrix, but in a way that doesn't destroy sparsity, so you still have computational efficiency. Whether that solves the privacy issue is still an open question to me, because maybe you can still recover the data; it certainly wouldn't be easy to recover, but the relationships would still be there, so two similar patients would stay similar. You can apply it on the data side: you have your data feature matrix, and you can multiply by a random matrix on one side or the other, and hopefully, applied on the right side, it scrambles up the users at least.

[Audience:] The finite support identification holds because the proximal operator shrinks things and that gives you the sparsity; does it also hold for linearly constrained gradient methods like Frank-Wolfe?

Frank-Wolfe does have this property, but it's not the same proof. It has sparse iterates more because it adds atoms one at a time and stops when it has enough atoms, so usually it will pick the right ones first. The analysis would not be the same; that's actually one thing we're looking to extend to, but it's different because Frank-Wolfe doesn't use a proximal operator, it doesn't shrink, so it approaches sparsity from the opposite direction. But yes, Frank-Wolfe is very useful for this type of problem.

If there are no more questions, I'd like to thank our speaker.
