complexity, and of course we also have to think about how we're actually going to do these improvements and how we resample our data intelligently. I think all of these are very interesting research areas to look into. Today I'll just talk about two of them: some work we did on these evolving semantic distributions, and some work on interpretability in the model complexity area.

Okay, so just an outline: first I'll talk about these two projects, and then afterwards I'll talk a little bit about my vision for future work. In this first project we're looking at dynamic word representations and their use in data mining applications; this was summer project work we did at Technicolor.

Okay, so if you've ever worked in natural language processing, you know that there are a lot of different tasks, whether it's text summarization or machine translation or sentiment analysis, and they can be broken down into smaller subtasks, like part-of-speech classification for example. In all of these applications we have to find a way to take words as inputs and produce something as output. So how do you actually take a word and represent it as a number? It's not really clear. The easiest way is to take a giant vector whose size is the number of words in the English language and say, okay, that vector is all zeros, with a 1 if that word is being selected and 0 otherwise. But this creates a lot of problems. For example, suppose we wanted to use a neural network classifier: the first layer of that network would have something like 170,000 as its first dimension in terms of the number of parameters, and it then reduces down to something like 20 or 30 at the end. Something like this is very hard to train, and even if you could train it, it's going to overfit quite a bit, so it's really not an ideal thing to do.
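As a minimal sketch of that encoding (the vocabulary here is a toy stand-in for a real ~170,000-word vocabulary):

```python
import numpy as np

vocab = ["cat", "dog", "apple"]  # toy stand-in for a ~170,000-word vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Represent a word as a vector of all zeros with a single 1."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

print(one_hot("dog"))  # [0. 1. 0.] -- at real vocabulary sizes this is huge and sparse
```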

So what do people do? People use word vectors. Word vectors are basically ways of representing words in a low-dimensional space. Actually, if we look back at the one-hot encoding, it seems like a really strange way to represent words, because the vector itself has very few degrees of freedom: we're saying that everything has to be all zeros with a single one, so it's not really using the entire 170,000-dimensional space very well. If instead we take these words and represent them in a much lower-dimensional space, in a way that geometrically captures some language structure, we might be able to do a lot better. And that's what these word vector methods actually produce: words living in something like 50- or 100-dimensional space, where words of similar meaning cluster together. You also get interesting algebraic structure: for example, the orientation of the vector man − woman is the same as uncle − aunt and king − queen. These sorts of things make you feel like these vectors are actually, somehow, geometrically capturing language structure.
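A hedged sketch of that analogy arithmetic, assuming we already have an `embedding` dict mapping words to numpy vectors (the dict and its contents are hypothetical):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, embedding):
    """Solve a - b = x - c, e.g. man - woman + queen -> x ~ king."""
    target = embedding[a] - embedding[b] + embedding[c]
    # return the vocabulary word whose vector is closest in cosine similarity
    return max(embedding, key=lambda w: cosine(embedding[w], target))
```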

So why is that useful? Now when we take these word vectors and put them through a neural network, the neural network becomes much easier to train, and in fact the performance becomes much better. You see quotes like "the use of word representations has become a sort of secret sauce for the success of many NLP systems in recent years, across tasks." Named entity recognition, sentiment analysis: across all these tasks, just by replacing the word with its word vector, you can usually do better.

Okay, so we're actually going to take this a step further and use these word vectors for something slightly different. We know that words in general are not a static quantity: they're changing over time, with emerging meanings and dying meanings. Old words fall out of use, new words come into use, and we also have a lot of words, like names of brands or names of people, that are constantly changing their associations as that brand or person changes. This is actually one of the big motivations of this work: if I'm McDonald's, and I know that in 1960 my company was associated with family-friendly environments and nice dinners out, and in 2012 it's associated with obesity and hypertension, I really need to know this so I can keep rebranding. Same with politicians: the handsome gentleman at the bottom here is Barack Obama, our friendly President Barack Obama, long before he was president. Keeping track of these associations can actually be done through these word vectors, and it can help with these tasks. So the goal of this first project is to ask: can we actually track word vectors through time to show concept evolution?

Okay, so we have to take a step back now and talk about how we actually compute these word vectors. Let's first look at the static case. For static vectors, the way we do it is to form what's called a co-occurrence matrix. You take a really, really long corpus and do some pre-processing: you discard words that appear too often but really aren't that informative, and you discard words that don't appear much at all. Then you use a sliding window and slide through the entire corpus, and every time two words appear in the same window you add a count. You end up with a sort of second-order account of everything going on in the corpus. The assumption here is that this co-occurrence matrix acts as a similarity matrix. There's a famous quote in linguistics: "you shall know a word by the company it keeps." Words that appear near other words in text are somehow similar to them in meaning.
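As a minimal sketch of that construction (tokenization, the window size, and the pre-processing thresholds are all simplified assumptions here):

```python
from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count how often two words appear within the same sliding window."""
    counts = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            pair = tuple(sorted((tokens[i], tokens[j])))
            counts[pair] += 1
    return counts

corpus = "the cat sat on the mat near the dog".split()
print(cooccurrence_counts(corpus, window=3))
```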

So if I just take this matrix the way it is and do some kind of factorization on it — for example, I take its top singular vectors — already I can do a lot of amazing things. This old result is from the 1990s: they took a fairly sizable corpus and just did an SVD on it, and already they could do a lot of clustering, like parts of the body, animals, and places, with some mistakes ('tooth' lands in the wrong cluster), but they do very well with that very simple procedure. This tells us that the assumption that words appearing together are similar is probably very true.

technology changed since then well one issue that people notice is that the

English language is highly unbalanced so we have a few words that appear almost

all the time and most words don’t appear much at all

and so this makes the co-occurrence matrix kind of unstable it’s not very

like you know you have some words some numbers that are way too big and then

everything else gets squashed okay so we just do the typical tricks we usually do

in this kind of scenario we do some waiting we divide by the frequency and

then you know we did a typical trick we do and we don’t want big numbers to be

too big we just take the log all right and so now it looks kind of like a

mutual information pairwise mutual information matrix and then one issue

people realize when they actually started working with this matrix is okay

so if I have limits that basically don’t occur at all it’s basically zero so I

look at this weighted co-occurrence matrix that values close to zero that’s

fine I take the log of it becomes something

negative very very very negative and so that actually starts throwing things off

so when you barely barely have enough samples you have very very large numbers

and so what they do is they just threshold it at zero so in reality what

you’re really working with is the PPM fly matrix so the positive paralyze

mutual information use matrix and this is what we’re going to use to represent

similarity between words okay so then take this matrix and I do some
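A compact sketch of that pipeline, computing a PPMI matrix from raw co-occurrence counts (the smoothing details vary across papers; this is one common variant, not necessarily the exact one used here):

```python
import numpy as np

def ppmi(C):
    """Positive pointwise mutual information from a co-occurrence count matrix C."""
    total = C.sum()
    p_ij = C / total                       # joint probabilities
    p_i = C.sum(axis=1, keepdims=True) / total
    p_j = C.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))   # divide by frequencies, then take the log
    pmi[~np.isfinite(pmi)] = 0.0           # zero counts would give log(0) = -inf
    return np.maximum(pmi, 0.0)            # threshold at zero: the "positive" in PPMI
```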

Okay, so then I take this matrix and do some kind of factorization: some way of making it into a UU^T structure. Each row of the matrix U is then my word embedding. This technique is pretty much how most word embeddings are created, in some way or another. Getting there means solving a non-convex optimization problem, so how you actually go about solving it will affect the quality of your solution, but ultimately each of these different methods is trying to get to the same place in a different way. In the past people looked at the raw counts matrix, and more recently at the PPMI matrix; but whether you're doing an SVD, a shallow neural network, or a least squares problem, they're all trying to do this kind of matrix factorization, and they get different performance on different tasks because they use different ways of getting there. So that's how the static case works.

Now what we're trying to do is concept evolution: we're trying to understand how words change meaning over time. The simplest way to do that, it seems, is to take the corpus, slice it into time frames, and for each time frame just do the same thing we did for the static embedding: some kind of factorization, then look at the vectors and see what happens. Okay, so unfortunately, this will not work.

so what’s wrong with why won’t this work so there’s some kind of physically

rotational variants when it comes to matrix

factorization so basically you can think about it as if I told you this the

distances between all the different nodes in a configuration that tells you

the configuration but it doesn’t tell you the rotation or the translation

right so if I take two different time frames I do this kind of matrix

factorization the meeting may not have changed but the vector will change quite

a bit if I don’t account for this rotational invariance I suspect over

here jumped over here but in fact if the meaning didn’t change so dynamic where

representations most of the works they differ by basically telling you how they

actually go about enforcing this smoothing so dealing with this

rotational invariance that’s basically that the main mathematical part of it so

I'll go through a couple of different examples. The first way people do it is to basically say: give me the word you want to look up, and I'll do that kind of rotation in real time. They have one corpus which is very well sampled, and that's their base-time corpus. Now you give me a new year: instead of 'cat' in 1990, I want 'cat' in 2016. I look at the word 'cat' and the words close to it — maybe 'dog', 'turtle', 'pet', all these different things — and then I take this configuration and rotate it back to the base time, and the hope is that for the most part the configuration is the same: a couple of words may have changed meanings, but most of the words have not. How am I going to do this rotation? I solve an optimization problem: I basically minimize this distance over orthogonal matrices. This problem is called a Procrustes problem, and it actually has some interesting history. Procrustes was this Greek, I guess, highwayman who used to go around collecting people, basically convincing them to come to his home. In his home he had a bed of an odd shape, and he would tie people down to the bed, and if they didn't fit he would cut off their arms and legs, or he would stretch them out. So he was a great serial killer, and mathematicians decided this was the best name for the problem. So basically this method works by solving a series of very small-scale Procrustes problems in real time.
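The orthogonal Procrustes problem min over orthogonal R of ‖A − BR‖_F has a closed-form solution via an SVD; here's a minimal sketch for aligning whole embedding matrices (not necessarily the talk's exact per-word variant):

```python
import numpy as np

def procrustes_align(A, B):
    """Find orthogonal R minimizing ||A - B @ R||_F, then return B rotated."""
    U, _, Vt = np.linalg.svd(B.T @ A)  # SVD of the cross-covariance matrix
    R = U @ Vt                         # the optimal rotation
    return B @ R
```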

real time okay so that that’s nice for somebody’s let’s first of all every

problem you need to solve is really small right it’s just the word in its

neighbors is really not that big but it’s that because you have to deal with

stuff in real time right every time I want any a word I need to do this

computation again so it can be a little snow slow and also it runs into problems

if for some reason this base time corpus wasn’t good enough if you didn’t sample

it well enough then you might have some noise here so you won’t get a good

meaning but it’s it’s definitely one valid way of doing it sort of the the

The medium approach is, instead of rotating one word and its neighbors at a time, to rotate the entire matrix at once. I only have to do it one time and then never again, so it's not a real-time computation. It's a slightly bigger problem to solve: instead of a word and its neighbors, we're looking at the entire vocabulary. But there's another issue: this doesn't really take care of the undersampling problem. Say we want to look at these three time slices: I take this slice and rotate it to align here, then take this one and rotate it to align here. What happens if this was a badly sampled year? If this was a badly sampled year and I forced this slice to fit it, this one will have a lot of noise, and then this one inherits that noise, so you see a lot of issues in the alignment between t = 3 and t = 1, because you're forcing things to fit a noisy configuration. But this is also very reasonable, and it actually works very well too.

remember that we’re going to push is this method here so we’re gonna solve

this much larger optimization problem and instead of solving a series of

Procrustes just going to do it all at once we’re

going to sort of deal with all of these alignments all at once and the main

reason is because now we’re not forcing these things to be exactly the same

we’re just forcing them to be close and this actually will help a lot with this

issue of what happens if one of these years is under sampled so of course we

have trade off so the trade off is that now instead of solving smaller problems

we are solving the biggest problem all right so this one was sort of maybe

but this one is very big we’re looking at all the time skills and so the hope

is that we can kind of come up with some way of doing this in a large-scale

setting and actually the solution to that is quite simple what we do is we

end up using black coordinate descent so this is the problem we want to solve

What we do is first reformulate the problem to break the symmetry. If you're familiar with block coordinate descent: for each of these variables we pick one time slice and minimize with all the other time slices fixed, solving for just that one slice. Doing that for the alignment terms is fine, since those are quadratic; doing it for the factorization term is really hard, because that term is quartic, and finding the solution of a quartic polynomial is much harder than finding it for a quadratic one. So we break the symmetry: we take one of the U's, turn it into a V, and then encourage U to equal V. It's a very common trick: you break the symmetry, add that regularizer tying U and V together, keep the two smoothing terms, and then add two more regularizers for good conditioning. The resulting problem for one time slice is actually fairly easy; it's just the solution of a least squares problem. So we do this, with some grid searching over the different parameters (tau, lambda, and gamma), and we run it for however many epochs it takes until the solutions look nice. This is the way we're actually going to do our dynamic word embeddings.
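As a sketch, the joint objective being described has roughly this shape (my reconstruction from the verbal description, with Y_t the PPMI matrix of time slice t; the exact form and placement of the regularizers may differ from the slides):

```latex
\min_{U_1,\dots,U_T}\; \sum_{t=1}^{T} \left\| Y_t - U_t U_t^{\top} \right\|_F^2
\;+\; \lambda \sum_{t=1}^{T} \|U_t\|_F^2
\;+\; \tau \sum_{t=2}^{T} \|U_t - U_{t-1}\|_F^2
```

After the symmetry-breaking step, each U_t U_t^T becomes U_t V_t^T with an extra penalty of the form gamma * ||U_t − V_t||_F^2, so that solving for any single block with everything else fixed is a least squares problem.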

Now we apply this to a corpus. The corpus we pick is the New York Times, from 1990 to 2016: about a hundred thousand articles across 59 sections. We go through and do some pre-processing, removing stop words and rare words, and ultimately we end up with a vocabulary size of about 21,000.

Okay, so to give some results: this is the concept evolution of the word 'apple' from 1990 to 2016. What do we notice right away? We had really thought that the New York Times was a very neutral corpus — we figured it would just tell us the meaning of the word — but it's a little corpus-specific. In the New York Times, when you're talking about apple in the 1990s, it's recipes: it's about home and family life, so most of what you see here are ingredients. Then over time the neighbors turn into things like tablet, iPod, iPhone, and then Google and Microsoft. Similarly with Amazon: it starts out with the rainforest, then goes into e-commerce, and nowadays it's more about content consumption. So you can actually see these things shift over time.

One really fun thing to do is to see what happens with people. Obama, up through 2006, is associated with his more academic self, the university professor, and then later on more with his presidential life. President Trump similarly starts out with real-estate-type associations, where he occasionally made the news, and then more recently he's been making the news in many other ways. It would actually be quite interesting to extend this to 2019, but unfortunately we did this project in 2016; it would be fun to see what has happened since.

One thing we were actually able to do in this project is change point detection. In data mining, one thing we care about is not just how a concept evolves gradually, but when the concept changes from one thing to another — for example, for presidents. Basically, what we did here is notice that the vector norm is very closely correlated with the frequency of the word, but, because of our smoothing terms, it's much less noisy than frequency when used for change point detection. In the past, you would do some count-based method: you'd count how many times these words appear and say, okay, we think this is when the change point happened, but it's hard to tell because of all these spikes. Another issue that comes up is that if you're looking at concepts that differ in magnitude, you tend to see one concept completely overshadowing all the others, which can actually make them disappear. We found that the vector norms we compute are much more robust against this, so we can actually detect when a president occurred: this is Bill Clinton first, then Hillary Clinton over here; we see the first Bush, the second Bush, and Obama and Trump rising at the end. Similarly, we can detect these concepts rising and falling a little more steadily.

So this looks like it might actually be a very useful tool for change point detection in data mining. Actually, the task we ended up trying — before deciding we didn't want to deal with a more complicated optimization problem — is this one: we give you a word, say Obama in 2016, and ask you for the closest word in every other year. From that we can basically detect the current president for all the different years, and this is pretty robust against parameter tuning or the number of epochs or anything like that; we can almost always recover the correct president. We did the same thing with the New York City mayor, and we recovered almost all of them: in 2006 we didn't get any people's names at all, and in 2011 we got Cuomo, who is the governor of New York, so it's actually pretty close.

We were pretty surprised at this; we thought, oh, we really created something magical. But we thought about it for a long time, and there may be one reason why this is happening: the New York Times is a pretty respectful newspaper. When they say a president's name during his presidency, they always preface it with 'President', so all these names are being pulled very close to 'president' over time, and the same thing happens with the New York City mayor. So we thought, okay, we need to give it a real challenge: a question that doesn't have a title associated with it. So we asked it about tennis, since we never refer to tennis players with titles. We gave it a question — the ATP number-one-ranked male player — and asked who that player was for each year. In bold are the correct answers; the non-bolded ones are not correct, but they're all tennis players. So it got close: it wasn't answering the question exactly, but it was giving a ballpark answer in each case. This was, I guess, the more honest question-answering test that we put it through.

Okay, so to give a conclusion for this project: word vectors are a way of representing words in a geometrically intuitive way, and they're usually used as pre-processing for natural language processing tasks. The way to compute them is basically some variant of matrix factorization on a similarity matrix, which is usually some transformation of the co-occurrence matrix. We built dynamic word vectors, which can be used to track concept evolution; we do this by computing word vectors over time slices, but we have to do something to correct for rotational misalignment. This applies not only to dynamic word vectors but also to cross-corpus word vectors: if we're trying to look at the similarity between two words from two different sources, say the New York Times and MSN, we'd have to do the same sort of thing. And then we showed some results on our corpus, where we can look at word trajectories, change point detection, and question answering. I'll pause here to ask if there are any questions on this.

[Audience question, partially inaudible.] Yes, that's what we did — although the results I showed are what we'd call qualitative results, we do also have quantitative results. Quantitative results are a little trickier in NLP, because usually you propose a benchmark, make it public, and then validate on it, since there just aren't that many test benchmarks out there. For example, one thing we did was look at the association of words to each other according to which section of the newspaper they appeared in, so we have two completely different ways of associating words, and then we see how well they match up. We can do this for a number of questions.

[Audience question about reinforcement learning, partially inaudible.] So I think if the task is actually question answering, reinforcement learning is exactly the way to go. We were using question answering more as a way of evaluating how good our embeddings were, so we wouldn't really want the model to cheat by being able to see samples of the questions; but definitely, for question answering in general, that would be the way to go. Thank you.

So now I'll move on to the second project. This project is changing gears a little bit: it's more about machine learning interpretability and understanding how solvers are actually working — finite support identification and proximal methods. The goal here is to solve a sparse optimization problem, which we frame as minimizing f(x) + lambda * ||x||_1, where f is a convex, L-smooth function; smooth meaning the gradients can't differ too much: if x and y don't differ too much, their gradients can't differ too much either. Lambda here is some regularization parameter provided to us by the application, and the last term, the l1 norm, is the sum of the absolute values of the entries of x. Here f(x) is usually imposed as the loss function, and the l1 term is used as a kind of regularization for model complexity.
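Written out, the problem and the smoothness assumption just described (my transcription):

```latex
\min_{x \in \mathbb{R}^n} \; f(x) + \lambda \|x\|_1,
\qquad \|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2 \;\; \forall x, y,
\qquad \|x\|_1 = \sum_{i=1}^n |x_i| .
```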

Problems of this form appear in many applications, like signal processing, image recovery, and bioinformatics. In machine learning, this kind of problem is used to prevent the model from overfitting: we know that if we're trying to separate X's from O's, we can do so by drawing a straight line, or we can use a really curvy line that perfectly separates the two data sets, but we actually want the simple model, because that's the one that's going to work better on data we haven't yet seen. The curvy one would fit our training data, but that's not really what we're interested in. There are a lot of ways of reducing model complexity, but the one we'll focus on here is adding this regularizer, which forces x to be sparse. If you're forcing most of the vector to be zero, you're forcing model simplicity, but you're letting the algorithm choose which entries are zero and which are nonzero. An application where sparse optimization is useful is feature selection. In feature selection, you're not actually trying to optimize the solution and figure out what the final vector x is; you're just trying to understand the support: you just want to know the indices i where x_i* is nonzero. Why would we want that? In gene expression, maybe I just want to know which genes are responsible for some kind of phenotype: I want to know which genes cause breast cancer, and I don't really care how strong the effects are, just which genes they are. Or drug efficacy: I want to know what kinds of drugs to look into to prevent some disease; I want a very small number of drugs, not a huge cocktail, and at this point maybe I don't need to know their levels. In object detection, in this image I only need one dot here, one dot here, and one dot there; I don't need to know what's going on in the rest of the image. So there are a lot of sparse optimization problems where I don't actually care about the result, just the sparsity pattern.

what’s called a proximal method so the putzel method relies on this proximal

operator here which is a vector to vector map and so it’s basically the

solution to this optimization problem but for our purposes we just care about

the one norm here where the proximal operator for the what norm is acting as

a shrinkage operator so take the vector it looks at every element if that

element is greater than T it subtracts T if the element is smaller than minus T

it adds T everything else is squashes to 0 so it’s trying to squash things close

to 0 so you can see in this sort of two dimensional example if you haven’t slept

all the way out here you’re pulling it back at the x equals y direction over

here you’re pulling it and hits the axis and then it snaps at zero so you can see

why a method that uses a proximal operator will generally get you far

solutions right so it tries to get to the ear once it gets to 0 it just snaps

at 0 and stays there so we’re going to try to characterize the snapping
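For reference, a minimal sketch of the shrinkage operator described above (the proximal operator of t times the l1 norm):

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t*||.||_1: shrink each entry toward zero by t, snapping small ones to 0."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

print(soft_threshold(np.array([3.0, -0.5, 1.2]), 1.0))  # [ 2. -0.  0.2]
```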

Okay, so proximal methods are methods that repeatedly call the proximal operator on something that depends on the past iterates, and there's a large family of them. The one we see most often is the proximal gradient method; if h here is a constraint indicator, it becomes the projected gradient method, so they're basically the same method. This method is used very often in practice, and there are variations on it, for example using acceleration, and you can also look at things like the Douglas–Rachford method and ADMM; these are all types of proximal methods. One thing we've observed in practice is that when you run any of these methods on some kind of sparse optimization problem, most of the time they will identify the sparsity pattern very early. If you look at the error in the residual, it keeps going down forever; if you look at the error in the support, it hits zero at some finite time and stops, and at that point you've discovered the correct support and it stays there. This is great for us, because it says: if our task is just to identify the support of something, why don't I just run a proximal method for 100 iterations and stop? As long as I know that in 100 iterations I have the right support, I don't need to run it to completion. So this is what we're trying to do, and often we only care about the red curve.

Here's another example. In this task we're trying to separate fours from nines in handwritten digits. We won't do much pre-processing here; we just take the raw images and apply sparse logistic regression. Here I've plotted the objective error, and at each point we can also look at the features: the features are laid out in two dimensions, because the data points are two-dimensional images, so you can actually see which features are being selected. After a while you start to see this one feature coming up the most. That feature corresponds to the ridge that distinguishes a four from a nine — looking at fours and nines, this region right here is actually really important — and that's what's starting to light up. We notice that we're able to discover this pretty early on. The titles of these images contain a train error and a test error, and we notice they level off very early. In machine learning there's this problem that we don't really know when to stop, because we know these performance metrics tend to level off pretty soon, even when the objective error hasn't gone to zero. I think it's still an open problem, but one thing we do know is that we can tell you when the support has been identified, and that's a good indication of when we should stop, because the support is actually identified pretty early on.

so now we’re going to try to characterize all of this mathematically

so we’re going to say that a method with iterates XK 2 X daughter identifies the

support at iteration Bar K and for all K greater than Bar K X I K is equal to 0

whenever X I star is equal to 0 and it seems like a little bit of a one-sided

definition of support but actually it’s the side that we can prove so the
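In symbols (my transcription of the definition just stated):

```latex
\exists \, \bar{K}: \quad \forall k \ge \bar{K}, \;\; x^*_i = 0 \;\Rightarrow\; x^k_i = 0 .
```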

question we’re gonna ask in this in this in this part is is it true that all

proximal methods identified with support and finite iterations can we actually

show that this is probably true and the answer is often it’s true but sometimes

it’s false and if we actually looked at the literature we’ll see basically all

the authors they take a method they analyze it you know to death they

introduce things like differential geometry and all these different tools

and they tell me something about that method but in fact we can say something

across the board you don’t really have to look at each method separately we can

actually say something just knowing that they’re all using the problem operator

and this will help us a lot because it’ll help us understand not so much a

method dependent metric but a data dependent metric on when the support can

Okay, so the key thing we need to know here is how to actually solve this problem: what are the optimality conditions for this optimization problem? Let's look at each individual index of x separately. We take one x_i and say: at the solution, x_i* is not 0. If it's not 0, then the l1 term there is smooth, so I know that the gradient of the whole objective is going to be 0 at the optimum. In this first case, we know the gradient is 0; we set everything equal to 0, do some algebra, and get that, if we define z* as the negative gradient of f, then z_i* is −lambda when x_i* is negative and z_i* is lambda when x_i* is positive; that's just done by taking everything and setting it to 0. Then what happens if x_i* is 0? We're at the cusp of the absolute value, which is being tilted this way or that way by the smooth term. But you can't tilt it too far, because if you tilt it too far then the minimum is no longer at zero: it rolls off. How far can you tilt it? From −lambda to lambda. These three conditions exactly quantify that x is optimal: if x* satisfies all three conditions, we know it's the solution of the problem. We can actually plot it: we have x_i* on one axis and z_i*, the negative gradient, on the other. If x_i* is positive, then z_i* must be all the way up at lambda; otherwise it must be all the way down at −lambda; and if x_i* is zero, it has to be somewhere in the middle, but we don't know exactly where. So if you give me an x, I plot x and the negative gradient, and it looks like that, then I can tell you for sure that this is the correct solution; we don't have to do any more work.
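Collected as equations, with z* = −∇f(x*) (my reconstruction of the three cases):

```latex
z_i^* = -\lambda \;\;\text{if } x_i^* < 0, \qquad
z_i^* = \lambda \;\;\text{if } x_i^* > 0, \qquad
z_i^* \in [-\lambda, \lambda] \;\;\text{if } x_i^* = 0 .
```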

correct solution we don’t have to do any more work okay so remember that we give

and here is basically we would call it the wiggle room lemma we sort of tell

you that this is sort of how much air you’re allowed I have and still be able

to identify the support correctly so we have this sort of data dependent

quantity so Delta min and basically we say that for any algorithm of this form

here then we noted at some iteration K the current support is identified if

this residual here is smaller than this data dependence constant okay so this is

sort of a mouthful let’s look a little bit intuitively what that means okay so

that’s the condition we need to be true at some event so we’re going to define

Delta I at the gap between any of these the eyes to the end right so we knew

that if X I star is here is nonzero then this gap is zero so that doesn’t that’s

not really interesting when X I star is 0 there’s some gap and we’re going to

measure that gap and the small is one of these gaps is Delta min then at some iteration k we have a noisy

X and that gives us a noisy K it’s the noisy X how much noise can we still have

and still have the condition be satisfied well if X is X I start at zero

we are not allowed any noise like any a lot of noise and it could sort of break

their condition but if X is nonzero there were allowed exactly Feltham in

amount of noise if we have just that little bit amount of noise we can note

that for whatever X I star is zero the condition is still met and so then at

that point the support is identified so basically so basically what this

condition is saying is that if basically arlynn toys and the gradient is smaller

than this Delta min then we have identified the support okay so how can

we actually use this lemma to give some kind of support identification

complexity rate okay so how we actually use this lemma so this is the condition

This is the condition we need to be true; we take our method and insert y_k according to the algorithm. If we're doing proximal gradient descent, y_k is just the current iterate minus a gradient step. We plug it into this part, rearrange terms, and see that this term breaks into one term that's the error in the variable and one term that's the error in the gradient. This already tells us that finite support recovery will happen whenever delta_min is positive, because we know that x_k eventually converges to x* and the gradient eventually converges to the true gradient; convergence basically means that at some point the amount of error has to be smaller than this delta_min term. To figure out how long that takes, we just go through the literature and find our favorite rate of how x_k actually converges to x*, and we use our smoothness assumption to bound the gradient error. For example, in proximal gradient descent we know that if f is strongly convex, we have this kind of linear convergence rate; we invert it to get K-bar, and we see that the active-set complexity — the support identification complexity rate — is exactly on the order of the log of delta_min over the log of the rate constant. Here we're actually able to recreate one of the previous rates using many fewer lines, just by exploiting this wiggle room lemma. And this is how it works for proximal gradient descent; we can additionally use the same trick for all the other proximal methods, which is really the strength of this technique. We're able to do it for accelerated proximal gradient, Douglas–Rachford, ADMM, and also proximal Newton, which probably hasn't previously been explored. And, just to make everything fair: the previous works that looked at each of these methods looked at each one separately, and they've given a lot of different rates on a lot of different aspects of these methods; we're only looking at support identification, and because we're looking at just one small piece of the puzzle, we're able to do it for all of these methods in one broad stroke, with much simpler proofs.
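To make the inversion step concrete for the strongly convex case (a sketch: C and rho are whatever constants your favorite linear-convergence rate provides):

```latex
\|x^k - x^*\| \le C\rho^{k} \;<\; \delta_{\min}
\quad\Longrightarrow\quad
\bar{K} \;=\; \Big\lceil \frac{\log(C/\delta_{\min})}{\log(1/\rho)} \Big\rceil .
```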

to scale up right so most of us are probably not even using gradient descent

anymore we’re just using the Kafe gradient descent so this is work in the

stochastic gradient descent setting so now instead of taking gradient steps

we’re now taking stochastic steps so basically what does that mean at each

iteration we’re going to so the problem we’re trying minimize is actually

composite of different convex and smooth functions we’re going to pick one of

these uniformly at random at each iteration we’re going to take a gradient

step and then we’re going to shrink it and we know that basically that this

method will confess sorry so we don’t have this method will converge but we

have an additional sort of constraint which is that this step size has the

decay rights the cat’s-eye gradient descent the step size half a decay

otherwise we won’t Kerch okay so we’re going to try to do the same trick that

we did previously we’re gonna write out our residual term

we’re going to organize it in terms of an error in the variable in the air and

the gradient and we’re going to say well if both this term goes to zero so

basically if this entire gradient term convergence higher variable error term

converges then we’re fine so does this term converge in general so

unfortunately no right so we know that in general so take for example f of X

goals X minus one squared plus X plus one squared then we know the solution is

x star equals zero but we know that the gradient of this part will be minus one

the gradient of this part will be one the gradient of the whole thing will be

zero right so in fact they can’t possibly possibly actually converge even

when you’ve actually reached a true solution so this is a big problem with

So this first term is not true in general. The second term is actually also not true in general: previously we said that the variable will always converge to the true solution, but that was with a constant step size. When we use a decaying step size, and the decay is faster than the convergence of x_k to x*, this term also doesn't go to zero. So in stochastic gradient descent we're pretty stuck: it doesn't really seem easy to prove any kind of support identification. And in practice we find that this is completely consistent with our observations: support identification does not happen for proximal stochastic gradient descent. Here we did a very small example: using constant step sizes nothing converges — you must have a decaying step size — and then you can see that the variable error will converge but the sparsity error will not, and that's the same no matter what kind of decay rate you use. One thing to clarify here, a question you might have: if the variable error goes to 0, why doesn't the sparsity error go to 0? Well, if this is the true sparsity pattern, you might have something like this: a very, very small number here. The number might converge to 0, but it will never actually reach 0. The first-order deterministic methods we looked at before will actually reach exactly zero in finite time, but stochastic methods just keep getting closer and closer without ever reaching 0.

So are we lost — is there no hope? Luckily there is hope: there's a newer stream of methods coming in, these variance-reduced stochastic methods. The way they work is that, instead of always taking these partial gradients, every once in a while they take a full gradient, and because of that they're able to reduce the variance in each of the steps. You might say this is cheating: if I'm allowed to take a full gradient, why don't I just do full gradient descent? But they arrange it so that the number of full gradients you take is an order of magnitude fewer than what full gradient descent would need, while still achieving good convergence. Anyway, these methods are coming out now, and there are a lot of advantages: when you have this variance-reduced behavior, you can use a constant step size, and you also get support identification, and you get it fairly quickly. That shows you can actually scale up and still get support identification, and we're able to prove it, also using the wiggle room lemma, for the variance-reduced methods but not for stochastic gradient descent. So this shows that our proof technique is consistent with what's observed in practice.

right so we’re trying to do is we’re trying to solve this first optimization

problem where we know the solution is part of mostly zero we’re looking at a

family of methods these proximal methods that we have observed in practice tend

to lock onto the zeros early on right they tend to be able to identify the

support early on and so we characterize this using this wiggle room lemma which

we basically show using that geometric proof and then from that level were able

to give the finite support complexity bounce for deterministic proximal

methods and festa Castaic variance reduce proximal methods but opera

stochastic gradient descent and said we just offer some intuition as to why

cannot happen so this is sort of the second project so I’ll pause here to ask

if there any questions here yeah so this is yes yeah so my question goes back to the

Audience: My question goes back to the slide about the digit classification, where you were trying to see the minimum number of pixels you need to classify between a four and a nine. There has been some work on this same thing — I've read about recurrent attention models. Have you compared your results with the attention models? They use reinforcement learning to do image classification, trying to find the least number of pixels required to classify a digit.

I'd guess that actually works a lot better. This is less about trying to come up with the best selector: this is a very simple method, something you can run in two or three lines, and people use it mostly because it's easy to run. But if your goal is really to identify the sparse support set, something like that would work a lot better, especially if your images have some semantics you're trying to capture.

Audience: When you show the number of iterations, what does it really mean? What is 100 iterations for, say, the digits example — what are we doing, and why are we stopping there?

I guess this chart is more about the relative difference between three hundred iterations, when the support is identified, and a thousand iterations, when your solver might actually tell you this is an okay place to stop.

Audience: What is the input, and what's the output?

The input is the 60,000 MNIST digits — well, restricted to fours and nines, so maybe eleven thousand of the digits — and the output is 1 if it's a four and 0 if it's a nine. And yes, if I used a different data set or a different step size, I would get a different number of iterations before any of this happens, so that's definitely a good point to keep in mind. I think the relative difference between when the support is identified and when the residual goes to zero is maintained, but it's true that the raw numbers are somewhat meaningless, depending on the data set and the step size. Thank you for the questions.

So now I'll talk a little bit about my vision for future projects. In the first project we saw how this kind of pre-processing trick of computing word vectors significantly helped machine learning in the natural language processing world. First of all, in general, real-world data is very diverse, and we actually do a lot of pre-processing for a lot of different reasons, whether it's anonymization, encoding for robustness, or representation learning. We know that what we actually end up feeding into our classifier — the distribution of that data — will affect how well the classifier works, whether through things like separability or incoherence, and we would like to be able to characterize these better. One thing we actually realized in the second project is that we have this data-dependent constant, delta_min, which unfortunately depends on the solution of the problem. It's data-dependent in the sense that it does not depend on the method at all, but you can't ascertain it without actually solving the problem first, which is a little bit unsatisfying. So one thing I think really should be done is to try to understand how each of these constants depends on the data. This is basically a very open question: how do things like delta_min, or strong convexity, or all these different constants relate to things like the margin or the incoherence of the data? There's been quite a bit of work on this already — papers since 2010 look at things like restricted strong convexity or pseudo self-concordance, data-dependent ways of characterizing how well the optimization will do — but I think there's a lot more work to do in this area; it's a very open question in general.

And, as was mentioned earlier, reinforcement learning problems, and online optimization problems in general, present a very interesting twist on optimization: we're trying to optimize an objective function, but f changes according to the data, and you only observe it one sample at a time. So really, the goal here is not just to tell you how fast a method converges, but how many samples we have to see before it converges. This is a very important problem, spanning reinforcement learning, boosting, and game theory, and it's one of those areas where we know there are a lot of really nice techniques that can get us a good solution without seeing everything, but what we have is more an intuition of how to get there than a theory. Just as an example: in reinforcement learning, we're looking at things like learning a policy or a Q-function which is parametrized by a sparse sum of basis functions. We want to know things like: what kind of basis functions should we use to characterize the whole state space? Can we actually offer convergence guarantees based on sparse observations? How many observations do we need for each kind of basis? I think this is still a very interesting open question.

And of course everyone is interested in deep learning — we should be interested, because it's very much becoming such a big tool, and there are a lot of very interesting optimization questions here that are not yet solved. I don't know if you know about it — I think it's the 2016 test-of-time talk — where Ali Rahimi gave a whole list of things about deep learning that we sort of know but don't really know. Things on there like batch normalization, or dropout, or other things I've seen people use a lot, like minimizing entropy to try to get clustering behavior: these are all things that are not really standard optimization practices. While we understand a lot about how two-norm regularization and one-norm regularization give us reduced model complexity, we don't know how these other things reduce model complexity, and these are definitely things we should look at more. I think even the simple question of depth is not very well answered: we know that deeper networks have higher model complexity, but in general people don't prefer just to have a really deep network; they need some trick to train it, otherwise it's impossible to train. So they do these things — memory elements, attention networks — and those things are not characterized in the fundamental theory of the expressibility of deep neural nets. So that's something to look at.

Probably the most popular optimization problem in deep learning today is these generative adversarial networks, the GANs. We know that GANs can generate very interesting pictures from distributions, and they're framed as a saddle point problem, or a two-player game, where one player offers some images and the other player decides whether each image is real or generated, and that's supposed to make both players better. But what usually happens is that the first player produces noise and the other player correctly classifies it as noise: this is technically a valid solution, but it's totally useless. So the open question is how you can actually go about constructing these GANs to avoid this mode collapse problem; that's a very important optimization problem.

So, thank you for coming to this talk.

Audience: I'm curious to know your comments on model complexity versus real-time updates, like the over-the-air updates that have recently become popular in things like [inaudible], where you have a model on the server and you want some real-time updates. Do you focus more on the complexity of the model, or on the update latency, more than the accuracy?

Sorry, I'm actually not very familiar with this recent work. How does it work?

Audience: To put it in a simpler way: would you prefer the accuracy more, or the latency more, in terms of real-world use cases?

I would have to ask the machine learner which one is more important. That is: is the new data actually important, maybe because the user is reacting to the way the model behaves, and therefore it's very important? Or is it just epsilon more samples, in which case it's not that important?

Audience: I have a question about the anonymization you mentioned. Typically in healthcare they have this issue of anonymization. How would you deal with sparse optimization and anonymization in this case? Because typically, if you can extract the vectors and the directions, there's still information there that could let you retrieve an initial guess of what the data was.

Yeah, so one thing we did look at was sparse sketching matrices, though I'm not sure they can effectively do what you're saying. Sparse sketching is sort of the equivalent of multiplying A by a Gaussian random matrix, but where you don't lose sparsity, so you still have computational efficiency. It's still an open question to me whether that solves the privacy issue, because maybe you can still recover the data — it certainly wouldn't be easy to recover, but you still have the relationships between the data points: if I had two patients who are similar, that would show. You can also apply it on the data side: you have your data feature matrix, and you can multiply by a random matrix on one side or the other.

Hopefully, multiplying on that side, it seems like you would scramble up the users at least.

Audience: These finite support identification methods — do they also apply to l1-constrained gradient methods like Frank–Wolfe?

Frank–Wolfe does have this property, but it's not the same proof. It does have sparsity identification, more because it's taking atoms one at a time, and it stops when it has enough atoms, so usually it will pick the right ones first. So the analysis would not be the same; that's actually one thing we're looking to extend to, but it's not the same because Frank–Wolfe doesn't actually use a proximal operator — it doesn't shrink, so it comes at the problem from the opposite direction. But yes, it's very useful for this type of thing.

Host: If there are no more questions, I'd like to thank our speaker.