Ophir Frieder – Searching Harsh Environments


[music playing] – Welcome to the 2016
NASA Ames Summer Series. If a tree falls in the forest and no one is around to hear it, does it make a sound? Another way to look
at that question: if a species is erased or a data point disappears and we do not have
a record of it, did it exist? Or does it have any value in the results of what we seek? Today’s presentation is entitled “Searching Harsh Environments” and will be given by
Dr. Ophir Frieder. He is a professor
of biostatistics, bioinformatics,
and biomathematics in the Georgetown University
Medical Center. He is–he also holds the Robert and Catherine
McDevitt Chair in Computer Science
and Information Processing. Besides that, he also holds
an external position as the Chief Scientific Officer
of UMBRA Health Corp. He received
a Bachelor’s of Science in computer science
and communication studies, a Master’s of Science in computer science
and engineering, and a PhD in computer science
and engineering, all from
the University of Michigan. He is a Fellow of the AAAS, the ACM, IEEE, and NAI. Please welcome– please join me in welcoming Dr. Frieder. [applause] – Good morning, which is, for me,
a little odd to say considering it’s really
good afternoon. But for you it’s good morning. Thank you for being here. I see the lights
are nice and shiny. I’m not used to bright lights, but we’ll do what we can do. So today’s talk is about
searching in harsh environments. And before I came here, I looked at the variety of
talks that are gonna be given through the summer series. And this one is not quite
in the same spirit as some of the other ones. This will be a little different. This–but we’ll see if,
hopefully, it’ll pique your interest or at least something. All right, so first of all,
I have to correct a well-known myth. How many of you use Google? Okay, considering I’m like
a stone’s throw away, it would be pretty sad. How many of you basically
use Google today? Okay.
No surprise. Well, if you actually
give a talk in any search community, what you’ll hear is that
Google solves it all. And for the most part, they actually have solved
quite a bit of it. But it isn’t exactly the truth, because Google really solved
computerized data. And a lot of data
isn’t computerized. It’s digitized, and I’ll show
you what I mean in a second. And more so, Google is hardly
a social media player. How many of you’ve actually used
Google Plus? Wow, there’s actually
three of you. I’m impressed. I’ve given these talks
in large audiences and not even a single hand, so obviously, Google Plus
didn’t quite make the splash they were hoping it would make, and obviously, it’s– social media’s here
all over the place. So Google is hardly
a social media player. So what am I gonna
talk to you about today? As you can see, I’m gonna talk
about a gamut of things. And the way I usually talk–
give a presentation is, I usually pick something that
I consider as more engineering, meaning that the solution
to a real problem exists– I put it together– by “me,” I say my students
actually put it together– and it’s basically in real use. To the other extreme
is research, which is something that we
hope to do in the future; we hope to get it to go
into applied practice. And I’ll talk to you about
searching and mining of social media. But first, I’m gonna start
talking to you about what is complex document
information processing. And by “complex information,” I try to find something
that would relate to NASA along the way. Now, I got to tell you
that I’m not an expert in anything that NASA does. In fact, I’m not even a novice. But I do realize that when you
actually look at the brochures and you actually look
at some of the stuff that you find online, you talk about
the “Mission Assurance System.” And they talk about having
timeliness of information– that it will get you
complete information as quickly as possible. Now, complete information
really means that you need to get it from all sources, not just the standard
computerized as you know it. So I’m gonna talk about that, and obviously, I’m gonna talk
about the research direction in this area, so let me– by the way, feel free to raise
your hand and stop me. All right,
so this is a complex document. When I talk about what
is a complex document, I talk about
documents like this. And as you can see, this is not
your standard web document or isn’t your standard
Microsoft Word document. It’s got stamps,
it’s got signatures, it’s got logos– it’s got a whole
variety of things, and that’s pretty complex. But this one’s
a little more complex. So this was
a government-funded project that I was working on, and this too has logos, and it too has a signature, and it also has
different types of components, including tables
and handwriting, and by the way,
can anybody read this? No one can read this. Okay, can anybody read it
from left to right? I’m glad nobody said yes, because in Arabic,
you read from right to left. But anyway,
the fact of the matter is that this is
a complex document. And you need to search that. Now, fortunately,
we know how to search it. We know how to play with it. Like most of the software
development today, your target is to be able
to utilize other people’s software
or freely available commodity software,
’cause if you actually tried
to redevelop your software, what you’re gonna end up doing is having lots of errors, if you ever succeed in getting anything done at all. So what we do is, we capitalize
on other people’s technology– freely available software,
freely available components, some more robust than others– and we try to get an answer. And we do know how to deal
with handwriting recognition, and we do know how to deal
with structured extraction, and we do know how
to deal with OCR, and we do know how to deal
with all of that. And the way we deal
with all that is simple: we take the document– so there’s, like, a document. We then basically
enhance the document, because you can see
it’s kind of bad. We layer it. We OCR the various components. We take all
the various components and we run it through all
the various software routines that we know. We put it together and then we try
to either search it or mine it.
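
In rough Python, the flow looks something like this– the extractor functions here are hypothetical stand-ins for those freely available commodity tools, not the actual system’s components:

```python
# A hypothetical sketch of the pipeline just described: enhance the scan,
# split it into layers, run each layer through its own extractor, and
# integrate the results into one searchable record.

def process_document(layers, extractors):
    record = {}
    for kind, content in layers:
        extract = extractors[kind]          # pick the right tool per layer
        record.setdefault(kind, []).append(extract(content))
    return record                           # integrated: search it or mine it

# Trivial stand-ins for OCR, handwriting recognition, and logo matching.
extractors = {
    "printed_text": lambda c: c.lower(),
    "handwriting":  lambda c: "[handwritten] " + c,
    "logo":         lambda c: "logo:" + c,
}

layers = [("printed_text", "Income Forecast"), ("logo", "American Tobacco")]
print(process_document(layers, extractors))
```
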
Now, obviously, I’ve claimed there’s a need to “enhance” the document, and I have to prove that to you. Taking the documents you give me is a given, and extracting stuff from them is a given. But what I still owe you is an explanation of why
I have to enhance it and why does integration
make any difference at all? So I owe you that explanation. So let me start with this. You can guess
which side is enhanced and which side is not. Right? I’m looking at people
who are, like, wondering– okay, you do know. Very good. So you can tell
which side’s enhanced and which side’s not,
and one– and you can tell
it’s a handwritten document. What this is
is documents taken from a diary–an old diary that’s gone through,
let us say, adverse conditions during World War II. And basically,
it’s stored in a museum. So what we have here
is basically a scanned image of it, and as you can tell,
you have to scratch it, and you–and you can’t
really read it. And even if you did try
to do OCR, forget it. You can’t get anywhere with it. But how about this one? Now you can get somewhere on it. This one’s even a little harder, because it’s got a composition
of handwritten and typed text. And this one shows you
that if you enhance it, you really can get somewhere. What you can look at is– if you look at the black
square–rectangle, you can’t really read anything
in the unenhanced one, and you can easily read it
in the enhanced one. Anybody want to try
to translate it? I could make it bigger. You speak fluent German? Well,
if you speak fluent German, then you can translate it. Ah, fair enough. I can translate it
for you if you’d like, but it’s a little hard
even for me to read, and I’m a little closer. Anyway, so you’ll buy the fact that you need to enhance,
right? You–you should enhance,
because if you don’t enhance things like that,
you won’t be able to process– certainly not with OCR. But I also owe you the proof
that integration actually
makes a difference– that you really do want
to break the whole into pieces
and sum the pieces back together. And the best way
to do it is this. This is me. Kind of like a business card. I don’t really use
business cards anymore, but it’s an old
historical artifact. And I have to ask you
the question of, what positions do I hold? Now, if you actually
look at it, and you did not deal
with the text– if you basically only deal with the components
that are not text, you cannot answer that. Well, that’s a given. But the real question is, at which institution am I at? And as you can see, without processing the logo, you couldn’t answer that. So without integration, you couldn’t answer either
of these two questions. So let me show you a little bit
of what we did. But before we do that, and I look at the age
of the audience, and this–it makes it
very clear for me– I have to address the notion
of technology. We built this prototype,
and by the time we were done, no one would use it, because technology moves
so quickly and so rapidly that it becomes out of date
fairly quickly. But unfortunately
or fortunately, what doesn’t come out of date are benchmarks. Now I’ll show you how sad– or inspiring, whichever way
you want to view it– benchmarks are. So, in early-on days, there’s some competition
called TREC. Anybody ever heard of TREC? Nobody?
Ah. More than the number of people
that use Google Plus. TREC is an international
competition for text. It’s basically not
a competition. It’s sponsored by NIST. It’s a bake-off where everybody
gets together. They submit–they give you
a set of queries, they give you a set of data, and they tell you,
“Go run your system on it. Submit the results.” You send it to NIST, and NIST basically does
the evaluation for you. It’s been running
for a long, long time. In 1993– which based on some
of the audience age here was probably before
some of you were born– in 1993, they basically had a very,
very large collection– a collection that was so large, they basically felt
that the academics can’t really do a lot with it, so
they also evaluated on a subset of it. So anybody want to guess what is
a “large” collection in 1993– a very large collection that
academics couldn’t handle? Order of magnitude. Was it–thinking terabytes? No, how about hundreds of tera– how about hundreds of gigabytes? It was actually two gigabytes. People couldn’t even
store it. And they had to rely on
a small subset collection which was 500 megabytes. If you take your iPhone
in your pocket or whatever device you have and think how much
storage you’ve got on it, and think of how many songs
you have on it. Well, anyway, that benchmark
from long ago
is still used
periodically today.
So we felt that if we were
gonna build any benchmarks, we actually needed
to build a benchmark that would stand
the test of time. But more the fact, we knew we had to build
a benchmark, because no such systems
as we described were available, and no–no way of
evaluating them was possible. So we had to spend
a lot of time creating our own benchmarks. And you know what
the cardinal rule of creating your own benchmark is? You’re–you’re guaranteed
to be the best. You’re also guaranteed
to be the worst, but you never say that. So we built a benchmark, and what are the benchmarks? In order to have a benchmark, we built a set
of characteristics. What are the characteristics
that we felt important? It had to vary in inputs. It had to vary in fonts. It had to vary in
graphical elements. It had to vary in everything– the whole thing had to vary– and key to it is, it had to be… free. How many times have you bought data? So when I was growing up and we wanted to buy music, we did something
that’s very archaic, called “we bought records.” You’ve seen records? Okay. So when we–
we would buy records. Later on, we would buy CDs. You’ve seen CDs? Now, let us say,
you “borrow” music from various different sources. You never pay for data. You don’t pay to
read the newspaper. You don’t pay to
search any media. It has to be free. So for anybody
to actually use it, you had to worry about
the copyright problems and you had to make sure that
if you solve the copyright, you solved it in such a way
that it was free. So we actually built a system– we built a benchmark
that’s free. Make a long story short, we built this benchmark. It was used. It’s about 7 million documents, 42 million TIFF images. It’s about 100 gigabytes of OCR after you’ve OCR’d it. It’s 1.5 terabytes in size. It’s still credible, and– most importantly,
it’s still used; it’s still there. If you want it,
you can go to NIST, and you can get it. Basically, what we did was, we actually came up with a way of actually searching
this collection, and it’s been used and liked. But suffice it to say, we built a simple
prototype system. It had very simple integration. These are all
the components to it. You just plug the
components together, and you basically run it. Now,
without going into details, let me show you why
this system should be of interest to you. So here’s a query for you. The query says you want a logo of the American Tobacco
Company. The documents have to talk
about an income forecast, and the income forecast has to be greater than 500,000. Now, if you look at it, the logo is pretty clear. And if you actually look at it, the words in the red box that you probably can’t see say “income forecast.” So far, pretty straightforward. But in the black box, it says “800,000.” Now, the reason I use
that as an example is, if you actually
did a text search, you wouldn’t find
this document, even if you found the logo, because a text document– when you basically
type in the number 500,000, what are you actually doing? You’re looking for a string
that says “500,000.” But in a database search, when you say “500,000”
or “greater than 500,000,” say, in SQL, you actually get results
that are greater than 500,000. So this document fits.
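
Here’s a minimal sketch of that distinction, with a made-up one-row table– the string search for “500,000” misses the document, while the typed comparison catches the 800,000:

```python
import sqlite3

# Hypothetical toy schema: one extracted document with a logo, a phrase,
# and a numeric amount pulled out of the page.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER, logo TEXT, phrase TEXT, amount INTEGER)")
db.execute("INSERT INTO docs VALUES (1, 'American Tobacco', 'income forecast', 800000)")

# Text search: the literal string '500,000' appears nowhere, so no hits.
print(db.execute("SELECT id FROM docs WHERE phrase LIKE '%500,000%'").fetchall())  # []

# Database search: a typed predicate finds the 800,000 document.
print(db.execute(
    "SELECT id FROM docs WHERE logo = 'American Tobacco' "
    "AND phrase = 'income forecast' AND amount > 500000").fetchall())  # [(1,)]
```
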
Here’s another example: an RJR logo with filtration efficiency and a signature. Why do you care
about signatures? When do you sign? Typically, if you are
in management particularly, all you do all day long
is like this, because that’s
an authorization. So this is a way
of identifying something that has a logo. It talks about filtration, and somebody has
an authorization that did something about it, hence this document. Another example is five– find the five
highest signatures of dollar allocations
by people. Now, you can imagine
there are some organizations that care about payments– that pay out to people via certain
authorizing people… I guess. Too many “peoples”
in that commentary. But the goal–
in order to solve that, you have to identify
all the documents signed by the people. You have to “sum total” any
money that they talk about, and then you have
to sort them in order of highest pay. But that’s only the relatively
easy part of the challenge. How many have–how many people
have signed ten things in a row? More of you than that, I’m sure. The problem is, when you sign
ten times in a row, how similar is your signature? Hundred times in a row? For that matter,
every time I sign something, it looks different. So the fact of the matter is, you have to do
signature matching. And unlike simple,
straightforward things, signatures are
very hard to match. But we cheat. How do you think we cheat? Well, we take different
extraction procedures, and we’re integrating
everything together, and some byline is
gonna have a name, or some header’s
gonna have a name. And that’s the way
we match it.
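
A minimal sketch of that cheat, on made-up data– instead of matching signature images, attribute each signed document to a name the byline or header extractors surfaced, then sum and sort:

```python
from collections import defaultdict

# Made-up records standing in for the integrated extraction output: whether
# a signature region was found, any name the byline/header layers surfaced,
# and a dollar amount pulled from the text.
docs = [
    {"signed": True,  "byline": "R. Stone", "header": None,       "amount": 120000},
    {"signed": True,  "byline": None,       "header": "R. Stone", "amount": 80000},
    {"signed": True,  "byline": "J. Smith", "header": None,       "amount": 50000},
    {"signed": False, "byline": "A. Jones", "header": None,       "amount": 999999},
]

totals = defaultdict(int)
for doc in docs:
    name = doc["byline"] or doc["header"]   # the integration "cheat"
    if doc["signed"] and name:
        totals[name] += doc["amount"]

top_five = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top_five)   # [('R. Stone', 200000), ('J. Smith', 50000)]
```
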
So these are the five people who paid the most
amount of money in some payments on
tobacco litigation. Here’s another example. This is a query about
the associations of people: where they’ve paid,
who they’ve paid, who Dr. Stone has paid. Again, not your typical search that is supported. And, of course, when I
actually say that I built it, I actually didn’t do it– I was just running the effort. These are the people
who actually did it. And they did it over time, and over time, they’ve changed
their affiliations. Some are students,
some are collaborators, some are colleagues,
and so on, and so forth. And since I am an academic, we have to worry
about publications. In the new age
of universities today, a lot of people are
looking at patents, so there are some patents
about it later on. So that was kind of basically
in the middle ground. It was somewhere between
research and practice. We’re trying to solve
a real problem. We were trying to get something
in the prototype stage. But it hasn’t really
seen major deployment or any specific deployment. The next effort
we’re talking about is a little bit more… to the point. And it’s actually very relevant to systems that basically
have user reporting. And user reporting systems have
wildly interesting spellings and have wildly
interesting phrasing and have wildly
interesting grammars. And when you have a real
problem searching those, like you have in the various
reporting systems like NASA has, you need to come up with
such a solution. So, searching in
adverse conditions– what does it mean? Spelling is difficult. It’s getting worse
as a problem, because we’ve got autocorrect
for everything, so basically people learn how
to spell less and less well, and then they basically
try to spell using the “Twitterese” or any other
encryption mechanism that they call spelling. And the spelling gets
very difficult. As a faculty member,
I look at spelling when people write on exams, and it’s getting
worse and worse. But it’s even worse when
you’re starting to spell in foreign languages, particularly if you don’t know
the foreign language, but you’re gonna do a search that is related
to the foreign language. So one such situation exists in a collection called
Yizkor Books. And the other issue is
when you start dealing with medical texts, and I’ll tell you about
both of them independently. So Yizkor Books– “yizkor,” the word in Hebrew,
means “remember.” It’s a collection of books that
are scattered around the world, including in the Holocaust– United States Holocaust
Memorial Museum. And in this collection,
the restriction– there’s restricted access, and there’s also a section that actually has an archive. And the archive–
people come in there and say, “I’d like to find information about someone.” And so, “Okay, fine, who
are you looking for?” “Oh, they lived in some city.” “Okay, fine, when is this from?” “Oh, this is from
World War II era.” “Okay, fine. Could you be a little bit
more specific? Where did they live?” “In Europe.” Not really helpful,
but it’s a start. “Okay, what about– could you tell me
what language they spoke?” “Oh, yeah, yeah, yeah.
They spoke German.” Getting a little better. “What was the name of the city
they lived in?” “I don’t know, but it had a… ‘burg’ in it.” “Anything else?” “Um, not quite. I don’t know. It kind of sounds like…” So, I said, “Okay, so where
would be an example of this?” So one example of such city is called Bratislava. Anybody heard of Bratislava? It’s in Slovakia. Does Bratislava have
the “burg” in it? I don’t know how to spell
Bratislava too well, but it doesn’t sound like
it has “burg” in it. But it does,
because the word “Pressburg” was another name for Bratislava. And that one does have it. So it gets a little difficult– and they did speak German, because the Jews of that area
called it Pressburg, and they actually were
speaking German. So we wanted to build
a search system where people didn’t quite know
how to pronounce things– people didn’t quite know
how to, location-wise, find it. People were–but yet
they wanted information. And we built a system that basically has
such an interface. But most importantly,
when they do a search,
they actually can find
results from documents that have
multiple languages in the same document, let alone different documents in different languages. And it produces
a simple ranking system. Now, as you can see, it’s not doing
particularly great. But it is finding it. And the way it finds it is based on
this simple algorithm. If you look at
the top green box– segment section– that is basically
breaking things up in random pieces. And if you look at
the bottom green box, that is a traditional approach
of what people do. It’s called n-grams. N-grams is sliding
character windows that overlap each other in order to find
what you’re looking for. The problem is–
it’s sliding windows. And sliding means contiguous. And if it’s contiguous,
you cannot find some things that are basically chopped
in the middle. So the top one takes care of it. And we built this system, in use in the archive section
of the Holocaust Museum today. And it uses simple,
simplistic rules to break things up, which look kind of
crazy, but they’re there to add variety and mutation into things.
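
A minimal sketch of the contrast: char_ngrams is the traditional contiguous sliding window, while random_segments stands in for the rule-based breaking– the actual USHMM rules aren’t reproduced here, so the random break points are an assumption:

```python
import random

def char_ngrams(word, n=3):
    # Traditional approach: contiguous, overlapping character windows.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def random_segments(word, pieces=3):
    # Stand-in for the rule-based breaking: split at random points, so a
    # name garbled in the middle can still share pieces with the query.
    if len(word) <= pieces:
        return [word]
    cuts = sorted(random.sample(range(1, len(word)), pieces - 1))
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]

print(char_ngrams("pressburg"))      # ['pre', 'res', 'ess', 'ssb', 'sbu', 'bur', 'urg']
print(random_segments("pressburg"))  # e.g. ['pr', 'essb', 'urg']
```
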
And here’s a standard way of evaluating it. So the way you evaluate
these systems is by taking– the standard approach is
a language-based one called D-M Soundex– it’s a Soundex approach; it’s by sound– which doesn’t help if
you cannot pronounce it. Anybody speak Czech here? Ah, well then, you can
invalidate my statement. I don’t speak Czech. But there are words
that are yea long that just forget about vowels. They just don’t think
it’s necessary. And for somebody like me
to try to pronounce it is a new experience. Make a long story short, doesn’t help
the phonetics of it. So the way that, generally– other approaches
is called n-grams, and ours is the one
at the bottom. And what you’re looking at is basically a standard way
of evaluating search techniques for letters. It’s basically–
you add a letter, you drop a letter,
you replace a letter, or you swap a random pair. And what you’re looking at– the ones in red
are the ones that have the best score. And what you’re looking at, and when you talk
about the rank, is where in the list
did you find that– what was the average rank for all the collection
that you tried? And the reason
there’s two numbers is for the bottom number– sorry, the top–
the bottom number is what did you find that
n-grams also found? So the USHMM search
finds everything the n-grams search does,
plus some other ones. So if you only look
at what the USHMM did, and what the n-gram–
the n-gram also found, it’s the bottom number. And as you can tell, these are ugly numbers, but what you want to see
is where the trend is. And this is if you wanted
to get really scary. Add two characters,
add three characters, add four random characters– remove two, three, and four– replace two, three, and four– and swap two, three, and four. And what you see is,
the trends are the same.
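
A minimal sketch of those perturbations– it assumes nothing about the benchmark’s exact generator, just adding, dropping, replacing, or swapping characters k times to garble a query:

```python
import random
import string

def perturb(word, op, k=1):
    # Garble a query k times: add, drop, replace a letter, or swap a pair.
    chars = list(word)
    for _ in range(k):
        if op == "add":
            chars.insert(random.randrange(len(chars) + 1),
                         random.choice(string.ascii_lowercase))
        elif op == "drop" and len(chars) > 1:
            del chars[random.randrange(len(chars))]
        elif op == "replace":
            chars[random.randrange(len(chars))] = random.choice(string.ascii_lowercase)
        elif op == "swap" and len(chars) > 1:
            i, j = random.sample(range(len(chars)), 2)
            chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

for op in ("add", "drop", "replace", "swap"):
    print(op, perturb("bratislava", op, k=4))
```
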
So these are the algorithms used today in the archive section to find names. But I’m in a medical center– one of my appointments
is the medical center. And if you’re in
the medical center, you basically want to show
that you’re doing something that’s useful for medicine. So we actually tried
to deal with transcriptions and how many errors occur. Now, by transcription error– how many of you have
been to a hospital? How many of you have been
treated in a hospital? Quite a few of you. How many of you would wish
they would not be there during the time shifts
between– when nurses go from 3:00 to 3:30 or from 7:00 to 7:30? The nurses overlap. And the reason they overlap is,
they hand off the information from one shift to the next. But a lot of times,
they kind of write things down. And so–and how many of you have
seen doctors’ handwritings? They don’t write things down;
they kind of scribble. The bottom line is that
transcription errors and the names of medications and so on– forget about it. So a lot of these errors occur, and they account for quite a bit
of the possibilities. Now, not all–
not all, necessarily, are transcription errors. But some of them are. So we ran it
on a medical dictionary. And these are standard
characteristics, and we were actually
quite happy. We did okay. And again, we did fairly well. In fact, we found
almost everything, even when you basically tried a complete mess of four swaps, and so on, and so forth. So that just shows us
a little bit more. And this is, again–
this is basically a straightforward combination of modifying n-gram solutions that are well-known, adding a little noise into it, because noise helps– by the way, how many of you deal with optimization
algorithms here? Some of you. There’s famous
optimization algorithms. There’s things like
simulated annealing in genetic algorithms. And you know what
a key of their success is? Add some noise: mutations, random heatings. Right?
So we did the same thing here. And it worked here too. Again,
these are the people involved. And by–when I say “I,” I didn’t do much. It’s obviously
my collaborators. And we thank them–
the massive users of the system– for their comments. And again, being an academic, and, of course, my students
very much appreciated the last academic stint to Santorini in April. Very nice place. And now I kind of
want to leave you with it, to show you that really, we do some things that are,
hopefully, for the future. And I’m going to talk about
searching social media, and I don’t need
to motivate that. I don’t need to give you
a NASA program that does that. It’s everywhere. Everything they do
is social media. So I’m gonna talk about public health surveillance. And the way that
it’s usually done is, it takes a long time, because it takes a huge amount
of human effort. It basically involves when
somebody feeling sick comes around
and goes to the doctor. And enough times different
people go to the doctor, eventually the doctor
reports it. If enough doctors report it, eventually it’s caught. But it takes a long time, and therefore, you need to expedite it. And the way you expedite
almost everything nowadays is social media. So, how do you deal
with social media? Well, we’re not the first. In fact, many people
have done it. Typically, they talk about
a known basic problem. So they’re gonna say, “I’m gonna look for influenza.” And they’re gonna
find influenza. And they’re gonna look
through it, because they know what they’re– exactly what
they’re looking for. This is a problem if you don’t know what
you’re looking for. Often,
they use complex solutions. I firmly believe
in the KISS principle. Anybody know what
the KISS principle is? I’m sure you do. Keep it… [unintelligible chatter] I’ll leave it to you
to complete the rest of it. So the–and those don’t
usually work, because complex solutions
take forever, and they aren’t really
heavily adopted. Or they use their own resources
that you can’t get access to. For example, query logs. Query logs are heavily used. Do you have access
to the query log? Not really, so you can’t do it. So what we wanted to do was, we wanted to change
the old way, which was saying,
“Is there a flu? Are you looking
for something specific?” To general things like, is there something occurring? Is there something new? If so, yes,
there is something occurring. It’s the flu. So what we have is, we have
a collection of tweets. It’s 2 billion tweets, which were given to us
by Johns Hopkins. We basically took those tweets– we partitioned them by time. We then checked for them
being trending. And once we identified
what is trending, we saw if it’s something
that should alert us. That’s the nutshell,
but I’ll go in more specifics. So here’s a tweet collection– 2 billion. We cleaned out the ones
that are not health-related. We left 1.6 million.
We then partitioned those
over time. We cleaned them up some more– got rid of some punctuation, stop-words, and so on,
and so forth. Then we found out what
is actually associated with one another. And we saw that if
you took the tweets “pounding head”–
“pounding headache,” “sore throat,”
and “low-grade flu and fever” versus the other ones, you saw that certain things
had a certain sufficient amount of support of what’s going on. So “flu, sore throat” had support of 3, “fever” had support of 3, “cough” had support of 2. And basically,
we decided to clean out and keep only things
above a certain threshold. So that’s an example.
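
As a toy sketch of that support counting– the tweets, terms, and threshold here are made up:

```python
from collections import Counter

# Toy tweets for one time slice.
tweets = [
    "pounding headache and low grade fever",
    "sore throat and fever all week",
    "flu with sore throat cough and fever",
    "cough wont stop",
]
terms = ["headache", "sore throat", "fever", "cough", "flu"]

# Support = number of tweets in the slice containing the term.
support = Counter(t for tweet in tweets for t in terms if t in tweet)

MIN_SUPPORT = 2
kept = {t: c for t, c in support.items() if c >= MIN_SUPPORT}
print(kept)   # {'fever': 3, 'sore throat': 2, 'cough': 2}
```
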
Then we decided to see if that information is trending. And what we did was,
we looked for– oh, we looked for slope. And, an aside– how many of you are
still writing your dissertations? Or thinking about
writing dissertations? Thinking about writing
research papers? Okay. I said we looked
for the change in slope. Bad word. I’ll tell you a better word. We changed–we looked for the– for a derivative, because slope is delta-y
over delta-x… Not to be confused
with derivative, which is delta-y over delta-x. Sound the same?
It is the same. But–but– one is easier to publish with, ’cause it sounds very
sophisticated: “derivative.” And one is much–sounds like
you’re in junior high, you’re talking about “slope.” So we did derivatives. We didn’t do slope. Forget the slide. So here’s an example of things
which we looked for. It’s derivative. So here is a trending decision. This one occurs
very frequently, right? So the term–so “feel sick”
occurred very frequently over time. This one did not
occur so frequently, but it trends. The first one does not trend, because although it’s high, there’s not much change to it.
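
A minimal sketch of that trending decision, with made-up counts– high-but-flat doesn’t trend; low-but-rising does:

```python
def trending(counts, threshold=10):
    # A term trends when the change between consecutive time slices
    # (a discrete derivative) is large, not merely when its level is high.
    derivatives = [b - a for a, b in zip(counts, counts[1:])]
    return max(derivatives, default=0) >= threshold

feel_sick = [95, 97, 96, 98, 95]   # high but flat: does not trend
allergies = [3, 4, 5, 30, 62]      # low but rising fast: trends

print(trending(feel_sick))   # False
print(trending(allergies))   # True
```
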
We took those, and we used Wikipedia to map things onto like sections. And why Wikipedia? Because it’s layman’s terms. What are tweets written in? Do you see–
do you see “neoplasm” or “sarcoma” listed in tweets? No, you see words like “cancer.” So we wanted a clean–
use clean English. We looked for where in Wikipedia it mapped onto a concept. So here are the words
that mapped on. “Sore throat” we knew is
actually a medical thing. How do we know that? We cheat. See the red circle? It says ICD-9 and ICD-10 codes. What are ICD-9
and ICD-10 codes? Billing codes– billing codes for medicine. Now it’s ICD-10; previously it was ICD-9. If it has a billing code, it’s a medical condition… most likely.
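
A toy sketch of that cheat– the mapping below is a hypothetical excerpt, and the test is just whether the concept carries an ICD code:

```python
# Hypothetical excerpt of a concept -> ICD-10 mapping pulled from
# Wikipedia infoboxes; the codes here are illustrative.
ICD_CODES = {
    "sore throat": "J02.9",
    "influenza":   "J11.1",
}

def is_medical(concept):
    # If the concept carries a billing code, it's (most likely) medical.
    return concept.lower() in ICD_CODES

print(is_medical("Sore throat"))   # True
print(is_medical("homework"))      # False
```
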
And we compared reaction to flu. And as you can see– the corrected Google, and the CDC, and ours– you shouldn’t
compare the scale. But you should compare that– you see we detected the trends
at the right time. So we were a little optimistic and said,
“Maybe there’s a hope for it.” But I have to give you
a word of warning. Using tweets, while
they are helpful at times, they’re not quite as helpful
in others. So it is true that
the landing on the Hudson and the Mumbai terror attacks were detected quickly
in Twitter. It is also true– slightly less accurate is the flu tweet detection. Hurricane Sandy
had a bunch of misses. In fact, some cases were– the misses were indicating
the wrong locations. And some locations
for the meetups– the wrong locations occurred off to the east of New Jersey, which, if you know geography, is not exactly a very
useful place to meet. But most wonderful
was my favorite– the celebrity deaths– all ’cause of the “Colbert Report.” How many of you used to see
the “Colbert Report”? He had a show of speaking
to Jeff Goldblum from the dead, because on Twitter,
Jeff Goldblum was killed. Only problem? Not problem–only good thing is that Jeff Goldblum isn’t dead. So, Colbert talked to
Jeff Goldblum from the dead, because obviously,
Twitter had killed him. So don’t necessarily bank on it
being the case. So, we actually looked to see
what it can do. And we trended on “sinuses,” and we noticed
that there’s a problem around the April
and May timeframe. We then looked at
“allergic response,” and we saw that that trended
along April and May as well. And in fact,
so did food allergies along that same timeframe. So we said, “Maybe there’s
something to it.” But the truth is that what we used is not
to validate a situation, because if you validated
that Jeff Goldblum is dead, you would’ve validated it true. And it’s false. Social media should not be
necessarily the guide for validation of a concept. But it should be a guide,
potentially, for telling you if something
potentially has changed. Now, it may be a false alarm, or it may be reality. And if it’s reality, you will detect it
by other sources. It at least will give you
something to do. It’ll give you something
to explore. So it is very useful as
a hypothesis generator, not so useful as
a truth indicator. And that’s what we did. And by “we,” I mean all these people, including Alek Kolcz when he was at Twitter, and so on, and so forth along the way. And here are some publications. Again, the student appreciated it. He went not only to Cologne, but he continued
to go to Slovenia and had a two-and-a-half-week
vacation courtesy of yours truly. So, what do we do? I showed you that the whole is greater than
the sum of the parts. When you integrate things,
you can actually get somewhere. You can solve problems
you couldn’t solve otherwise. I showed you that searching, which we all know
is very easy, is not so easy if you can’t
really do the search And if you’ve got very,
very adverse spelling and grammar and conditions
along the way. And I showed you
that social media should be used as
a warning mechanism or an alarm indicator. But it may not be
the ideal situation for you to go solve via truth indication. So let me conclude with one… the way I always conclude. I always conclude
with a bunch of statements. One, I–
I have three cardinal rules. Rule number one is always finish on time. The reason it’s important
to finish on time is, if you don’t finish on time, people start to wonder if you didn’t know how
to organize your talk or if you didn’t organize it. So rule number one,
finish on time. Rule number two… rule number two is
always leave room for questions. And it’s very important
to leave room for questions, ’cause if you don’t leave room–
time for questions, people will start to say, “Okay, this was a canned speech. Person had it rehearsed. Was not–did not want
to answer any questions, because they were afraid that they’ll be asked questions that they don’t know
how to answer.” So rule number two is, always leave room for questions. But not any rule is as important
as rule number three. Rule number three is by far
the most important. Rule number three is,
never leave room for too many questions, because if you do, they will realize that you didn’t know what
you were talking about, and they will ask the questions
that you cannot answer. So I was told that
I would have a total of about 45 minutes. I was told I should
leave room for questions. And it is now exactly
43 minute– 42 minutes and 25 seconds
into the talk. So I very much
thank you for coming. And now I open the floor
for questions. [applause] – Very good, very good. [applause] So great–great rules, and so we have time
for questions and maybe some deep questions that will challenge
the speaker. If you have a question,
please raise your hand and wait for the microphone and ask a question. – In one of your
tables of results, you had drop a character,
add a character, drop multiple characters,
add multiple characters… – Those are different tables, but yeah. – So it seems like
for misspellings, you’d need combinations
of those. Did you also include
dropping and adding, and– or maybe that was
replacing, or… – So we tried–we added– we evaluated using
adding a character, dropping a character, replacing a character,
and random swaps of characters. And we did it up to
a combination of four. So we basically–if you actually
looked at what we searched with and you compared it to what– to actually what we really
meant to search with, it was completely different. You wouldn’t recognize
some of the terms. And we also used actual,
real logs of– of the users to compare as well. – Right now, the FBI and CIA are doing huge numbers
of searches to try to figure out
if there are people stalking us at this very moment,
et cetera. Are you working on any of that, and does some of the– some of the approaches
that you’re using help to root out this evil part of our society? – Am I working on… stalking people? [laughter] I try not to stalk people. Right. I work on technology. Technology is used in
various different ways. And I’ve built various
search systems along the way. So I don’t actually know what people are using
the systems or the algorithms or the–
that I’ve built. So I really can’t answer that. But I try not to stalk people, if that’s the question–
answer to your question. If I didn’t answer
your question, ask it again. – I was thinking
more in terms of utilizing the technology
that you’re using to sort of read
between the lines in some of these
communication systems that people are using. Did you hear what I said? – I heard some of it.
Could you please repeat it? – Oh, I said, I was thinking
more of using your technology to read between the lines in some of these–some of these
communication systems– you know, organizations like
ISIS is supposedly using over the Internet
and whether or not– I mean, you’re not
directly involved, is what you answered me
in saying, but– so you’re not working
with the FBI and the CIA, as far as you–
as far as you know, your technology is not
part of that? [laughter] – As far as I know– – If it were,
you wouldn’t tell us. – As far as I know, no, but I can tell you
that you can search… you can search basic
different languages along the way using different algorithms
and different approaches, some of which I’ve actually
used and developed. – Thank you. So what you are doing is great, but before you can do this, you have to have these
documents digitized in this electronic form. And that’s a big barrier, both technically
and, like, legally. So not all documents are available for this,
actually. Maybe too few documents
are available for this. So how about this barrier? – So the reason that
we had the first part that I talked about
is in order to try to get “scanned documents” into a form that you can
actually OCR some of it– so you can actually do
some of this interpretation. The goal is that
if you can do the OCR– and it’s a big “if”– if you can do the OCR, then you can use some of these foreign language
search techniques, or foreign garbled
search techniques, to try to correct some
of the OCR errors on top of it, or at least try to
search the documents that have them, even if they’re
poor OCR corrections. But yes,
digitization is a process, and OCR doesn’t always work. In fact, it works–
it often fails. And some documents
from various collections, we’ve actually tried to help
people actually OCR, and we basically
put our hands up and said, “Never gonna happen,” at least never as far
as computer science, which means five years. – Are there any other
lessons you learned from your work that you can
apply to other areas of your life? – Have I learned any lessons? I learned a lot of lessons. First and foremost,
when you do search technology, you should actually
get your head– or do any evaluation
of search development– you better have a solid– a benchmark to actually
evaluate your systems with. Gold standard is– there’s no replacement
for gold standards. Every approximation
that you have pales to actually having
the reality of it. I’m not saying that you
necessarily will always have it, but you should try
to aspire to it. Another thing is, get really
solid graduate students. Being an academic, I made my…career, or whatever it is, based on the hard work and ingenuity of
my graduate students, to whom I’m indebted– and also my colleagues. But really, the fundamental work is done by
the graduate students, so I guess–resources. Get the resources you need
is probably the best way of grouping
everything together. So that’s the short answer
for you. – So, in your Twitter example, you got a very large database
from Johns Hopkins. Is it possible to actually
apply your model to live data, and is it something that
anyone’s planning to do in the future? – So it’s interesting you ask. The initial goal was
to try to do it on live data. In fact, we had an intern– collaborator plus a researcher at Twitter at the time that was gonna try
to do it live. We wanted to do–
what we wanted to do was a real, live feed,
and basically be able to– to parse any trend situation for different domains. You need a domain,
’cause we have to identify the vocabulary and the like of what you’re trying to track. But…that was the intent. Hasn’t gone very far. We’re trying–we’re trying to do
some of that now. But it’s still in its infancy, so I can’t really tell you if
it actually will work or not. That’s why when I pieced
this talk together, it was what is kind of ripe
for reality– what is–what is in use, and versus what is–
you hope will be eventually ripe for reality. – What are the major setbacks? [inaudible] – So vocabulary was one. Two was noise. You need
a strong enough signal– you need a base,
enough general vocabulary, so that you’ll be able to– be able to track
enough information. You need it specific enough so that basically you can
isolate it correctly. And the problem comes in that
you need also enough noise to be able to detect
in the live stream when it spikes
and when it goes down. So isolating noise and non– non-major topics was
the biggest problem. If you have a major topic
like influenza, no problem. If you have a topic that
basically deals with food-borne illnesses,
we didn’t quite get as far, because geographical coding
of food-borne illnesses and people tweeting on it
was not sufficient for us to catch it, as of yet. I’m hopeful but cannot guarantee it for you. Wow, now we’re way over. – So with that, please join me
in thanking Dr. Frieder. [applause]
Thank you very much. [applause] [electronic sounds of data] [musical tones]
