Open Data + Robust Workflow: Towards Reproducible Empirical Research on Organic Data


– And hello everyone again. It's a great honor for both my collaborator Nan and me to be here and to share our thoughts about organic data. A little bit of background about the two of us: we come from a different world in terms of scientific community. Here I really appreciate the luxury of talking a lot about how to write papers. There is some debate in our discussions about the trade-off between pursuing papers versus pursuing science, right, but we live in a different world: our debate is about pursuing science versus pursuing money. We're talking about real money, because our teaching load, our promotion and tenure are all correlated with the exact dollar amount in our internal research accounts at Penn State. Whether I teach two courses or one course this semester really correlates with how much money I have. So in our world we are talking about real currency, and it has been such a luxury for the two of us to enjoy the research discussion here and shift our focus a little bit toward research output rather than money being the input to our research world. We have really enjoyed this discussion, and I think Peter's presentation laid a very solid foundation and shifted our discussion space. In my terminology, as a one-sentence summary, Peter talked about theory-driven discovery research versus data-driven research in the management space. And here we are using a slightly
different terminology here. We refer to the more data-driven type of research, you know, in today's big data world, as organic data. What do we mean by organic data? Organic data is the opposite of design data, where design data come from the traditional social science methodology: we have explicit design mechanisms, explicit data collection mechanisms, followed by very robust statistical analysis to analyze those data. So we frame that data space as design data. In the other direction, because of big data, we have seen the Twitter datasets. Or even my phone: while I am talking here my phone generates a lot of data about me. The Facebook app knows my exact location here and may be talking to another app and sharing my contact book with other apps while I'm talking. So organic data refers to data generated organically, via sensors, on the web and on social media, without explicit research design. How do we utilize this
kind of organic data? I have observed a couple of trends in the social science, or computational social science, research space in the recent five years from the social science perspective on big data research. One is aggregating large-scale design data, for example large-scale census datasets and administrative records, those big national surveys of the past 40 years, and then thinking about how to do large-scale design data analysis, which involves a lot of statistical advancement and innovation. The second trend I observe is that more and more researchers in the social science space try to utilize organic data. Twitter is one of the most popular datasets we observed in the top-tier publications across the fields of psychology, sociology and a lot of business fields. And the third trend I observed is the joining of these two sorts of data, the data integration: trying to integrate the organic data with the traditional design data and to see how the organic data can be complementary to our traditional, conventional datasets such as administrative records. As some of our presenters pointed out, design data have some limitations: they are self-reported, and survey results may not reflect some objective phenomena that are generated organically
from the real-life situation. So that's the space we're talking about. And, as I have already mentioned, we observe this across the different business fields' top journals as well as journals in psychology, such as Psychological Science, and even sociology, like the American Sociological Review. We gathered the papers from the past five years that have utilized these organic datasets in their published research results. So for today's talk we're going to present our preliminary thoughts, using Twitter as one
explicit organic data platform to think about the whole workflow process. And we summarized our
high-level abstraction of the research workflow based on our previous discussions
in this workshop and we heard a lot of
methodological discussions about performing research that collects design data: the different phases, the sources of bias, the sources of issues that may raise reproducibility, replicability or robustness question marks. What we observe preliminarily also raises this question: do these exact same methodological issues apply in research processes that deal with organic data? Our intuition is yes, but we haven't gotten there yet, so we are going to present some of our preliminary observations and maybe have your help to shape this stream of research. In terms of the exact operationalization of the different research phases, we observe that in the organic data space researchers deal less with subject recruitment, because the data is out there. In the context of Twitter, researchers spend time elaborating their methods, the explicit method of how they collect data from Twitter, such as the search keyword specification. They use certain kinds of filtering mechanisms to get the type of data that fits their research question
or the research context. Then we move to data processing. It is trending nowadays for social scientists to use different combinations of algorithms or solutions developed by computer scientists, such as engineering solutions, data integration, NLP/text mining, machine learning, sentiment analysis; we could name them all under the trend of AI. These are the different types of data processing in the research design phase that then lead to the final data analysis. We observe that even with organic datasets statistical analysis is still very popular, and in addition we also have other, more computational methods to analyze these datasets. So this is our observation of the so-called research workflow in the context of organic data generation,
collection and analysis. And the question from this preliminary assessment, our review of the Twitter-based publications of the past five years across different social science fields, is whether reproducibility, replicability, or what we have called the reproducibility crisis will occur in tomorrow's organic data research. We are seeing a trend of using organic data, but we haven't gotten to the stage of having a solid discussion about how we should perform this methodological analysis yet. We are not there yet. And we're still seeing reservations from some journals in the fields of business, psychology or sociology: they do not publish research results generated from organic datasets. We do observe that trend by comparing the past five years of publications using organic data across different fields. So the question mark is: shall we actually move forward and establish a solid workflow, so that more and more researchers can have confidence in research results generated with organic data, or shall we just walk away, saying that these organic data do not have a solid research design so we should give it up? You know, forget about it and just concentrate
on the design datasets? That serves as our research inspiration. One little difference between me and my collaborator here is that I actually have a lot of business school friends in my research community. (audience laughing) So for the broader presentation I think it's better for me to hand our question mark to my collaborator, who is trained as a computer scientist and who hasn't really been thinking about publishing in business journals or psychology journals yet, though he might as a hobby. You know, as part of my hobby I still aim at writing for some business journals in my personal time, although that's not encouraged in my current institutional environment. So next I will have my collaborator present our observations in this space. – Thank you Heng for
setting up the discussions. Well, Heng said it in a nice way, so let me put the point I want to drive home in a less nice way. The point I want to drive home is that for organic data analysis, reproducibility is not the Holy Grail. I can disclose to you every single bit of the operations I do during data analysis and I can give you the entire dataset I used for the research; you can reproduce my findings, every single bit of them, and it still doesn't mean the research finding is reliable and robust. Why is that? Because we do not know how to scrutinize the methodology used for data analysis. You tell me exactly what you do, I look at it, I say, well, it sounds reasonable. But is it really reasonable? I'll show you four examples here. The point is not to point fingers, even though all these examples come from papers published in the venues Heng talked about. One thing I want to note is that every single example I give is actually a methodology shared by more than one paper; it's not just that a single piece of work adopts this methodology and there could be problems down the road. There is really a pretty prevalent misunderstanding, or misconception, of what is or is not acceptable in a data analysis workflow. First, I guess if you talk with computer scientists about the reliability of an algorithm or a workflow, the first thing that comes to mind would be the devil of parameters. In order for you to process
data you have to set parameters, sometimes lots of them. These are examples of parameters. We had to remove sentences
with fewer than five words because they are too short. Okay, we consider the
500 most frequent words because you can’t expect us to
consider all of them, right. The list goes on and on. Well those are concrete
numbers, so at least they are concrete; and then there are less concrete specifications, like removing accounts that had high daily traffic to avoid spammers. But what is high? Maybe that's not disclosed, and that is a transparency issue. The point I'm trying to make here is, with all these parameters, if I ask you whether they sound reasonable, what do you say? Well, they could sound reasonable. I don't know without looking at the data. Even if I looked at the data I perhaps wouldn't know. If I tell you that once I change that five words to four the research finding just falls apart, then it means the research design is not reliable. And make no mistake, these parameters, as used in today's practice, play an extremely important role in the validity of research findings. I'll show you just one example; these are real numbers from a paper. You collect 187,000 tweets. Okay, that's your raw dataset. Then you apply some of the filters I listed on the previous slide. The first step, you remove the ads and spam, and it goes down to 71,000. Then you say, I balance some topics that are over-represented, and it comes down to just 10,000 tweets. So you are actually keeping just 10% of the raw data you collected, based on some arbitrary parameters that no one can provide any justification for. So you had better justify that if I change those parameters your results remain valid. But that's very hard to do. Why? We actually talked about pre-registration yesterday as quite effective in the case of design data, for the reason that, as Joe was arguing yesterday, the file drawer problem is not a significant concern in design data: if you just file away every failed experiment, well, you had better be very rich in order to find one that actually has a low p-value. Here, however, you don't have to be rich. Testing a set of parameters
is essentially free: you rerun your program with a different set of parameters and you get the results. So what's preventing you from pre-registering 10,000 different parameter combinations, hoping one of them would work and give you what you want? That's a challenge. I still think pre-registration could be effective, but we perhaps need to think about how to implement it in this case. Another idea is: how about we automatically check parameter variations? You give us your dataset and your program, and we'll design a system that automatically varies those parameter settings and then sees whether your results remain valid. It sounds like a feasible idea, but again there are research challenges here. These parameter settings are correlated. You can't just arbitrarily choose a random combination and hope the results still make sense; sometimes you change a few of them, you have no tweets left, and of course the results don't make sense. And by the way, look at the space of parameters; it's really huge, right? If you have 10 possible settings for each of 10 parameters, that's 10 to the power of 10 possible combinations. How do you automatically identify the kind of envelope of parameter-setting combinations that actually work, so we can understand the reliability of the research? That itself needs research that has not yet been done. That's the first example.
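To make this concrete, here is a minimal sketch of what such an automated parameter-variation check could look like, in Python. Everything in it is hypothetical: the toy tweets, the filter parameters (minimum word count, maximum author daily posts) and the "effect" being measured are invented for illustration, not taken from any of the papers discussed.

```python
# Sketch: sweep the filtering parameters and see whether the headline
# statistic survives.  Toy data and toy effect measure; illustrative only.
import itertools

tweets = [
    {"text": "fracking is ruining our water supply", "author_daily_posts": 3},
    {"text": "great game tonight", "author_daily_posts": 1},
    {"text": "buy followers now cheap cheap cheap", "author_daily_posts": 240},
    {"text": "new study on fracking and earthquakes", "author_daily_posts": 5},
    {"text": "so tired", "author_daily_posts": 2},
]

def keep(tweet, min_words, max_daily_posts):
    """The kind of ad-hoc filters papers often describe."""
    return (len(tweet["text"].split()) >= min_words
            and tweet["author_daily_posts"] <= max_daily_posts)

def effect(sample):
    """Toy 'finding': share of remaining tweets that mention fracking."""
    return sum("fracking" in t["text"] for t in sample) / len(sample) if sample else None

effects = []
for min_words, max_daily in itertools.product([2, 3, 4, 5], [50, 100, 200]):
    kept = [t for t in tweets if keep(t, min_words, max_daily)]
    e = effect(kept)
    effects.append(e)
    print(f"min_words={min_words} max_daily={max_daily} n={len(kept)} effect={e}")

valid = [e for e in effects if e is not None]
print(f"effect ranges from {min(valid):.2f} to {max(valid):.2f} "
      f"across {len(effects)} parameter settings")
```

If the reported effect swings widely across this small grid, or entire cells of the grid leave no data at all, that is exactly the fragility being described; the open research question is how to explore this space intelligently when it has 10^10 corners rather than 12.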
The second example goes from parameters to algorithms. Now, in social science, business research and computer science, we all use
a lot of heuristic algorithms. The thing is, when it comes to evaluating these algorithms: if you read a paper on how they process Twitter data, they are essentially presenting an algorithm. How exactly do you evaluate that? The criterion I'd say we use today is really that we look at the algorithm and ask: does it sound reasonable? But that's a really hard thing to tell. Here is an example: research trying to find the city of reference in a tweet. Say that I design the following very simple, and certainly very reasonable-sounding, algorithm that simply looks at whether a tweet contains the name of one of the 30 most popular cities in the United States. The authors even go on to say that they exclude Washington, DC because it's ambiguous; Washington could be a city, it could be a state. That certainly sounds very reasonable to me, until you actually look at the data. Let me explain this chart a little bit. I'll explain
the blue bar first. The blue bar is a percentage of tweets: if you search Twitter for a city name, Twitter returns to you a set of tweets; if you then apply a state-of-the-art natural language processing algorithm to do entity resolution, the blue bar shows you the percentage that the algorithm believes are actually referring to the city. You can see Phoenix and Houston are not very high. Why? Well, because Phoenix could be a city, or it could have many other meanings. When someone is talking about Phoenix, they may not be talking about Phoenix in Arizona; they could be talking about Phoenix in X-Men, or Harry Potter. It's very difficult to judge whether the previous algorithm is reasonable or not before you look at the data. And someone might say: oh, by the way, the red bar is if you do a bit of further filtering, where you don't just trust what Twitter returns for a city name; you also say, I will do whole-word matching myself. Okay, I'll post-process the data and check whether it actually has this keyword in it. It doesn't really help. Even if it has the entire keyword in it, in many cases it's not really about the city. So suppose you say: you know what, I understand there's a problem, I will correct for it. If I know that only 20% of tweets containing Houston actually refer to Houston the city, then when I count the number of tweets containing Houston, I'll multiply the number by 20%. Wouldn't that solve the problem? It won't, because unfortunately this ratio changes day by day. This was an experiment we did last August. You can see there's a sharp drop for Houston, by the way, in the beginning of August, and we asked why. Why is there suddenly a day when the percentage of tweets actually referring to Houston as a city became much lower than on other days? It turns out to be the birthday of Whitney Houston, so a lot of tweets were actually talking about her rather than the city. That shows you, you can't
just take the number and then correct for the bias, because there are so many other variables at play. Now, one potential solution here is: let's try to pre-register the algorithm that we are going to use. That certainly sounds like a reasonable solution to prevent you from hacking around to find a seemingly reasonable algorithm to use once you have analyzed the data. But then the challenge is really: if you say, I'm going to use this algorithm in natural language processing, how much detail should you put in that pre-registration form? Can you actually afford to put every single line of detail there? Can you actually decide what algorithm you are going to use before you even see the data? That is a challenge. Again, I don't know what the solution is for that, but it is subject to future research. Another thing is that you can adopt state-of-the-art technology rather than come up with your home-brewed rules for processing data, like doing plain keyword matching when there are entity resolution algorithms out there, off-the-shelf systems that you can use.
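As a rough illustration of that difference, here is a sketch contrasting home-brewed keyword matching with an off-the-shelf named-entity recognizer. The tweets are invented, the spaCy model is just the usual small English model, and, as the chart above shows, the recognizer's own output is imperfect too, so this is a comparison of approaches, not a fix.

```python
# Home-brewed rule: a tweet "refers to Houston" if it contains the word.
# Off-the-shelf alternative: run a named-entity recognizer and keep the
# tweet only if "Houston" is tagged as a geo-political entity (GPE).
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import re
import spacy

nlp = spacy.load("en_core_web_sm")

tweets = [  # invented examples
    "Traffic on I-45 into Houston is terrible again",
    "Happy birthday Whitney Houston, greatest voice ever",
    "Houston we have a problem with this group project",
]

def keyword_match(text, city="Houston"):
    return re.search(rf"\b{city}\b", text, flags=re.IGNORECASE) is not None

def ner_match(text, city="Houston"):
    return any(ent.text == city and ent.label_ == "GPE" for ent in nlp(text).ents)

for t in tweets:
    print(f"keyword={keyword_match(t)!s:5} ner={ner_match(t)!s:5}  {t}")
```

Which tweets the recognizer accepts will vary with the model version, which is itself a choice that deserves the same scrutiny as the thresholds discussed earlier.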
The third thing, which I want to dig into a little bit deeper, is that sometimes we have to ask ourselves: you know, machine learning and these automated algorithms sound fancy, but do we actually need them? So here's a comparison. If you have 10,000 tweets
you need to process, you could say that's too many: if I asked my students to label them, I don't have that many students, so I'm going to use an NLP algorithm to help me do that. That's a reasonable argument, but before that, think: how about I just take 100 samples from these 10,000 tweets, ask my students to label the samples, and then extrapolate from there to draw my conclusions? These are the 95% confidence intervals you are going to get. If you ask your students to draw 100 samples and rate them, the ratio, I forget exactly what statistic it was, comes with a confidence interval of 0.8 plus or minus 0.08. If you ask NLP to do it: 0.68 plus or minus 0.01. The variance, the standard error, is certainly much smaller, because NLP processed many more of them, but that NLP algorithm has a bias, which means the answer you got, 0.68, is actually not correct; look, it's outside the manual processing's confidence interval. So do you want to sacrifice the unbiasedness of your answer just to get a smaller standard error? How do you evaluate the bias of these algorithms? All of those become open questions. So we actually did, you could say, a theoretical exercise, and we thought about combining the two: how about we combine manual labeling and automated techniques? We ask the automated technique to run first and then we randomly sample its output. By stratifying the sample we can optimally allocate human resources to verify the accuracy of the results. It seems that with this approach we can prove that the mean squared error you are going to get at the end, which is bias squared plus variance, is guaranteed to be lower than with either the manual processing or the automated processing alone. That might be something you could consider when processing the data. That's the second point.
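As a back-of-the-envelope sketch of that bias-variance trade-off: the true proportion, sample sizes and classifier bias below are invented, chosen only so that the two intervals land near the 0.8 ± 0.08 and 0.68 ± 0.01 figures quoted above.

```python
# Small unbiased manual sample vs. large biased automated labeler,
# compared by MSE = bias^2 + variance.  All inputs are illustrative.
import math

p_true = 0.80        # assumed true share of tweets that really refer to the city
n_manual = 100       # tweets a student can hand-label
auto_bias = -0.12    # assumed systematic error of the automated labeler
n_auto = 10_000      # the labeler is cheap, so it sees everything

# Manual estimate: unbiased, binomial standard error on 100 labels.
se_manual = math.sqrt(p_true * (1 - p_true) / n_manual)
mse_manual = 0.0**2 + se_manual**2

# Automated estimate: tiny variance, but the bias never averages away.
p_auto = p_true + auto_bias
se_auto = math.sqrt(p_auto * (1 - p_auto) / n_auto)
mse_auto = auto_bias**2 + se_auto**2

print(f"manual   : {p_true:.2f} +/- {1.96 * se_manual:.2f}, MSE ~ {mse_manual:.4f}")
print(f"automated: {p_auto:.2f} +/- {1.96 * se_auto:.2f}, MSE ~ {mse_auto:.4f}")
```

On these made-up numbers the automated estimate has the narrower interval but the larger mean squared error, which is exactly the trade-off the combined, stratified approach described above tries to beat.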
The third point is a little more complex. It's about what we call an unbalanced treatment-control structure on social networks. What do I mean by that? This is often what happens
with the design of experiments on social networks. You take a treatment group, you take a control group, you treat the treatment group with whatever you want to manipulate, whatever intervention you want to apply, and you compare the two. The special feature of social networks, however, is that it's very costly to have a large treatment group. Say you want to give them some premium membership; having a treatment group of 1,000 is very expensive. On the other hand, control groups are free: you don't do anything to them, you just observe how they behave after a certain period of time. You can take essentially as many control groups as you want; collecting a million control groups of 1,000 users each is a piece of cake. So that brings about a question, and we have seen this practice: because we have limited resources, instead of one treatment group and one control group, how about we take one treatment group and, let's say, 1,000 control groups? The idea is this: we have one treatment group, we apply our experiment to that treatment group, and then we measure the effect by comparing that one treatment group with the 1,000 control groups. The logic is as follows: if we measure the degree of effect and the treatment group outperforms 999 out of those 1,000 control groups, then I say, well, the treatment must be effective, because by chance that's just 0.1%. Is that reasonable? Can you actually say that even though I only have one treatment group, because I don't have a lot of money, I have 1,000 or a million control groups to back me up? Well, if you actually have
the null hypothesis saying that the effect on the treatment group is the exact same probability distribution as the effect on control group,
then yes that is reasonable. If the null hypothesis is they
have the same distribution and you wanna reject that null
hypothesis that is reasonable because in order to outperform
95% of control groups when the null hypothesis
holds, the chance is just 5% for you to see what you see. So that is correct. It’s just unfortunate that’s not really the null hypothesis you
want to reject, right? What you want to test, the
hypothesis you really wanna test is whether your treatment is effective. And if you actually
look at the distribution in the real world it’s very far from what you’re gonna see here. I just show you one kind
of extreme case example to illustrate the theoretical
absurdity of this before showing you some
real examples in real world. This is the absurd example right? Think about if this is
actually the distribution of that effect over the treatment group, you have 50% chance to
have the largest value, you have 50% chance to
have the smallest value. The treatment is not working. It’s not like people
are more likely to pay in your control group then in your, sorry, in your treatment group
then your control group. It’s the same, but think
about what’s gonna happen if you use this one treatment
and a million control idea. For 50% chance you are gonna say, well I outperformed a
100% of the control group. You say well practically
the chance for me to get to that is zero so this is
a very very strong finding. In this case it’s not. You have 50% chance to get to that, it’s just you haven’t
seen the other 50% yet. If you say this is an extreme case example and that’s not how
things work in practice. This is how things work in practice. This is actually what, so let me explain how exactly I did this simulation. So the way that we did this
simulation is that we take this idea of one treatment group and many control groups and run a simulation on a scale-free network generator, the Barabasi-Albert model, and we rerun it many, many times. Each time we rerun it, we regenerate a treatment group and a thousand control groups, and we measure what percentage of the control groups are outperformed by that treatment group. So each point on the curve is what you would report as the conclusion of the methodology I just talked about. After running this thousands of times, we ordered the number observed in each trial from low to high and plotted it right here. What I want you to see is that, according to the scale-free network generation model, you have about a 30% chance of concluding that you outperform virtually 100% of the control groups. If you draw a conclusion that way, it's not going to be reliable, because you'll have a 30% chance of claiming a p-value of 0.00001 or something like that. That's the third example. How exactly do we address that issue? There are actually
different ideas to do that. One is to identify outliers. A particular reason we are observing this is that we are dealing with a scale-free network, which means there's a heavy tail out there: a small number of users have an extremely large influence on what you are going to observe. So in some cases it comes down to luck: if you catch one of those users, great, you outperform everyone; if you don't, well, you are at the bottom. You have to identify these outliers. So how exactly do you identify them? Actually, the bootstrap sampling we talked about yesterday is a very effective idea. You subsample the treatment group, and there is a chance you are going to get a subsample without that high influencer, and then you will be able to tell that you actually hit an outlier and that's what made you see what you have seen.
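Here is a minimal sketch of that subsampling check. The network model, group sizes and the stand-in outcome (total degree of a group's members) are all invented for illustration; the point is only the mechanics of asking whether the group's rank collapses when random members are dropped.

```python
# If the treatment group's rank against the control groups swings wildly
# when a random 20% of its members are dropped, a single high-influence
# user is probably driving the result, not the treatment.
import random
import networkx as nx

random.seed(1)
g = nx.barabasi_albert_graph(3000, m=2)      # scale-free toy network
nodes = list(g.nodes)

def outcome(group):
    """Stand-in 'effect': total degree of the group's members."""
    return sum(g.degree(n) for n in group)

treatment = random.sample(nodes, 50)
controls = [random.sample(nodes, 50) for _ in range(300)]

def share_beaten(group, k):
    """Share of control groups beaten, comparing equal-sized subsamples."""
    score = outcome(random.sample(group, k))
    return sum(score > outcome(random.sample(c, k)) for c in controls) / len(controls)

print(f"full groups: treatment beats {share_beaten(treatment, 50):.1%} of controls")
ranks = [share_beaten(treatment, 40) for _ in range(100)]   # drop ~20% each time
print(f"subsampled : min {min(ranks):.1%}, max {max(ranks):.1%}")
```

A wide gap between the minimum and maximum here is the warning sign; with a stable group the subsampled ranks should cluster.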
Of course, another idea would be simply to have more treatment groups instead of just one. But the point I'm trying to make here is that you can't just assume one treatment group and a million control groups are going to give you the same kind of conclusion as you would get with multiple treatment groups. The last example is about operationalization. What do I mean by operationalization? Here is an example: if you want to measure the chatter about a particular topic
like fracking on Twitter, this is what people actually do in their research: they count Twitter messages that contain the word fracking. That is operationalization. We can't directly measure chatter, so we measure the number of tweets containing that keyword. Is it reasonable? It certainly sounds reasonable, until you actually look at the data and find that, you know what, a lot of people who talk about fracking don't actually use the word fracking. There are popular hashtags like "frack is whack". (audience laughing) And one important thing is that these hashtags have very significant skewness, if you will, in terms of opinion. You use "Obamacare" if you are against it, you use "ACA" if you are for it; that's a very famous example. The Washington Post actually did a study drawing the word clouds of congressional members after they voted for or against reopening the government when it was shut down back in 2013. You see their keywords are completely different from one another. If you are for reopening the government, you talk about reopening or about the shutdown. If you're against reopening the government, you talk about debt. How do you capture all of that? How do you operationalize
what you want to measure by finding keywords that not only match what you want but are also comprehensive enough to capture the entire spectrum of opinions without introducing bias? That is very, very difficult. Here is another example of operationalization. The idea is that we want to know the similarity between the brand images of two companies, represented by two Twitter accounts. We can't directly measure that, so the operationalization is to say: let's count how many of their followers are the same. How many common followers do they share? Is that reasonable?
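The measure itself is simple enough. A sketch of it, with made-up follower sets, might be the Jaccard overlap of the two accounts' follower lists; the exact similarity formula used in the papers discussed isn't given here, so treat this as one plausible version.

```python
# One plausible version of the measure: Jaccard overlap of follower sets.
# Follower IDs are invented; in practice they would come from the API.
brand_a_followers = {"u01", "u02", "u03", "u04", "u05"}
brand_b_followers = {"u03", "u04", "u05", "u06"}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

print(f"common-follower similarity: {jaccard(brand_a_followers, brand_b_followers):.2f}")
```

The arithmetic is trivial; everything that follows is about whether sharing followers actually means sharing a brand image.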
Well, it sounds reasonable if you believe people follow an account because they share the same values as that account. But is that really the reason in practice? There are some well-known contradictions, like the many people following @realDonaldTrump who are not really for Donald Trump but against him. That aside, there are some subtler reasons. In one case in the research, environmental-advocacy accounts were used to measure how environmentally friendly a brand is, and it really depends on which account you are looking at: EPA Research is the US version of it, and there is an EU version of it. You can see that their follower distributions are very different geographically. So if you count the common-follower similarity between EPA Research and, let's say, Rolls-Royce or Cadillac, you are going to conclude from the first account that Cadillac is more environmentally friendly, while with the second one Rolls-Royce seems to be much more environmentally friendly. Well, it's just an artifact of Cadillac having a larger audience in the United States. And it doesn't stop there. There are many other
such confounding factors. For example, this is the distribution of when their Twitter followers joined Twitter, and it's very different: the two accounts' followers joined Twitter at different times. Why is that? Think about following an account: you tend to follow a lot of accounts when you first join Twitter, and then, if you don't use the account actively, the follow is there forever. It's really not reflecting your commitment to, or your agreement with, the image of a brand; it's influenced by many, many other factors. That brings me to the last point I would like to make, which is that operationalization is tricky, and before we operationalize what we want to measure with some behavior we observe on Twitter, other social media websites, or the web in general, we had better understand that behavior first. We have some understanding of Twitter, for example that it over-represents minority populations and over-represents people with lower education and lower household income, but we don't understand, thoroughly at least, why exactly people follow other Twitter users, or why exactly they favorite tweets. Before we understand that, perhaps we should pause when operationalizing what we want to measure with that kind of Twitter behavior. And another question is: can we develop some automated
alerting techniques? Say we find that the keywords you have used, or the set of Twitter accounts you have been using, is very much biased toward one part of the world. Then maybe we can give you an alert saying: before you use this list, think again; maybe find more relevant accounts in some other region of the world, or accounts created in other years, that are not covered by your current list. So that is a potential solution.
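A toy sketch of what such an alert might look like, with invented accounts and attributes (region, sign-up year) and an arbitrary concentration threshold:

```python
# Warn when the account list a researcher plans to use is concentrated
# in one region or one sign-up era.  Accounts and thresholds are invented.
from collections import Counter

accounts = [
    {"handle": "@us_env_agency", "region": "US", "created": 2010},
    {"handle": "@green_us",      "region": "US", "created": 2011},
    {"handle": "@eco_us_news",   "region": "US", "created": 2012},
    {"handle": "@eu_env_group",  "region": "EU", "created": 2015},
]

def coverage_alerts(accounts, max_share=0.6):
    alerts = []
    for attr in ("region", "created"):
        value, n = Counter(a[attr] for a in accounts).most_common(1)[0]
        if n / len(accounts) > max_share:
            alerts.append(f"{attr}={value} covers {n}/{len(accounts)} accounts; "
                          f"consider adding accounts outside this group")
    return alerts

for alert in coverage_alerts(accounts):
    print("WARNING:", alert)
```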
But just to summarize the point I'm trying to make: I'm not saying that these are the only four kinds of problems in organic or web data analytics; there are a lot of problems like this. The point we are trying to make today is that for organic and web data analytics, and for research findings built upon them, the main challenge may not be reproducibility, or rather, we have to go beyond reproducibility. Simply sharing all of your data and every line of your code does not guarantee the reliability of your result. It can be perfectly reproducible, but terribly unreliable. (audience chuckling) So what do we need? We believe we need more research in methodological scrutiny; we have to understand how to scrutinize such methodology. Just as before all the scrutiny of questionnaire design was developed we perhaps did not know how to develop a questionnaire, today we don't really know how to develop this workflow of organic data analytics, and we have to do research on that. These are just some examples of how exactly we might do that. We have to isolate the critical design choices. Filtering sentences by the number of words in them: is that an important decision? We need to understand that. Filtering Twitter accounts based on whether a tweet contains a URL is a common practice that surprised me; many papers use it to filter out spammers. And you can find other papers that specifically look for tweets with URLs so they can understand how messages get propagated. They can't both be right: one is saying that what the other uses is noise. How do we deal with that? Which are the critical design decisions? And then we have to identify
potential sources of bias. What are the confounding factors? What are the biases
introduced in this process? Which of these biases, well
bias exists everywhere, but which are the biases that actually would
falsify a research finding? We have to understand that
and then we have to understand which part of the design
leads to that problem. And finally I think we
can provide guidelines for future research so at
least we can address some of the concerns that we see
today in organic data analytics. And thank you very much. (audience clapping) – [Participant] That was really interesting. So I guess one question is: a lot of the problems that you are describing arise because there's a lot of sample selection that goes on through the parameterization. Even just who uses Twitter. Like, I don't use Twitter. I don't wanna use Twitter. I can't even imagine why a
person would care about it. But that’s part of the problem. So my question to you would
be: Well what about other, I’ll call them more targeted techniques? ‘Cause what you’re talking
about is like you know just out there on the web and so there are a lot of factors that you’re just never
gonna have control over or maybe knowledge about. What about if you were
looking at emails within an organization or you
could get email and chat and other kind of communications
as a way of trying to capture sentiment for
example or things of that sort. And I use this as an example: I've heard a talk where IBM in fact does this, does this in real time. So I heard an explanation
where the CEO gave a speech. Some information was being
released to employees and then using Watson and IBM
Analytics in near-real-time they could do a sentiment analysis and feed that back to the CEO. So that’s much more targeted. I guess my thought is: What
do you think about that vis-a-vis the kind of problems
that you talked about? Just a you know a high level answer. – That’s a great question. So let me try to answer it first and then see whether
Heng has anything to add. So one thing I wanna
note here is that for this talk, the main purpose is not, I think, to identify issues with web data, whether it's representative or whether conclusions from it are generalizable. Rather, my main focus,
our main focus here, is to say that when you
use automated techniques like the sentiment analysis that IBM used to give feedback at real time, we have to be very careful
on understanding the implications of using
such automated techniques. So I would like to say there
is kind of a culture difference between computer science
research and social science and business research. A lot of computer science research is really about optimization: you want to improve the performance as much as you can. If it's 85%, 95%, that's good; we have a number, so you optimize for that. But in social science research, or at least some social science
research and business research you care about the final finding which is black and white. So that actually is a problem. Because when you translate a 95% accuracy to right or wrong of the final finding there’s no direct translation here. You have to understand
what exactly is in that 5%. Now I want to give you
just one example here. Let me see if I can find that picture. Okay, here we go. This is an example of using an automated algorithm, in this case image recognition using a deep learning algorithm. Of course this is an artificially constructed example, but I just want to show you how tricky that 5% can be. Image recognition, facial recognition, sign recognition, these are very mature; the accuracy rate is greater than what you get by actually asking human beings to do it. So this is one of the places where automated techniques work best. But here is the demonstration: these two images may look exactly the same to you, yet a deep learning algorithm could treat the left one as a stop sign and the right one as a yield sign, because the right one was specifically constructed, in the area called adversarial machine learning, to trick that algorithm. Why exactly does this happen? Well, we don't really know, because we don't really know why deep learning works. We know to some degree why it works, but we don't have an analytical answer as to which cases it's guaranteed to work in and which cases it's not. This is artificially constructed; you can say it doesn't
is statistically valid, how do we bridge this gap? I think that’s the main
point we are trying to make in this presentation. – [Participant] Thank you. – [Participant] Hey thanks
for the presentation. It was great; as a strategy researcher who looks at data like this, it resonated a lot, because there's lots of messiness out there and it's really hard to have concrete numbers in any sort of way. The question that I have is: Twitter is just about dying, from what I understand. So what if you make this large investment in this particular technology to understand Twitter and then a new technology comes out? What are we to do as researchers if we start building up these toolboxes? What do we do in terms of moving forward? – Yes, that's a very good question. I get it all the time. When you develop something based on Twitter, people ask: well, what if Twitter changed their API design? What if Twitter just died and another social media
platform came along? But one thing I wanna say
is a lot of the techniques that I was advocating that we should look into here are really generic. They focus on some common
methodology adopted during the analysis of organic data no matter where the other
organic data is from. Whether it’s Twitter,
Facebook, or just emails. Unstructured data to structured data. If you wanna do sentiment
analysis over unstructured data, if you wanna do entity resolution which is to extract a list of human names, locations, et cetera
from unstructured text then you need to
understand the implications of these algorithms. That’s no matter whether
you are using Twitter or any other data source. And in the treatment-versus-control case I was referring to, the artifact we are observing is, as we understand it today, due to the scale-free network; it's due to the power-law distribution, which again is a phenomenon you observe in many different places, not just Twitter. Certainly there are things that are very specific to Twitter. For example, there have been papers in the computer science community just to understand, when Twitter says that the streaming API gives you up to a 1% sample (as opposed to the full Firehose), whether that 1% is random; researchers found that it is a sample, it's just not a random sample. That research would be specific to Twitter and may not be generalizable
to other places, but I would say there’s
a lot more research to be done that are generalizable to other organic data sources as well. – I want to add that Twitter
may die as a platform in terms of generating those data. (speaking off microphone) But this type of data, like text data, remains: how you process text versus image data is a fundamental question and a fundamental methodological concern we have here. – [Participant] I have a question about looking beyond the measurement. So your presentation
has been focusing on the measurement issues, which we as IO psychologists and OB researchers really care about. But let's step back a little
bit, look at a big picture. In organic data you have the chance of forecasting future behaviors. Whereas, in OB and IO
psychology it’s not that easy to forecast future
behaviors of human beings. So what if your golden standard switched to forecasting the future? Would you still care that much about measurement, like the Houston or Washington, DC issues, this type of detailed measurement issue, if you were able to forecast future behaviors with a 99% correct rate? – That's a very interesting question, and I can give you a real example of that. Recently there has been a debate in the analytics community,
the web analytics community, about using Twitter or for
that matter any social media to predict elections. And people found dozens
of papers trying to say that Twitter or other
social media platforms are reliable sources for
predicting elections. It's just that none of those papers actually made any prediction; they all looked back into history and said, well, the Twitter data would be consistent with how the election turned out. And if we know anything by now, it's perhaps, I guess, if you ask me to bet, that we don't know how to predict elections. So I certainly see that as a problem, right? If you claim that what you're doing is predicting future elections but you don't make any prediction, then how do we evaluate the value of that? (mumbles) On the other hand, if we made future predictions, then … Look, take elections, for example: they don't happen very frequently. It's not as if our publication model is built so that we make a prediction and, if we are right a certain number of times, then we have a paper published. So there are cases in the
real world where prediction and verification lead to the success of a model or a line of research. For example, there's a very famous website, what was the name of it, yes, exactly, FiveThirtyEight. FiveThirtyEight is famous because the founder actually predicted elections based on many different polls: he had a weighted model of different polls, made correct predictions, and over time he gained his reputation and people came to respect his opinions. So it can happen in the real world. I'm just not sure for how many of the topics we study this can be readily translated into an operationalizable paradigm, so that we could require any kind of organic data analysis to make predictions that we can verify down the road. And then, once again, about those predictions: I think fundamentally the gap here is that the norm seems to be that a finding is either true or false, but the real world is much more complex than that, right? Even for FiveThirtyEight, when they were predicting Trump versus Clinton they were saying Trump has a 30% chance and Clinton has 70%. You observe the outcome; what do you say about that model? I guess it's inconclusive. So there is this gap in understanding: we want something, if we believe it to be true, to have perfect prediction every time, which is not going to happen. And prediction is not easy to come by. So I think it's a great
idea if prediction is used as the golden standard. How to operationalize that? I’m not sure. – So this actually, I
just want to jump in with one comment. This reminds me of thinking about the use of organic data in research: what is the role of organic data in our research space? I think multi-method may naturally be the way of the future. Some data-driven exploration can reveal certain correlation patterns, followed by solid design; design data collection through experiments or surveys may provide additional insights. Or, when we have secondary large datasets such as census data, we can map them onto some organic dataset to reveal more observational patterns, while meanwhile getting some solidly designed instrument to reveal other parts. So my view, which may be different from yours, is about how we create a certain kind of hybrid approach, with mixed data and mixed methods. – [Participant] So I
just wanted to highlight a connection with a session yesterday. Yesterday we learned in the first session that p-hacking is a big problem and that the main solution to p-hacking was to pre-register studies. Well, you have shown that pre-registering is not a practical solution, so we are back again with a problem, yeah? (chuckling) So what are your thoughts? What would you suggest? – I wouldn't say pre-registration is not an effective solution. I think pre-registration is very effective for ensuring reproducibility. The point we were trying to
make is just that reproducibility is actually easy to ensure in organic data analysis; the real challenge is robustness. And pre-registration could help with robustness. If you pre-register the algorithm you use, before you actually start collecting Twitter data, then at least you eliminate the chance of fishing for the right heuristic algorithm down the road. But on the other hand
there are many other cases where it’s not gonna be effective. – [Participant] But the chance
that you’ve pre-registered the exact parameters that will be helpful to uncover any patterns
it’s very very low. – Sure, yeah exactly. So that’s that’s why I think it’s helpful to a certain degree. It’s not gonna solve the entire problem, because you can’t pre-register
every single detail. So I would like to see pre-registration as kind of reducing the degree of freedom in the experiment design in some sense. So it’s just in design experiments in a lot of cases the degree of freedom is not as high as in
organic data analysis, because you have these tons of parameters, tons of algorithms to choose from. But the lower of the degree
of freedom the better. So it’s gonna help. It’s just not gonna
solve the whole problem and we need other kind of research to go into looking at how to solve the problem like the automated charting tools. – [Participant] No, no. – [Participant] Well I wanted to go back for a second to prediction. For organic data and
prediction, there is great importance from a multi-heuristic point of view. And what I've noticed, and I would like to know your opinion about this, is that, paradoxically, given that organic data has these extreme-value distributions, scale-free structures and different types of networks, it is somehow easier, using the modeling techniques we have, to predict person-centric or individual trajectories than population trajectories. There have been huge failures to predict, for instance, flu incidence, to predict epidemics, because of topic virality, because of automated polls and everything. However, this is very important. We need more modeling techniques, because standard surveillance, population-based estimation and individual-based estimation cannot go the same way and use the same modeling techniques. With organic data you can't use the same statistical techniques, because you have extreme value theory; you need new modeling techniques. What do you think is the state now, the state of the art, for
population-based predictions? – Well I think it’s really
a topic-specific question in that that’s kind of
the 5% I was referring to. When you have an algorithm
that works for 95% of the time it doesn’t mean statistical
conclusions you draw from the results, from the
outputs of the algorithm, are gonna be valid simply
because you don’t know where that 5% is gonna end up to
and what’s the implication of that 5% on the validity
of your final result. Like you mentioned
population-wide prediction. We have to understand what
we are trying to predict and whether that prediction
depends on some extreme values or it depends on some reliable statistics over the entire population. If it actually depends on,
let’s say, some extreme values; like one person traveling
from one state to another would actually significantly
affect the final result, then I’m not sure if we could
have, even theoretically, a prediction model at all. So I’m not sure what exactly is the best way forward for that problem. I just feel that we need to look at that at a case-by-case basis
to understand what we can and cannot do in terms of
predictions and analysis. And that’s actually the point
we are trying to make here is before we actually do something we need to understand the achievability of it and the implications of it. – [Participant] Yeah, so
nice, thank you very much. This is a very good perspective and just to facilitate conversation so what you’re encountering here in our field is construct validation. I really think it is construct validation. And then for anything we
do for construct validation oftentimes we need a golden standard. I think the issue you’re encountering is, in the organic data
situation you don’t really know what the golden standard is, or it’s very hard to
measure the golden standard. Can you comment on that? – Do you wanna? – Well, I don’t know
whether we are talking about oranges or apples in terms of construct validation. I dunno; I think you guys are the experts on how to view that. For unstructured datasets, if we want to do operationalization, how are we going to do it, and what kind of test shall we establish in terms of establishing internal and external validity? – [Participant] So one
thing I would think: for example, earlier you mentioned that you have census data, right? You can treat census data as if it were a golden standard, right? Then you can look at whether you can actually use a certain organic data measure to represent the census data. So I would say, okay, that gives you some way of validating what your organic data is measuring. So Steve, you wanna? – That's a nice idea in terms– – [Participant] What? – Of using those design data as a baseline. – [Participant] That goes
back to my opening question about more targeted data. And more specifically, I’m
doing some research for NASA so these are very targeted datasets where people wear sensors,
interaction sensors, and over long periods of
time, small group of people, six people in a very
tightly confined space and they fill out
questionnaires every single day. So we have a gold standard on a number of different criterion
measures that we can look at. And a very simple example would be: Okay, can I predict how you’re
feeling about being a member of this group routine just
based on how much time you spend in close proximity
to anybody else on the team? And yeah, I can account
for that statistically. And then I could think
about projecting forward and saying, well I could
observe a given team and I start to see these patterns, these social network patterns emerge, maybe I could predict
if two people interact that one of those people is
gonna have a bad day tomorrow because they don’t like that person. So I think that does go to
the construct validity issue, because in a way we’re trying to take more constrained organic
data so we don’t have some of the problems that you were
talking about with Twitter, to see if we can use it as a proxy or an alternative to
asking people questions. I think that’s a very narrow view, but hopefully we could think about ways in which you could blow that
up to a broader scale and use these multiple methods, because we also have journaling, they text, so we can look at sentiment. So there's a variety of methods
that are being examined. – Yeah I see, in terms of
golden standard, so there are two things that I’ve seen
are different in this talk. One is about the reliability
of the data you used. The other is about the
reliability of the method or algorithm that you use. So I guess the golden
standard applies to both. For the first one, in
terms of the data you use, I guess it really depends on the discipline you are talking about. In psychology, they say, a lot of experiments are run with a small number of human subjects, so whether you consider the representativeness of that sample a golden standard has implications for what you consider a golden standard in, let's say, Twitter data. Similarly, if you're
doing sociology research then the bar is different. So I think what is golden
standard is really subject to the specific topic being studied. But one thing that I think is common is the second part, the golden standard for the algorithm, and I think it's not about finding a perfect algorithm that could read that stop sign with 100% accuracy; you're not going to have that algorithm. It's about how exactly we establish a workflow so we can test the validity of that algorithm, understand how it works in the actual cases, and understand whether that actual-case performance has any implication for the kind of black-or-white answer we are trying to provide at the end. Can we bound it, for example? Can we say that even if this algorithm performs poorly in these cases, we guarantee its impact on the final statistic we're interested in is not going to exceed this particular number? That is the kind of question that I think should be answered. – And I also, this reminded me, some of the research
trends I observe in terms of this operationalization and construct validity. I don't have smart ideas, but I have noticed that more and more people use algorithms to generate a measurement and then follow up with a large-scale crowdsourcing platform, using a huge number of human beings within a very short amount of time to confirm it. So the algorithm says this image is an animal, and then on the back end you can integrate with a crowdsourcing platform, utilizing millions of people to quickly confirm whether this image is a human being or an animal. I see this simplified way of establishing validity using crowd workers and crowdsourcing mechanisms nowadays, but I don't think that methodology can apply to everything; it's such a simplified approach. So that's my comment. – [Participant] Yeah
sorry, sorry, I'm standing between you and lunch. Well, no, I think that's a very good point, right. The crowdsourcing, yeah, I know a lot of people do that. But the issue is that oftentimes it's based on human perception; it's really a perceptual kind of validation. But whenever we're making inferences about, like, a psychological process, or even just whether the word means the behavior it's supposed to mean, well, that becomes an issue; I actually don't think crowdsourcing would be a good way to answer that. But then, to your nice point about the golden criterion for the algorithm: I think that actually links to the internal validity issue in our field, which is, well, your algorithm represents something for you to make an inference from. Okay, so how can you guarantee that inference is what you say it would be? I think that's the way I would approach it. Algorithm-wise, I think you can do some cross-validation for that. You can probably, well, the golden standard there is a little bit murkier, I would say. So that is where you need a good theory, because when you don't have a good theory it's very hard to construct a test for this kind of golden standard thing. So that's my two cents. But this is good, awesome, thank you. (audience clapping)
