Thoughts on computation, social science, and lifehacking from an up-and-coming data scientist.
Friday, March 30, 2012
Here's a flowchart for resolving disagreements. I put it together after a disagreement that went badly -- I wanted to think through how we could have avoided the problem. It was also a chance to write haikus.
What do you think? How would you extend or revise this picture?
Monday, March 26, 2012
Why mturkers should have faces
Over the last week or so, I've written about several pain points for requesters using Amazon's mechanical turk. Most of them come down to issues of trust, learning, and communication -- very human problems.
I speculate that the problem is one of philosophy. The design and advertising of the system both suggest that the designers think of turkers as interchangeable computation nodes -- they certainly treat them that way. The market was not designed to take into account differences in skills among workers, strategizing by workers, or relationships between workers and requesters. It was built --- intentionally --- to be faceless*.
Essentially, Amazon intended mturk to be like the EC2 spot instance market for human piece work: cheap computation on demand. An API to human intelligence. It's a clever, groundbreaking idea, but the analogy only holds to a point. Unlike instances of cloud machines, turkers are human beings. They have different skills and interests. They learn as they go along. They sometimes cheat if you let them. Treating them as faceless automata only works for the simplest of tasks.
A better, more human approach to crowdsourcing would acknowledge that people are different. It would take seriously the problems of motivation, ability, and trust that arise in work relationships. Providing tools for training, evaluation, and communication---plus fixing a few spots where the market is broken---would be a good start.
Let me finish by waving the entrepreneurial banner. I'm convinced there's a huge opportunity here, and I'm not alone. Mturk is version 1.0 of online crowdsourcing, but it would be silly to argue that crowdsourcing will never get better**. What's next? Do you want to work on it together?
* There's a whole ecosystem of companies and tools around mturk (e.g. Crowdflower, Houdini). I haven't explored this space thoroughly, but my read is that they're pretty happy with the way mturk is run. They like the facelessness of it. Even Panos Ipeirotis, whose work I like, seems to have missed a lot of these points -- he focuses on things like scale, API simplicity, and accuracy. Maybe I'm missing out on the bright new stars, though. Do you know of teams that have done more of the humanizing that I'm arguing for?
** Circa 530,000,000 BC: "Behold, the lungfish! The last word in land-going animal life!"
Labels:
communication,
crowd-sourcing,
market-design,
mturk,
pain-points,
trust
Thursday, March 22, 2012
Pain points in mturk
I posted a couple days ago on skimming and cherry-picking on mturk. Today I want to add to my list of pain points. These are things that I've consistently faced as I've looked at ways to integrate mturk into my workflow. Amazon, if you want even more of my research budget, please do the following. Competitors, if you can do these things, you can probably give Amazon a run for its money.
Here's my list:
1. Provide tools for training turkers.
Right now, HITs can only cover very simple tasks, because there's no good way to train turkers to do anything more complicated. There should be a way for requesters to train (e.g. with a website or video) and evaluate turkers before presenting them with tasks. It's not really fair to impose the up-front cost of training on either the turkers or the requester alone, so maybe Amazon could allow requesters to pay turkers for training time, but hold the money in escrow until turkers successfully complete X HITs. (The first sketch after this list shows roughly what that escrow rule would look like.)
2. Make it easy to communicate with turkers.
This suggestion goes hand-in-hand with the previous one. Right now it's very difficult to communicate with turkers. I understand that one of the attractions of the site is the low-maintenance relationship between requesters and turkers. But sometimes it would be nice to clear that barrier, in order to clarify a task, give constructive feedback, or maybe even -- call me crazy -- say "thank you" to the people who help you get your work done. It's possible now, but difficult. (Turkers consistently complain about this lack as well.)
3. Make it easy to accept results based on comparisons.
Monitoring HIT quality is a pain, but it's absolutely necessary, because a handful of turkers do cheat consistently. Some of them even have good acceptance ratings. I often get one or two HITs with very bad responses at the beginning of a batch. I suspect that these are cheaters testing to see if I'm going to accept their HITs without looking. In that case, they'd have a green light to pour lots and lots of junk responses into the task with little risk to their ratings.
As long as it's easy to get away with this approach, cheaters will continue to thrive. "Percentage of accepted tasks" is a useless metric when requesters don't actually screen tasks before accepting them. What you want is the percentage of tasks that were screened AND accepted. Some basic, built-in tools for assessing accuracy and reliability would make that possible, effectively purging the market of cheaters. (The second sketch after this list shows the difference between the two metrics.)
4. Provide a way for small batches to get more visibility.
One of my main reasons for going to mturk is quick turnaround. In my experience, getting results quickly depends on two things: price and visibility. Price is easy to control. I have no complaints there. But visibility depends largely on getting to the top of one of mturk's listing pages, especially "most HITs" or "most recent". If you have 5,000 HITs, your task ends up on the front page and attracts a lot of workers. But attracting attention to smaller batches is harder*. Mturk should provide a way to queue small batches and ensure that they get their fair share of views**.
5. Prevent skimming and cherry-picking.
I've written about this in my last post. Suffice it to say that mturk's system currently rewards turkers for skimming through batches of HITs to cherry-pick the easy ones. This is not fair to other workers, wastes time overall, wreaks havoc on most approaches for determining accuracy, and ruins the validity of some kinds of data. I can't blame turkers for being smart and strategic about the way they approach the site, but I can blame Amazon for making counterproductive behavior so easy. Add a "Turkers can only accept HITs in the order they're presented" flag to each batch and the problem would be solved!
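Two quick sketches of the ideas above, in Python. Neither touches a real mturk API; the data shapes, thresholds, and function names are all invented, just to illustrate the logic. First, the escrow rule from point 1: pay for training time, but release the money only after a worker completes some number of real HITs.

```python
# Hypothetical bookkeeping for "pay for training, release the fee later".
# Nothing here calls a real mturk API; it only illustrates the escrow rule.

TRAINING_FEE = 0.50       # dollars paid for finishing the training module (made up)
RELEASE_THRESHOLD = 20    # HITs a worker must complete before the fee is released

escrow = {}               # worker_id -> amount held back
completed = {}            # worker_id -> HITs completed since training

def pay_bonus(worker_id, amount):
    """Stand-in for whatever bonus-payment call your client library provides."""
    print(f"paid {worker_id} a bonus of ${amount:.2f}")

def finished_training(worker_id):
    """Hold the training fee in escrow instead of paying it out immediately."""
    escrow[worker_id] = TRAINING_FEE
    completed[worker_id] = 0

def completed_hit(worker_id):
    """Count completed HITs and release the fee once the threshold is met."""
    completed[worker_id] = completed.get(worker_id, 0) + 1
    if completed[worker_id] >= RELEASE_THRESHOLD and escrow.get(worker_id, 0) > 0:
        pay_bonus(worker_id, escrow.pop(worker_id))
```

Second, the metric from point 3: the naive acceptance rate versus the rate among tasks that were actually screened. The field names below are made up, not part of any real mturk export.

```python
def acceptance_rates(assignments):
    """Compare the naive acceptance rate to the screened-and-accepted rate.

    Each assignment is a dict with boolean 'accepted' and 'screened' flags
    (illustrative field names, not a real mturk results format).
    """
    naive = sum(a["accepted"] for a in assignments) / len(assignments)
    screened = [a for a in assignments if a["screened"]]
    meaningful = (sum(a["accepted"] for a in screened) / len(screened)
                  if screened else float("nan"))
    return naive, meaningful

# A cheater can look great on the naive metric if most requesters rubber-stamp:
log = ([{"accepted": True,  "screened": False}] * 90 +   # auto-approved junk
       [{"accepted": False, "screened": True}] * 8 +      # screened and rejected
       [{"accepted": True,  "screened": True}] * 2)       # screened and accepted
print(acceptance_rates(log))   # (0.92, 0.2)
```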
Looking back over this list, I realize that it's become a kind of Freakonomics*** for crowdsourcing. There are a lot of subtle ways that a crowdsourcing market can fail, and devious people have discovered many of them. In the case of mturk, it's a market in a bottle, so you'd think we could do some smart market design and make the whole system more useful and fair for everyone.
* Right now, one strategy is to dole out the HITs one at a time, so that each one will constantly be at the top of the "most recent" page. But this makes it hard for turkers to get in a groove. It also requires infrastructure -- a server programmed to submit HITs one by one. Most importantly, it essentially amounts to a spam strategy, with all requesters trying to attract attention by being loud and obnoxious. You can't build an effective market around that approach.
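For the curious, that drip-feed workaround is just a loop with a delay -- which is part of why it scales so badly as a market norm. A minimal sketch, where create_hit stands in for whatever HIT-creation call your mturk client library actually provides:

```python
import time

def create_hit(task):
    """Stand-in for the real HIT-creation call in your mturk client library."""
    print(f"posted HIT for task {task!r}")

def drip_feed(tasks, delay_seconds=60):
    """Post HITs one at a time so each briefly tops the 'most recent' listing.

    This is the spammy workaround described above, not a recommendation:
    if every requester does this, nobody gains any visibility.
    """
    for task in tasks:
        create_hit(task)
        time.sleep(delay_seconds)
```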
** Sites like CrowdFlower are trying to address this need. I haven't used them much -- relying more on homegrown solutions -- so maybe this is a concern that's already been addressed.
*** The original Freakonomics, about evidence of cheating in various markets, before the authors turned it into a franchise and let popularity run ahead of their evidence.
Monday, March 19, 2012
Market failure in mechanical turk: Skimming and cherry-picking
This is the first in a series of three posts -- a trilogy! -- about pain points on Amazon's mechanical turk, from a requester's perspective.
I'm a frequent user of mturk. I like the service, and spend a large fraction of my research budget there. That means I also feel its limitations pretty acutely. Today I want to write about a problem I've noticed on mturk: skimming and cherry-picking. (A few weeks ago, I complained about Ubuntu. Why is it that we only hurt the computing systems we love?)
Here's the problem: even within a batch, not all HITs are equally difficult. I've discovered that some workers (smart ones) will skim quickly through a batch and cherry-pick the easy HITs. For instance, given a list of blog posts to read and evaluate, some turkers will skip the long ones and only code the short ones.
Individually, skimming makes perfect sense. If you do, you can certainly make more dollars per hour. As a bonus, you might even get a higher acceptance rate on your HITs, because short HITs lend themselves to unambiguous evaluation. The system rewards strategic skimming.
But from a social perspective, skimming is counterproductive. It wastes time overall, because time spent skimming is time not spent completing tasks*. It's not really fair to other workers. It wreaks havoc on many approaches for determining accuracy. (As a requester, I've experienced this personally.) From a scientific standpoint, it can also ruin the validity of some kinds of data collection.
I first ran into clear evidence of skimming over a year ago. At first, I didn't want to say anything about it, because I didn't want to give anyone ideas. At this point, I see it all the time. One easy-to-observe bit of evidence: the hourly rate on most HITs will start high, and fall over time**. This is because skimmers grab the quick, easy tasks first, leaving slower tasks for later workers.
I can't really blame turkers for approaching their work in a clever way. Instead, I lay the blame on Amazon, for making counterproductive behavior so easy.
It's especially galling because it would be very easy to fix the problem. On the HIT design page, Amazon should add a "Turkers can only accept HITs in the order they're presented" flag to each batch. For tasks with this flag checked, turkers would be shown one HIT at a time. They'd be unable to view or accept others in the batch until they'd completed the HIT in front of them***. This would effectively deny turkers control over which HITs they do within a batch****. It would end the party for skimmers, but make the market more efficient overall. A simple tweak to the market -- problem solved.
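To make the mechanism concrete, here's a toy version of what that flag implies on the platform side: the system, not the worker, chooses the next HIT. This is purely illustrative -- no real mturk internals, and record_response is a placeholder.

```python
from collections import deque

def record_response(hit, worker_id, response):
    """Placeholder for storing a submitted answer wherever answers live."""
    print(f"{worker_id} answered {hit!r}: {response!r}")

class SequentialBatch:
    """Toy dispatcher for a 'no skimming' batch: HITs are handed out in a fixed
    order, and a worker can't take another until the current one is submitted."""

    def __init__(self, hits):
        self.queue = deque(hits)   # order set by the requester (or randomized)
        self.in_progress = {}      # worker_id -> HIT currently assigned

    def next_hit(self, worker_id):
        if worker_id in self.in_progress:
            raise ValueError("finish the current HIT before requesting another")
        if not self.queue:
            return None            # batch exhausted
        hit = self.queue.popleft()
        self.in_progress[worker_id] = hit
        return hit

    def submit(self, worker_id, response):
        hit = self.in_progress.pop(worker_id)
        record_response(hit, worker_id, response)
```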
How about it, Amazon?
* You can think about the social deadweight loss from skimming like this:
Let T be the total time all workers spend completing HITs. Skimming doesn't change T -- the total amount of task work is constant. But skimming itself is time-consuming. Let S be the total time workers spend skimming a given batch; that time is the deadweight loss. Like T, the total wage for a given batch is also constant. Call it W.
In aggregate, the effective hourly wage for the whole batch without skimming is W/T. With any amount of skimming it is always less: W/(T+S). So although skimming may improve the hourly wage of the most aggressive cherry-pickers, it always lowers the average hourly wage of the mturk market as a whole.
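A quick sketch with invented numbers, just to put magnitudes on those formulas:

```python
W = 100.0   # total payout for the batch, in dollars (made-up figure)
T = 50.0    # hours of actual task work, unchanged by skimming (made-up figure)
S = 5.0     # hours spent skimming the batch for easy HITs (made-up figure)

print(W / T)        # 2.00 dollars/hour for the batch with no skimming
print(W / (T + S))  # about 1.82 dollars/hour once skimming time is counted
```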
** Yes, yes -- I know that this is not an acid test: there are other explanations for hourly rates that decline over the life of a task. Still, it's good corroborating evidence for an explanation that makes a lot of sense to begin with.
*** Only viewing one HIT at a time might make it harder for turkers to get a sense of what a given batch is like. There's a simple fix for this as well: allow turkers to see the next k tasks, where k is a small number chosen by the requester. This might make it harder to build a RESTful interface for turkers, though. I haven't thought it through in detail.
**** It's possible that requesters would abuse this power by doing a bait-and-switch: showing easy HITs first and then making them more difficult once workers have invested in learning the task. This seems like a minor concern---if the tasks get tough or boring, turkers can always vote with their feet. But if we're worried about it, there's an easy fix here as well: take control of the HIT sequence away from requesters, just like we took it away from workers. It would be very easy to randomize the order of tasks when the "no skimming" box is checked. Or allow requesters to click a separate "randomize tasks" box, with Amazon acting as credible intermediary for the transaction.