How I Created a 40,000 Labeled Audio Dataset in 4 Hours of Work and $500

I’ve been building deep learning products for a few years now, and one of the requirements I set for clients before I start a new project is that they must already have a dataset. It has always been a requirement because it ensures I don’t have to do any dataset creation work, which I don’t like much. It usually takes a while and can become very tedious, especially if there is no publicly available data source. Although to be fair, many datasets are available and more show up every day, and you can now search for them using Google’s dataset search tool.

Well, a few months ago I was approached by a reader’s company that wanted help building a product, but they did not have a dataset, and this time I decided to go for it. I figured it would be an excellent time to explore the process and build my skills.

This is when I discovered the power of Amazon Mechanical Turk for deep learning engineers. After 4 hours of coding and legwork, I had published my first HIT. 40 hours later I had a dataset containing 40k labeled images, and my wallet was only 500 dollars lighter. I felt like I had just used a cheat code for work. I want to share some things I learned using MTurk and hope it helps you build custom deep learning datasets for your projects going forward.

Trying to build a custom dataset? Not sure where to start? Join me for a 30-minute chat to talk about your project. I’ll help you see the problem more clearly and help cut through the noise. Sign up for a time slot

What is Mechanical Turk

Mechanical Turk, or MTurk, is a crowdsourcing marketplace where you can publish and coordinate a wide range of Human Intelligence Tasks (HITs). A task can be classification, tagging, a survey, transcription, data collection, or anything else you can build into a web page. Workers browse the available tasks, choose the ones they want, and earn a small reward for each one they complete.

The platform provides many useful tools to create your job and make it easy for the workers to complete your task successfully. Assuming you want to create a dataset of 1000 labeled images, you could pay $0.25 per 10 images classified and get your full dataset for 25 USD. The best part is that you could have the dataset in as little as a few hours.
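The arithmetic behind that estimate can be sketched in a few lines of Python. The function name `labeling_cost` is hypothetical, and the figure covers worker rewards only; it assumes one worker per HIT and leaves out any platform fees.

```python
import math
from typing import Tuple


def labeling_cost(num_items: int, items_per_hit: int, reward_per_hit: float) -> Tuple[int, float]:
    """Return (number of HITs, total reward paid to workers in USD).

    Assumes a single assignment per HIT and ignores platform fees.
    """
    # Round up so a partially filled last HIT is still counted.
    num_hits = math.ceil(num_items / items_per_hit)
    return num_hits, round(num_hits * reward_per_hit, 2)


# The example from the text: 1000 images, 10 per HIT, $0.25 per HIT.
hits, cost = labeling_cost(1000, 10, 0.25)
print(hits, cost)  # 100 25.0
```

If you ask several workers to label the same images, multiply the total by the number of assignments per HIT.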

Getting started with a real-world example

1. Choose the type of HIT

MTurk has a variety of HIT templates to choose from that make it easy to start. They range from survey templates to transcription of audio and video. Take some time to go over the different projects to get a sense of what you can do.

If you don’t see a project template that fits, you can always use the general survey link template and have it link to a custom web page, which lets the workers do almost any task. I have been very successful in using this method to collect non-standard training data.

2. Set the properties of the HIT

HIT properties are super important in making your data collection successful. HIT properties include mundane things like Title and description but also have lots of essential settings like the Number of assignments per HIT and reward per HIT.

Reward per HIT is how much you are going to pay the worker to complete your task. To calculate a fair pay rate for the HIT, take the time to do it yourself and time it. See how long it takes you to complete the task, then work out the equivalent pay by prorating an hourly minimum wage to that amount of time.

I’ve used this method countless times to get a starting fee for all of my tasks and the quality of results has always been high. Just keep in mind that the harder the work, the more you should pay to keep up quality. I’ve gone as high as $2.00 for a job that takes 10 minutes.
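As a rough sketch of that calculation, the snippet below prorates an hourly wage to the time the task took and rounds up to the next cent. The function name `fair_reward` is hypothetical, and the $7.25 hourly figure is only an illustrative assumption; substitute whatever wage floor you consider fair.

```python
def fair_reward(seconds_to_complete: int, hourly_wage: float) -> float:
    """Prorate an hourly wage to the task duration, rounded up to the cent."""
    # Work in integer cents to avoid floating-point rounding surprises.
    wage_cents = round(hourly_wage * 100)
    # Integer ceiling division: never round the worker's pay down.
    reward_cents = (wage_cents * seconds_to_complete + 3599) // 3600
    return reward_cents / 100


# A task I timed at 2 minutes, with an assumed $7.25 hourly wage.
print(fair_reward(120, 7.25))  # 0.25
```

This gives a starting point only. For harder tasks, raise the result rather than treating it as the final price.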

Time allotted per assignment is also a critical thing to consider. I found this out the hard way: I had assumed it was just a number that workers look at to see if they have time to complete the task. In fact, it is a hard limit that starts counting as soon as they accept the work. This caused a few workers to run out of time mid-HIT, and they were not able to collect their reward!

Some other parameters that are useful:

  • HIT expiration
  • Auto-approve and pay workers
  • MTurk Masters
  • HIT approval rate
  • Number of HITs approved
  • Location Specific workers
  • Specific Language skill requirements
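If you post HITs through the API rather than the web interface, most of these properties map onto request parameters. The sketch below only builds the settings dictionary and makes no network call. The key names follow my reading of the `create_hit` call in boto3's MTurk client, and the qualification type IDs are the system IDs as I recall them, so check both against the current documentation before relying on them. The function name `build_hit_settings` and its arguments are hypothetical, and Masters and language requirements are left out.

```python
def build_hit_settings(title, description, reward_usd, assignments,
                       minutes_allotted, days_available, auto_approve_days,
                       min_approval_pct, min_hits_approved, country):
    """Collect HIT properties into a dict shaped like create_hit arguments."""
    return {
        "Title": title,
        "Description": description,
        "Reward": f"{reward_usd:.2f}",            # sent as a string, in USD
        "MaxAssignments": assignments,            # workers per HIT
        "AssignmentDurationInSeconds": minutes_allotted * 60,   # hard time limit
        "LifetimeInSeconds": days_available * 86400,            # HIT expiration
        "AutoApprovalDelayInSeconds": auto_approve_days * 86400,
        "QualificationRequirements": [
            {   # HIT approval rate (ID assumed, verify in the docs)
                "QualificationTypeId": "000000000000000000L0",
                "Comparator": "GreaterThanOrEqualTo",
                "IntegerValues": [min_approval_pct],
            },
            {   # Number of HITs approved (ID assumed, verify in the docs)
                "QualificationTypeId": "00000000000000000040",
                "Comparator": "GreaterThanOrEqualTo",
                "IntegerValues": [min_hits_approved],
            },
            {   # Worker location (ID assumed, verify in the docs)
                "QualificationTypeId": "00000000000000000071",
                "Comparator": "EqualTo",
                "LocaleValues": [{"Country": country}],
            },
        ],
    }


settings = build_hit_settings(
    title="Label 10 images", description="Pick the best label for each image.",
    reward_usd=0.25, assignments=3, minutes_allotted=30, days_available=7,
    auto_approve_days=3, min_approval_pct=95, min_hits_approved=1000, country="US",
)
print(settings["Reward"], settings["AssignmentDurationInSeconds"])  # 0.25 1800
```

In a real script the dictionary would be passed, together with the question markup for your template, to something like `client.create_hit(**settings, Question=...)`. Note how generous the time limit is compared with the expected task length, for the reason described above.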

3. Customize your HIT HTML template

With your parameters set, you can create or change the HTML design that all workers will see when they are deciding whether or not to take your job. MTurk provides you with a very user-friendly HTML editor to help you out. If you want a very customized template, you can also edit the HTML source directly.

Best Practices

Take your time to create instructions for the task, because they can mean the difference between good data and bad data. This is the most critical component of every HIT. Keep your instructions straightforward and clear to avoid worker errors.

Make clear evaluation policies for your project

You should be evaluating the results of your HITs manually. When you make the evaluation policies clear, workers who are unable to meet them have a chance to decline the HIT instead of completing it poorly, which would waste your time and theirs.

Include gold standard items

Gold standard items are those that you already know the answer to. When you include these items in your HIT, you have a way to automatically weed out poor-quality work at evaluation time.
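Here is a minimal sketch of how that check could work, assuming the results have been flattened into `(worker_id, item_id, label)` tuples. The function names, the data layout, and the 0.8 threshold are all hypothetical choices for illustration, not something MTurk prescribes.

```python
from collections import defaultdict


def gold_accuracy(submissions, gold):
    """Return each worker's accuracy on the gold standard items only.

    submissions: iterable of (worker_id, item_id, label) tuples.
    gold: dict mapping item_id to the known correct label.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker_id, item_id, label in submissions:
        if item_id in gold:  # ignore items with no known answer
            seen[worker_id] += 1
            correct[worker_id] += int(label == gold[item_id])
    return {worker: correct[worker] / seen[worker] for worker in seen}


def flag_workers(accuracy, threshold=0.8):
    """Return the workers whose gold accuracy falls below the threshold."""
    return sorted(w for w, acc in accuracy.items() if acc < threshold)


gold = {"img_1": "cat", "img_2": "dog"}
submissions = [
    ("W1", "img_1", "cat"), ("W1", "img_2", "dog"), ("W1", "img_9", "cat"),
    ("W2", "img_1", "dog"), ("W2", "img_2", "dog"), ("W2", "img_9", "dog"),
]
print(flag_workers(gold_accuracy(submissions, gold)))  # ['W2']
```

With only a handful of gold items per worker the accuracy estimate is noisy, so it is worth looking at flagged work by hand before rejecting anything.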

Evaluate the results you get

Always evaluate the results you get. You can choose to do this automatically or manually, but this is a crucial step. Failure to check the work of workers invites poor quality and outright fraud.

I had a case recently where a worker was submitting results generated by an automated system. Luckily, my gold standard items and the time I spent evaluating results alerted me to this bad actor, and I was able to block their account from submitting any more work on my HITs.

Final Thoughts

Mechanical Turk is a fantastic tool for the deep learning engineer and team. It lets you build datasets that previously only large companies could assemble, at a fraction of the time and cost. Its API also allows you to automate large portions of the work that’s left. I’ve been able to fully automate both job posting and evaluation.
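To give a flavor of what automated evaluation can look like, here is a hedged sketch and not my exact pipeline. It assumes `client` is an object exposing `approve_assignment` and `reject_assignment` methods in the style of boto3's MTurk client, that each assignment is a dict with `AssignmentId` and `WorkerId` keys, and that `accuracy` maps worker IDs to a quality score you computed yourself. The function name, the threshold, and the feedback message are hypothetical.

```python
def review_assignments(client, assignments, accuracy, threshold=0.8):
    """Approve or reject each assignment based on the worker's quality score.

    Workers with no score are left for manual review instead of being
    approved or rejected automatically.
    """
    decisions = {}
    for assignment in assignments:
        assignment_id = assignment["AssignmentId"]
        score = accuracy.get(assignment["WorkerId"])
        if score is None:
            decisions[assignment_id] = "manual"
        elif score >= threshold:
            client.approve_assignment(AssignmentId=assignment_id)
            decisions[assignment_id] = "approved"
        else:
            client.reject_assignment(
                AssignmentId=assignment_id,
                RequesterFeedback="Answers on known check items did not match.",
            )
            decisions[assignment_id] = "rejected"
    return decisions
```

Passing the client in as an argument keeps the function easy to dry-run with a stand-in object before pointing it at real workers, which matters because a wrongful rejection affects a worker's approval rate.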

Beyond the obvious cost and time benefits, there is the fantastic community of workers. They come from all walks of life and can surprise you with how far they are willing to go to complete a 10-minute task to your specification. I recently developed a dataset that I’m using to build an elderly monitoring and alerting system, and I received dozens of emails thanking me for developing this system, plus a few emails letting me know how my HIT could be improved to help the workers complete it.

Thanks for reading 🙂 If you enjoyed it, drop me a comment and share the article with anyone you think needs it. Let’s also connect on Twitter or LinkedIn, or follow me on Medium.