The challenges of crowdsourcing NLP data
Crowd quality
Quality gateways
• Language tests
• Job-specific tests
• Real-time audits
• Built-in language/spam validators
Controlled crowd
• Referral system
• System of tokens
• Legal/privacy compliance (under NDA)
Data quality
Machine learning
• Checking for suspicious crowd behavior (multiple account creation, peaks of activity, job-specific spam, IP check against country of residence)
Data quality control
• Validation steps
• Inter-annotator agreement
• Precision and recall metrics
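To make the last two data quality controls concrete, here is a minimal Python sketch of inter-annotator agreement (Cohen's kappa for two annotators) and of precision and recall measured against a gold label set. The function names and sample labels are hypothetical and exist only for illustration; the slide does not say which agreement statistic or tooling DefinedCrowd actually uses.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Share of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


def precision_recall(predicted, gold, positive):
    """Precision and recall of `predicted` against `gold` for one class."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == positive and g == positive for p, g in pairs)
    fp = sum(p == positive and g != positive for p, g in pairs)
    fn = sum(p != positive and g == positive for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical sentiment labels from two crowd workers on six items.
    worker_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
    worker_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
    print("kappa:", cohen_kappa(worker_1, worker_2))
    # Treat worker_1 as the gold standard and score worker_2 against it.
    print("precision, recall:", precision_recall(worker_2, worker_1, "pos"))
```

Kappa corrects raw agreement for the agreement two annotators would reach by chance, which is why it is usually preferred over a plain percentage when label distributions are skewed. With more than two annotators, a statistic such as Fleiss' kappa or Krippendorff's alpha would be the natural replacement.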
Hi, my name is Daniela Braga and I am the founder and CEO of DefinedCrowd.
Some say that entrepreneurship is like jumping out of a plane without a parachute and building one on the way down.
Many people ask me why I left my comfortable corporate job to start this company and spend the last two years living a life of excitement, but also of hardship and uncertainty. I usually say that it was for three reasons.
I have 17 years of experience in NLP and HCI, or what is now called AI. Over the last 7 years I moved into data science roles, where I was responsible for collecting and structuring the data that scientists used to train speech and language models. At one point I was working with 50 languages in parallel. And regardless of the method I used, in-house or through vendors, the challenge was always getting high-quality, consistent data.
Each day, humans create 3 stacks of Empire State Buildings of data.
90% of that data is unstructured, but machines need structured data to learn.
Now, only a few data science companies are looking at the data problem at an enterprise level.
We are solving the data problem.
When I started my career, we would build many of the components of dialogue systems using systems of rules. During my PhD, I built a TTS system for Portuguese that was mainly rule-based.
But with the recent advances in data science, something changed in the way we teach machines. They don’t learn from a few rules anymore.
They need machine learning models and LOTS and LOTS of training data.
Introducing DefinedCrowd. Our platform allows data scientists to collect, enrich and structure high-quality training data at scale.
We do so by combining tools, humans in the loop and machine learning models into specialized AI workflows.
Anna comes to the DefinedCrowd platform and goes to the “Build” tab. Here she picks the “multimodal data” workflow.
Then a workflow assistant is displayed, with numbered steps that light up as she configures the settings.
First, she will configure the crowd settings, such as language, gender, age and country of residence.
Then the recording setup: at home, in a quiet environment, with microphones placed at a distance of 40 inches.
And she will upload the scenario instructions.
Next, she will configure the video settings and disable the audio and the other signals this time.
Because she wants to understand how emotion correlates with vital signs, she will pick heart and respiratory rates.
Everything looks good and the calculator tells her what it’s going to cost, so she will push the submit button.
The campaign is now finished and she’s going to check the results of her collection.
Here she can see things like how many hours have been completed, crowd demographics, timeline, quality, etc.
Finally, Anna will send this data for validation and annotation, which are different workflows that can be found in the Build tab that we’ve seen before.
To review: DefinedCrowd allowed Anna to extend her data science capacity by giving her a shopping tool for sophisticated and meaningful data sets.
Existing SaaS platforms, like CrowdFlower and MTurk, can’t maintain quality (they are 50% less accurate) and require 6 months of iterations to perfect a process.
Traditional professional services, like Appen or iSoftStone, deliver higher quality but give customers no control over the process.
DefinedCrowd brings the best of both worlds: the high quality and reach of the professional services combined with the scale and control of the SaaS platforms.
We’re a Seattle-based company with R&D in Portugal. We’ve been partnering with and serving the biggest players in AI, mainly Fortune 500 companies, in the US, Japan and Europe.
We’re DefinedCrowd and we’re making machines smarter! If you’d like to try our platform, go to definedcrowd.ai/disrupt or shoot me an email!