Data Labeling: Common Pitfalls and How To Fix Them with Datasaur

Ivan Lee: Good morning and good afternoon, everyone. I'm excited to be here today to talk about one of the most overlooked problems in the development of natural language processing. These lessons apply across the various types of traditional NLP, and I think they're just as applicable to LLMs today.

So let's look back a little at the last few years of development. This is a quote I like to bring up from 2020, and you can see just how far we've come since then. Harvard Business Review was predicting three years ago that the next big breakthroughs in AI would be about language, and we have since seen a surge of development and real production readiness in NLP models.

Just to share a few of the use cases we've been seeing across many different industries and functions: what I really appreciate about NLP is its applicability to so many industries, anything from the legal space through to medical and e-commerce. It touches so many real-world use cases. What's really exciting is that this is now applicable to many of our jobs. Especially with the advancements of ChatGPT, people are starting to use this in their personal lives, and now they're really wondering how they can apply it to their day-to-day work.

On top of that, it goes well beyond ChatGPT and LLMs. There are many different use cases we're seeing across those same industries, anything from the basics like entity recognition, sentiment analysis, and text classification through to more advanced types of NLP like aspect-based sentiment analysis or coreference resolution. All of these technologies are coming together to solve real-world problems. And in order to continue these advancements, labeled data is the foundational building block that will actually power all of it. Languages are very challenging: even for the single word "run" there are over 200 known definitions, and you can see a few of those examples here. So in order for us to keep teaching machines how our language operates, we're going to need to give them a lot of labeled data. Here is my own prediction: even with all the progress we've made over the last decade, I still believe that less than 0.001 of all the required labeled data has been labeled so far. Labeling is going to be fundamental to pushing these models into real-world applications.

Unfortunately, data science and NLP courses do not teach the basics. We learn everything we need to about how models work, which algorithms to use, what developments we've had, and what the latest state of the art looks like. But when it comes to the actual, pragmatic work of labeling and obtaining your training data, we're often left to our own devices, and everyone ends up reinventing the wheel. So what I want to talk through today is less about the research and academic side and more about the pragmatics: on day one, when you sit down to train a new algorithm, what should you be looking for?

As a side note, if you've been paying attention to the images so far: everything here was made with generative AI, so don't look too closely at these kids' fingers.

All right, let's get to the actual intro.

My name is Ivan, and I'm the CEO and founder here at Datasaur. A little bit about us: we have been in operation for four years now. We're funded by Initialized Capital, the CTO of Segment, and the CTO and now president of OpenAI. Over 5 million labels are applied on our platform every month, and we serve many of the top organizations and institutions around the world. For today's talk I'll be drawing on our experience interviewing and working with over 200 ML teams around the world.

So what's the problem here? What we've learned is that machine learning teams hate wasting their valuable time and resources labeling data. We really want to spend time on the fun stuff, the green area here: the model development, the training, the tuning itself. But in reality, an industry-wide survey has shown that the vast majority of our time is spent just gathering, cleaning, and preparing the training data to begin with.

Let me know if this sounds like a familiar situation: we have a lot of raw data, it needs to be labeled, we need a lot of labels (closer to millions of labels, in fact, by the end of next week), and there are only three of us. What can we do here?

Well, people jump to a lot of commonplace solutions. You could start by thinking: there are a lot of labeling experts out there; I get messages in my inbox asking if we need a labeling service. There's got to be lower-cost labor available somewhere, and that should be easily scalable. Problem solved, right? Unfortunately, as machine learning teams really get into this, they learn that training labelers can often take weeks. The guidelines they need to write start as a couple of paragraphs that should be super easy, but as they really get into it, they add more and more rules, answer more questions, and suddenly those guidelines are over 20 pages long.

Again, I've been there. I have written a 30-page guideline on how a specific job should be done. So it shouldn't be any surprise that with all of this there's a high turnover rate: people get tired of doing this work, they're not happy with it, it has gotten very complex, and now you have to train entirely new labelers, which again takes weeks. What I'm hoping to cover in the next 20 minutes is how we can solve this: identifying where the real problems are even before we begin. By answering a lot of these questions in advance and planning ahead, we've seen teams cut their timelines down from over a year to under three months.

So without further ado, let's jump into the tactical questions, starting with the top three. The first question: do you want to go with a crowdsourced option, a managed service, or an internally built workforce? Well, in order to answer that question, we actually have to jump ahead to the next couple of questions. How important is the quality of the data? If you're working in the healthcare field and there are lives on the line, then quality is of the utmost importance: you're willing to have the same data labeled by three or even five people just to make sure they agree, so that you truly have the highest-quality training data. On the other hand, for sentiment analysis on, say, tweets or Yelp reviews, we can excuse some number of errors.
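To make the redundancy idea concrete, here is a minimal Python sketch of consensus labeling, assuming each item is labeled by several annotators. The label names and the agreement threshold are illustrative, not tied to any particular tool:

```python
# Minimal consensus-labeling sketch: accept the majority label only when
# enough annotators agree; otherwise escalate the item for expert review.
# The labels and the 0.6 threshold below are illustrative assumptions.
from collections import Counter
from typing import Optional

def consensus_label(votes: list[str], min_agreement: float = 0.6) -> Optional[str]:
    """Return the majority label if its share of votes meets the threshold,
    else None, signaling that the item needs an expert reviewer."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None

# Three annotators per item, as you might require for high-stakes data:
print(consensus_label(["diagnosis", "diagnosis", "symptom"]))   # diagnosis
print(consensus_label(["diagnosis", "symptom", "treatment"]))   # None -> review
```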

So we first have to determine our quality threshold, and then compare that with the timeline: when do we actually need this data? And of course there's budget. It's the typical triangle: you've got three options, and you can choose two at most. If you really want to optimize for quality, you might lean toward an internal workforce you can train up: full-time employees or contractors you manage personally. On the other hand, if you have a very rapid deadline approaching and just need this done as quickly as possible, you may not have the time to build that out in-house, and you might want to go with an existing outsourced service.

There's also the level of domain expertise required (sorry, I meant to go through some of these questions earlier). You may not have the option of going with a crowdsourced or managed service; you may need, say, certified healthcare workers, doctors, or lawyers working on this, in which case you might need to build in-house again. By answering all of these questions, you'll be able to zero in on a one-year plan: what are we going to do just to get through the first milestone of this project? Then you should start planning ahead and thinking long term. Is this going to be an ongoing project for us? Are we committing to this for the next three to five years? And if so, is there a point at which it makes sense to switch from working with an external vendor to bringing this in-house?

Just a quick side note, and this is purely an objective observation: we've seen a lot of people start with Mechanical Turk, and we've also seen every single customer that starts with Mechanical Turk eventually move off of it. It's a well-known brand, and Amazon offers great alternative services on its Marketplace, but Mechanical Turk wasn't originally built for this type of AI data labeling. The error rate and the turnover mean that even if it's the lower-cost option to begin with, the costs quickly outweigh the benefits. Moving on, there's the question of ethical sourcing.

Just to be clear, this is a business problem in addition to an ethical one. As you work with these vendors, they can't just be nameless faces: you have to understand who is doing your labeling, under what conditions, and for how many hours. As you interview these services, you really need to ask these questions up front, because if you don't, this absolutely can come back to bite you. Every three to six months there's been article after article about the terrible conditions under which some companies employ these contractors, and oftentimes the folks who made the hiring decisions aren't even aware; they just never thought to ask. So it is important to understand, specifically, the training, the equipment provided, and the job security. That goes into the next point: it's important to take a moment and think about what this job looks like from a labeler's point of view.

There is a lot of tedious, boring, and repetitive work involved, and labelers know it. They know they're constantly being evaluated; they're fearful for their job security; they know this job could go away quickly; and they're very aware that they're helping build the system that will eventually automate their job away. The escalation system can also be challenging, especially when they're in different time zones around the world: when they have questions, they have to wait a 12-hour cycle at minimum to get an answer to 'What should I have applied in this particular case?'

Oftentimes the client becomes this faceless corporate employer, this distant boss, and so naturally labelers start learning how to game the system and get the most out of the job while they have it. A lot of people I talk to complain that a task should be super simple, yet the work that comes back is just never quite right. Why can't they ever get it right? It's because of all these conflicting incentives and work environments that make it harder to establish what actually needs to be done, and that leads to a lot of errors and constant redoing of the work. That brings us to data privacy and compliance.

This is a very hot topic these days in the world of AI. You have to think about who can actually access your data. What are the regulatory requirements? These can differ from state to state and from country to country. You have to assume a worst-case scenario: if there is a data leak, what happens? So you need to establish policies at the very start of the project as to what kind of data is being transmitted, what processes you have in place to ensure the data isn't leaked, and, assuming a worst case does happen, what you will do. Next up is designing the right job. This is a very common pitfall I see, especially when talking to engineers. Let's talk through a very common example: in my previous roles at different companies, I've had to do content moderation, trying to understand whether something is appropriate or not.

Back then, I started off very naive. I thought I could just reuse the rules we have for movies, what's rated PG-13 and what's rated R, and copy and paste those same guidelines. Not at all true: those are very vague guidelines, and it is incredibly difficult to maintain consistency. A lot of engineers come in thinking, honestly, I could probably write a few rules or code up a quick algorithm; why do we even need a human to begin with? But here's the thing: if you could just write that algorithm, why would you need AI? That was the fundamental need to begin with in designing this job. On top of that, you have to account for human error, human bias, and fatigue; these things naturally happen. A very common mitigation is to assign at least two people to look at any given item, and then have a reviewer look at where those two people disagreed. That can eliminate 80 or 90 percent of the initial error rate.
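As an illustration of that two-labelers-plus-reviewer pattern, here is a minimal Python sketch; the item IDs and label names are hypothetical, and this is not any specific tool's API:

```python
# Route items where two annotators disagree into a reviewer queue.
# Assumes both annotators labeled the same set of item IDs.
def split_by_agreement(labels_a: dict[str, str], labels_b: dict[str, str]):
    """Accept items where both annotators agree; queue disagreements
    for a human reviewer to adjudicate."""
    accepted, review_queue = {}, []
    for item_id, label in labels_a.items():
        if labels_b.get(item_id) == label:
            accepted[item_id] = label
        else:
            review_queue.append(item_id)
    return accepted, review_queue

a = {"doc1": "spam", "doc2": "ham", "doc3": "spam"}
b = {"doc1": "spam", "doc2": "spam", "doc3": "spam"}
accepted, to_review = split_by_agreement(a, b)
print(accepted)   # {'doc1': 'spam', 'doc3': 'spam'}
print(to_review)  # ['doc2'] -> goes to the reviewer
```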

But again, that adds time and cost to the task. When you see these errors, you have to dig in and start looking at the work yourself: is it that your labelers are messing up, or is there natural ambiguity because the task wasn't well defined to begin with? Should these labelers be channeling the end user, pretending to be an end user, or are they channeling your product division? Are they expected to be experts in your product, or should they come in intentionally naive and approach the task from the user's perspective?

And finally, how much time do you expect to be spent per task? A lot of these services charge by the hour, and you're trying to fit in as many tasks as possible in any given minute or hour. But setting those expectations up front and putting a timer on each and every task can prevent labelers from taking the time to research and understand what the task is supposed to be in the first place. So how do we begin mitigating all this? It feels like all I've been doing is asking questions. Well, first, you can start small, review, and repeat: you don't want to send all of your data over at once. Start with a small data set, see how that performs, try to identify the edge cases, and then move on from there. On top of that, labeling redundancy is very important: the ability, as I said earlier, to have multiple people looking at the exact same data.
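One way to make that start-small-review-repeat loop measurable, as a sketch: compute inter-annotator agreement on the pilot batch before scaling up, for example with Cohen's kappa. This assumes scikit-learn is installed; the labels and the 0.6 cutoff are illustrative rules of thumb, not a universal standard:

```python
# Measure agreement between two annotators on a small pilot batch.
# Low kappa usually means the guidelines are ambiguous, not that the
# labelers are careless, so revise the guidelines before scaling.
from sklearn.metrics import cohen_kappa_score

pilot_a = ["pos", "neg", "pos", "neu", "pos", "neg"]
pilot_b = ["pos", "neg", "neu", "neu", "pos", "pos"]

kappa = cohen_kappa_score(pilot_a, pilot_b)
if kappa < 0.6:  # a common rule-of-thumb floor for "substantial" agreement
    print(f"kappa={kappa:.2f}: revise the guidelines before the next batch")
else:
    print(f"kappa={kappa:.2f}: agreement looks solid, scale up the batch size")
```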

Next come comprehensive guidelines, which go hand in hand with the next topic. We think these projects are very simple to start with, but as you go through and label some of your own data, you will very quickly identify edge cases. So you have to think about this at a systems level: not only the baseline of what we expect, but also the edge cases we haven't considered yet, the ways somebody could game the system, and how users will try to bypass some of these filters; then we should figure out where we want to draw the line. And lastly, provide kind and empathetic feedback. We talk a lot about usability guidelines for end users, but when we're working with internal teams or outsourced vendors, we need to think about the human on the other side too: what they're trying to accomplish and how we can best align with them. Final topic: of course, it is also important to choose the right tools. A lot of companies we talk to start off by labeling in spreadsheets, and that's totally fine for one or two people doing the labeling.

I think spreadsheets are a perfectly reasonable way to start; we're all very familiar with them. As the work begins to scale, people start thinking: I really wish we could automate this part, or have easier ways to select from our dropdown options. And again, as engineers, we often think, I could build a web app for that pretty easily. But what I want to share today, and this often comes as a surprise, is that there are many readily available tools out there, and they all come with their pros and cons, so you have to think through and explore what the right tools are for you. Full disclosure: Datasaur is an NLP labeling platform, and we have spent a lot of time and effort building out these tools. But there are many different options, and depending on your specific use case and the stage of your labeling, you'll want to evaluate which one has the right cost-benefit tradeoff for you and your team.

My mission when I started this company four years ago came from spending the previous seven years as a product manager on the other side of the table. I had built the same in-house tooling solution three times over before deciding I didn't want anybody else wasting their time and resources doing the same work I had done over those seven years. I want to make sure nobody ever devotes their time to building their own in-house labeling tools again. You wouldn't rebuild Photoshop for your artists or Figma for your designers; you similarly should not be rebuilding a labeling solution from scratch.

I also want to improve productivity and quality. As NLP moves ever more into the spotlight, I want to share these industry best practices and have people ask and answer these questions in advance, without having to learn them the hard way. And I want to help you be a thought leader for your team: for all of you trying to push for adopting NLP at your company, I want to be your ally and partner in making sure that effort is successful.

I'll close with a very quick live demo. We've talked about a lot of these principles in the abstract; I want to share a little of what professional tooling can look like and how it can really impact your job. Welcome to Datasaur. I'm just going to give the two-minute whirlwind demo here. In Datasaur, what we have is a text document we've loaded up, and you can start by selecting any of these spans and applying labels very quickly. These correspond to keyboard shortcuts, and I can apply whatever label is appropriate.

I can even draw relationships from one entity to another, so you can capture the relationships between different entities. But what I really want to share is how much the tooling is advancing. You may have seen plenty of tools similar to this, but we are evolving very rapidly as an industry. One of those advancements is the ability to use ChatGPT to label the data itself. Our most popular feature by a long shot over the last couple of months has been our integration with OpenAI; there were a couple of research articles demonstrating that OpenAI's ChatGPT could actually outperform human annotators on four out of five labeling tasks. So now you can take a text classification project like this one, write your prompts directly within Datasaur, and call the OpenAI API. In this case the system prompt is "you are an expert data labeler in classifying categories," and you write the prompt to predict the labels.
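For anyone who wants to try this pattern outside of a labeling platform, here is a minimal sketch of LLM-assisted text classification against the OpenAI API. This is not Datasaur's actual integration: the model name, label set, and prompt wording are all illustrative, and it assumes the openai Python package (v1+) with an OPENAI_API_KEY environment variable set:

```python
# Minimal LLM-assisted labeling sketch using the OpenAI chat API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["billing", "technical support", "sales", "other"]  # hypothetical taxonomy

def predict_label(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model will do for the sketch
        messages=[
            {"role": "system",
             "content": "You are an expert data labeler in classifying categories. "
                        f"Respond with exactly one of: {', '.join(LABELS)}."},
            {"role": "user", "content": text},
        ],
        temperature=0,  # deterministic output is preferable for labeling
    )
    return response.choices[0].message.content.strip()

print(predict_label("I was charged twice for my subscription this month."))
```

Even with a prompt like this, the earlier redundancy advice still applies: treat the model as one more annotator whose output gets reviewed, not as ground truth.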

Let's see if the API is live right now. It is, and it automatically applies those topics. So not only should you look for a solution whose interface is easy to pick up and start labeling with, you also want to identify what level of automation and what additional features and capabilities are relevant for you and your projects. Last but not least, as you scale, it's important to look for workforce management capabilities. Oftentimes what we see from customers is that they start off with yet another spreadsheet tracking who is working on what project and how much progress they've made. In tools like ours you can automatically assign work and track the efficiency, quality, and throughput of your labelers, so as you scale to 5, 10, or 50 labelers, you get a much better sense of the progress you're making on your training data and the impact it's having on your resulting ML capabilities.
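Even if you stay on a spreadsheet for a while, the underlying metrics are simple to compute. Here is a sketch of per-labeler throughput and reviewer-rejection rate from a hypothetical event log; the field names and numbers are made up for illustration:

```python
# Aggregate a simple event log into per-labeler throughput and the share
# of their work rejected by reviewers. The schema is illustrative only.
from collections import defaultdict

events = [
    {"labeler": "ana", "items": 120, "rejected": 6},
    {"labeler": "ben", "items": 95,  "rejected": 19},
    {"labeler": "ana", "items": 110, "rejected": 4},
]

totals = defaultdict(lambda: {"items": 0, "rejected": 0})
for e in events:
    totals[e["labeler"]]["items"] += e["items"]
    totals[e["labeler"]]["rejected"] += e["rejected"]

for name, t in sorted(totals.items()):
    rate = t["rejected"] / t["items"]
    print(f"{name}: {t['items']} items labeled, {rate:.1%} rejected in review")
```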

Now, moving back to tactical takeaways: you've sat with me for the last 25 minutes, so what are some things you can take back to your team? I think it starts with understanding how you're going to staff the labeling work. Will you go with an external or an internal workforce, or will you use some kind of hybrid and transition between the two? Will you need subject-matter experts? That can certainly influence your decision on which way to go. And finally, think through, ahead of time, the compliance and regulatory requirements for how the data is handled. Then, on to the labeling tool: what type of interface do you actually need? Is automated labeling applicable, and is that something you want to leverage to get a head start on your work? What level of data isolation and permissioning is required as part of that tool? And how do you intend to manage your workforce, and is that something the software can help you handle? Thank you for attending today. If you have any further questions, you can reach out to me at any time at the email listed, ivan@datasaur.ai, or feel free to send a connection request on LinkedIn. I'd be happy to consult with you and your team on any of the best practices we shared today.
