Info: How good are the AI annotations in my project?

The short answer is: it highly depends on your project, e.g., on what kind of texts you want to annotate and how difficult or complex your categories are. But generally, if you think your project is “difficult”, e.g., it would require extensive training of (human) annotators, you will likely need to add more manual annotations for the AI to learn the characteristics of your categories.

Let’s go into a bit more detail about why that is.

Easy projects
In some projects, we see that just 1 or 2 annotations made by you (for each category) suffice for the AI to produce almost 100% correct annotations. A typical example is classifying movie or product reviews as positive or negative (a prime example of sentiment analysis): the sentiment can usually be detected quite easily, both by humans and by current AI techniques.

Difficult projects
In other projects, you may need many more annotations for each category. Indicators that your project may require more manual annotations in order for the AI to perform well include:

  • categories are very fine-grained
  • categories partially overlap in meaning
  • the text at hand is of low quality, e.g., contains many typos or misspelled words
  • when annotating manually, you find it difficult to choose the right category, or you feel that a category does not fit 100% in some cases

Generally speaking, if you have an annotation project that requires extensive training of (human) annotators, it will typically also require more training of the AI, i.e., more manual annotations. Pay particular attention to adding only good annotations: if you add “sloppy” annotations (in the sense that you’re not entirely sure whether they are really correct), the AI will learn from these potentially bad annotations as well, because they become part of its training data.

Evaluation view to the rescue
There are multiple ways of determining how good the AI’s annotations are. One is to simply get a feeling for them by looking at them, e.g., in the Review View. Another one is our AI Insights View, which shows you the quality of the AI’s annotations. Internally, this works by creating automated annotations for those sentences that you have already annotated and comparing the automated annotations with your (correct) annotations. The results of this comparison are then shown in the evaluation view.

We will describe these metrics in more detail in a future post, but for now just know that the F1 score is a common measure in machine learning. It is a number between 0 and 1, with 0 being the worst (everything is incorrect) and 1 being the best (everything is correct). Just as in regular annotation projects with multiple annotators, you will typically never see 100% correctness, but if you have sufficient training data, i.e., manual annotations, in your project, the AI will be able to get close to it :slight_smile:
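To make the F1 score a bit more concrete, here is a minimal sketch in Python (not the app’s actual implementation; the example labels are made up) that compares hypothetical AI annotations against manual annotations for one category and computes precision, recall, and F1:

```python
# Hypothetical manual annotations vs. AI annotations for the same sentences.
manual = ["positive", "negative", "positive", "positive", "negative", "positive"]
ai     = ["positive", "positive", "positive", "negative", "negative", "positive"]

category = "positive"

# True positives: the AI chose this category and the manual annotation agrees.
tp = sum(1 for m, a in zip(manual, ai) if a == category and m == category)
# False positives: the AI chose this category but the manual annotation disagrees.
fp = sum(1 for m, a in zip(manual, ai) if a == category and m != category)
# False negatives: the manual annotation is this category but the AI missed it.
fn = sum(1 for m, a in zip(manual, ai) if a != category and m == category)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
# -> precision=0.75, recall=0.75, F1=0.75
```

In other words, F1 is the harmonic mean of precision (how many of the AI’s annotations are correct) and recall (how many of the correct annotations the AI actually found), so it only gets close to 1 when both are high.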

Feature Update
The current view just shows the bare numbers. We added it quickly so that the app can already give some insight into how good the AI annotations are in terms of quantitative evaluation metrics. However, this rather “functional” view was never meant to be a long-term solution but just an intermediate one. Right now, we’re working on a completely revised AI Insights view that will show you the most important information about the AI’s annotation performance in an intuitive way. If you’re interested, you will of course still be able to look at the bare numbers.

Any questions regarding evaluation or how to improve the AI’s quality? What do you want to hear next? Just let us know :slight_smile: