FAQ

What is EvALL 2.0?

EvALL 2.0 (Evaluate ALL 2.0) is an evaluation tool for information systems that supports a wide set of metrics covering many evaluation contexts, including classification, ranking, and LeWiDi (Learning With Disagreements). EvALL 2.0 is designed around the following concepts: (i) persistence: users can save evaluations as well as retrieve past evaluations; (ii) replicability: all evaluations are performed following the same methodology, making them strictly comparable; (iii) effectiveness: all metrics are encompassed under the theory of measurement and have been doubly implemented and compared; (iv) generalization: a standardized input format allows users to evaluate in all evaluation contexts.

Additionally, EvALL 2.0 uses the PyEvALL library (The Python library to Evaluate ALL) to manage evaluations efficiently and replicably. PyEvALL allows the evaluation of a wide range of metrics and ensures the persistence of results, so that the user can easily retrieve previous evaluations.

 

What is EvALL 2.0 used for?

The ultimate goal of EvALL 2.0 is to provide an advanced evaluation platform for NLP systems where, in just four steps, any user, regardless of profile and experience, can evaluate their models. The user only has to: (i) upload the gold standard file; (ii) upload one or more files with their model's predictions; (iii) select the desired metrics; and (iv) click the Evaluate button. In particular, EvALL 2.0 allows users to:

  • Perform evaluation against a gold standard that is in the EvALL 2.0 Repository.
  • Perform evaluation by uploading your own gold standard.
  • View results both textually and graphically through the Evaluation Dashboard.
  • Select from a broad set of metrics covering 6 different evaluation contexts.
  • Analyze your evaluations in detail, including disaggregated analysis, through the PyEvALL console and the PyEvALL report.
  • Publish your results in the EvALL 2.0 repository.
  • Publish your gold standards in the EvALL 2.0 repository.

 

What metrics does EvALL 2.0 include?

EvALL 2.0 offers different metrics depending on the evaluation context: Accuracy, System Precision, Kappa, Precision, Recall, F-Measure, ICM (Information Contrast Measure), ICM Norm, Precision at K, R Precision, MRR, MAP, DCG, nDCG, Cross Entropy, ICM Soft, and ICM Soft Norm.

 

What evaluation contexts does EvALL 2.0 include?

EvALL 2.0 allows evaluation in the following evaluation contexts:

  • Single-label classification without hierarchy: an evaluation context in which each instance is assigned exactly one target class. The classes have no order or hierarchy among them, and all have the same relevance. The metrics available for this context are: Accuracy, System Precision, Kappa, Precision, Recall, F-Measure, ICM, and ICM Norm.
  • Multi-label classification without hierarchy: an evaluation context in which each instance can be assigned one or several classes from a set of target classes. The classes have no order or hierarchy among them, and all have the same relevance. In this context, the metrics Precision, Recall, and F-Measure can be evaluated.
  • Single-label hierarchical classification: an evaluation context in which each instance is assigned exactly one target class. The classes have a hierarchical relationship, so that errors between classes at the same hierarchical level represent a lesser failure than errors between classes at different hierarchical levels. In this context, the metrics ICM and ICM Norm are available.
  • Multi-label hierarchical classification: an evaluation context in which each instance is assigned one or several classes from a set of target classes. The classes have a hierarchical relationship, so that errors between classes at the same hierarchical level represent a lesser failure than errors between classes at different hierarchical levels. In this context, the metrics ICM and ICM Norm can be used.
  • Ranking: in the ranking evaluation context, the metrics quantify to what extent a ranking produced by the systems is compatible with the relevance values assigned in the gold standard. In this context, the following metrics are available: Precision at K, R Precision, MRR, MAP, DCG, and nDCG.
  • LeWiDi (Learning With Disagreements): an evaluation context in which each instance has a probability distribution over all possible classes. To evaluate in disagreement contexts, the metrics Cross Entropy, ICM Soft, and ICM Soft Norm are available.

 

What do I need to use EvALL 2.0?

EvALL 2.0 is an application aimed at registered users, primarily because evaluation processes on the server must be controlled to avoid excessive resource consumption. To register, you only need to fill out a short form with some basic information and an email address. The process has two steps: first, the user completes and submits the registration form; second, the request must be accepted by an administrator.

 

How can I register?

If you have an account from the ODESIA group of applications, you only need to click on the Login/Register item in the main menu and use those credentials. If you do not have an account, on the same Login screen you can select the "Create new account" option and fill in the mandatory fields "Email address" and "Username". You will then receive an email with your password.

 

How can I evaluate my models with EvALL 2.0?

To evaluate a results file in EvALL, the first step is to create an evaluation. To do this, on the "My evaluations" screen, click "New evaluation" and select whether the evaluation uses a gold standard included in the repository or a gold standard uploaded by the user.

  • Gold standard included in the repository: when selecting a gold standard from the repository, all the necessary definitions are obtained directly from the repository stored in the system, so you only have to search for the gold standard you want to use.
  • Gold standard uploaded by the user: this option allows you to create a gold standard from a custom file; for this, the file format must meet the requirements explained in this FAQ. Once the file is uploaded, you must select its format and, if applicable, its hierarchy.

Once the evaluation is created, the Results Dashboard displays a form in the left column where you can upload one or more prediction files and then select, from the list, the metrics you want to evaluate with. Finally, the "Evaluate" button starts the evaluation against the previously selected gold standard, and the results are displayed both in the results table and in the charts.
EvALL 2.0 automatically stores all evaluations conducted by a user, so you can access and manage them at any time. Through the "My evaluations" page, a user can access past evaluations to view their results, evaluate new systems and compare them with previous ones, delete evaluations stored in the database, or create new evaluations.

 

What format should the gold standard and prediction files be in?

The default format for EvALL is JSON, given its versatility and the ease with which format errors can be detected and controlled. To this end, a schema has been created that defines the permitted attributes and their types, declaring for each one whether it is required or not.

EvALL JSON schema
Specifically, each attribute represents:

  • test_case: a string identifying a specific experiment or use case, which could be, for example, different executions of a classification algorithm, or different queries in a ranking context.
  • id: the unique identifier of the instance in the dataset. It accepts either string or int format, but it must be the same in both the predictions file and the gold standard.
  • value: the value assigned to each item, whose type varies depending on the evaluation context. For example, for single-label classification the element is a string, while for multi-label classification it is an array of strings.
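Based on the attribute descriptions above, a minimal sketch of what such a schema might look like is shown below. This is a hypothetical reconstruction: the official EvALL schema may differ, for example in which attributes are required or whether additional properties are allowed.

    {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "test_case": { "type": "string" },
          "id": { "type": ["string", "integer"] },
          "value": {}
        },
        "required": ["test_case", "id", "value"],
        "additionalProperties": false
      }
    }

The "value" property is left unconstrained in this sketch because its type depends on the evaluation context.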

This format can be adapted to different evaluation contexts; examples of the format for each context are shown below:

Single-label Classification Format

This is the typical format for single-label classification tasks, where each item has a single associated class. In this format, the label can be any string. An example of this format is:

EvALL Single-label Classification Format
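A sketch of such a file, reconstructed from the description below:

    [
      { "test_case": "EXIST2023", "id": "I1", "value": "A" },
      { "test_case": "EXIST2023", "id": "I2", "value": "B" },
      { "test_case": "EXIST2023", "id": "I3", "value": "C" }
    ]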

In the example shown, the array consists of three elements belonging to the same test_case, "EXIST2023", with three different identifiers (I1, I2, and I3), and three different target classes ("A", "B", and "C").

 

Multi-label Classification Format

The multi-label classification format is one in which each item may be classified with one or several target classes. The files for this format therefore consist of the same elements as in the previous case, except that the "value" attribute is an array of elements, each of which must be a string. An example of this format is the following:

EvALL Multi-label Classification Format
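A sketch of such a file; the test_case, identifiers, and target classes are illustrative:

    [
      { "test_case": "EXIST2023", "id": "I1", "value": ["A", "B"] },
      { "test_case": "EXIST2023", "id": "I2", "value": ["A"] },
      { "test_case": "EXIST2023", "id": "I3", "value": ["B", "C"] }
    ]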

As can be seen in the example, the file consists of an array of JSON objects with three elements with the same fields as in the previous case, except that the "value" attribute is an array containing the target classes of each item.

 

Classification with Disagreements Format

The classification with disagreements format allows each element of the dataset to be assigned a probability distribution over the classes. That is, instead of assigning a single absolute category to each item, each element is assigned the distribution of labels produced by the annotators. As the following example shows, the format used by EvALL is similar to the previous ones, except that the "value" attribute is a JSON object (a dictionary) in which each key represents a target class and its value the probability of that class being assigned. Note that in single-label classification with disagreements, the probabilities for each element must sum to 1, while for multi-label classification this is not necessary.

EvALL Classification with Disagreement Format
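A sketch of a single-label file with disagreements; the test_case, identifiers, classes, and probabilities are illustrative, and each item's probabilities sum to 1:

    [
      { "test_case": "EXIST2023", "id": "I1", "value": { "A": 0.7, "B": 0.3 } },
      { "test_case": "EXIST2023", "id": "I2", "value": { "A": 0.2, "B": 0.8 } },
      { "test_case": "EXIST2023", "id": "I3", "value": { "A": 0.5, "B": 0.5 } }
    ]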

 

Ranking Format

In the ranking evaluation context format, each item is assigned a value: in the predictions, it indicates the item's position in the ranking; in the gold standard, it indicates the item's relevance. As can be seen in the following example, the format is exactly the same as in the previous cases, except that the "value" attribute is a number representing, in each case, the ordering or the relevance of the item.

EvALL Ranking Format
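A sketch of a predictions file matching the description below; the test_case "Q1" is illustrative:

    [
      { "test_case": "Q1", "id": "A", "value": 1 },
      { "test_case": "Q1", "id": "B", "value": 2 }
    ]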

In the example, the predictions of a ranking system are shown: the item with identifier "A" has been assigned position 1 in the ranking, while the item with identifier "B" has been assigned position 2.