API

This is the full API reference of all public methods.

Metrics

irmetrics.topk.ap(y_true, y_pred, k=None, relevance=<function multilabel>)

Compute Average Precision score(s). AP is an approximation of the integral over the precision-recall (PR) curve.

Parameters
y_true : scalar, iterable or ndarray of shape (n_samples, n_labels)

True labels of entities to be ranked. In case of scalars y_pred should be of shape (1, n_labels).

y_pred : iterable, ndarray of shape (n_samples, n_labels)

Target labels sorted by relevance (as returned by an IR system).

k : int, default=None

Only consider the highest k scores in the ranking. If None, use all outputs. The minimum of the number of correct answers and k is used to compute the score.

relevance : callable, default=topk.relevance.multilabel

A function that calculates relevance judgements based on input y_pred and y_true.

Returns
ap : float

The average precision for a given sample.

References

Wikipedia entry for Mean Average Precision

Examples

>>> from irmetrics.topk import ap

for ground-truth labels related to a query:

>>> y_true = 1

and the predicted labels by an IR system:

>>> y_pred = [1, 0, 0]
>>> ap(y_true, y_pred)
0.3333333333333333
>>> # This should be fixed
>>> y_true = [1, 4, 5]

and the predicted labels by an IR system:

>>> y_pred = [1, 2, 3, 4, 5]
>>> ap(y_true, y_pred)
array([0.2, 0. , 0. ])
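
The role of k and of the min(number of correct answers, k) normaliser described above can be illustrated with a minimal pure-Python sketch of the textbook AP@k definition. The helper name average_precision_at_k is hypothetical and not part of irmetrics, and the sketch is not guaranteed to reproduce the library outputs shown above.

```python
def average_precision_at_k(relevant, ranked, k=None):
    """Textbook AP@k for one query (hypothetical helper, not the irmetrics code)."""
    relevant = set(relevant)
    top = list(ranked) if k is None else list(ranked)[:k]
    hits = 0
    precision_sum = 0.0
    for position, label in enumerate(top, start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this cut-off
    # Normalise by min(number of correct answers, k); with k=None use all answers.
    denominator = len(relevant) if k is None else min(len(relevant), k)
    return precision_sum / denominator if denominator else 0.0


# Relevant items "a" and "c" are found at ranks 1 and 3.
score_all = average_precision_at_k({"a", "c"}, ["a", "b", "c", "d"])
score_top1 = average_precision_at_k({"a", "c"}, ["a", "b", "c", "d"], k=1)
```

Here score_all averages the precisions 1/1 and 2/3 over two correct answers, while score_top1 only looks at the first position.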
irmetrics.topk.dcg_score(relevance, k=None, weights=1.0)

Compute Discounted Cumulative Gain score(s) based on relevance judgements provided.

This function is provided as the internal implementation for ndcg. For this reason its API differs slightly: it always accepts and returns np.arrays, unlike the other methods in this module.

Parameters
relevance : iterable or ndarray of shape (n_samples, n_labels) or simply (n_labels,)

The relevance judgements provided by experts. The last dimension of the parameter is used as position.

weights : scalar, iterable or ndarray of shape (n_samples,), default=1.0

Takes into account the importance of each sample, if relevant.

k : int, default=None

Only consider the highest k scores in the ranking. If None, use all outputs.

Returns
dcg : np.array

The discounted cumulative gains for samples (or a single sample).

References

Wikipedia entry for Discounted cumulative gain

Examples

>>> import numpy as np
>>> from irmetrics.topk import dcg_score

for ground-truth labels related to a query:

>>> relevance_judgements = np.array([[1, 0, 0, 0]])
>>> dcg_score(relevance_judgements)
array([1.])
>>> relevance_judgements = np.array([[True, False, False, False]])
>>> dcg_score(relevance_judgements)
array([1.])
>>> relevance_judgements = np.array([[False, True, False, False]])
>>> dcg_score(relevance_judgements)
array([0.63092975])
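
The outputs above are consistent with a logarithmic discount of 1 / log2(position + 1), with positions counted from 1. The following is a minimal sketch under that assumption; the helper name dcg is hypothetical and handles a single ranked list rather than an array of samples.

```python
import math


def dcg(relevance, k=None):
    """DCG of one ranked list, ASSUMING a 1 / log2(position + 1) discount."""
    top = list(relevance) if k is None else list(relevance)[:k]
    return sum(
        float(rel) / math.log2(position + 1)
        for position, rel in enumerate(top, start=1)
    )


first_hit = dcg([True, False, False, False])   # relevant item at rank 1
second_hit = dcg([False, True, False, False])  # relevant item at rank 2
```

A hit at rank 1 is not discounted at all, while a hit at rank 2 is divided by log2(3).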
irmetrics.topk.ndcg(y_true, y_pred, k=None, relevance=<function multilabel>, weights=1.0)

Compute Normalized Discounted Cumulative Gain score(s) based on relevance judgements provided.

Parameters
y_true : iterable or ndarray of shape (n_samples, n_labels) or simply (n_labels,)

The last dimension of the parameter is used as position.

y_pred : iterable, ndarray of shape (n_samples, n_labels)

Target labels sorted by relevance (as returned by an IR system).

k : int, default=None

Only consider the highest k scores in the ranking. If None, use all outputs.

weights : float, iterable, ndarray, default=1.0

Represents the weights of each sample.

relevance : callable, default=topk.relevance.multilabel

A function that calculates relevance judgements based on input y_pred and y_true.

Returns
ndcg : np.array

The normalized discounted cumulative gains for samples (or a single sample).

References

Wikipedia entry for normalized discounted cumulative gain

Examples

>>> from irmetrics.topk import ndcg

for ground-truth labels related to a query:

>>> y_true = [1, 2]
>>> y_pred = [0, 1, 0, 0]
>>> ndcg(y_true, y_pred)
0.6309297535714575
>>> # the order of y_true labels doesn't matter
>>> y_true = [2, 1]
>>> y_pred = [0, 1, 0, 0]
>>> ndcg(y_true, y_pred)
0.6309297535714575
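
The value in the example above can be reproduced by hand with a small sketch. It assumes a 1 / log2(position + 1) discount and assumes that the ideal ranking is obtained by sorting the observed relevance judgements in descending order; both are assumptions inferred from the documented outputs, and the helper names are hypothetical.

```python
import math


def dcg(relevance):
    """DCG of one ranked list with an assumed 1 / log2(position + 1) discount."""
    return sum(
        float(rel) / math.log2(position + 1)
        for position, rel in enumerate(relevance, start=1)
    )


def ndcg_from_labels(y_true, y_pred):
    """nDCG for one query; the ideal DCG is ASSUMED to come from the sorted judgements."""
    relevant = set(y_true)
    judgements = [label in relevant for label in y_pred]
    ideal = dcg(sorted(judgements, reverse=True))
    return dcg(judgements) / ideal if ideal else 0.0


score = ndcg_from_labels([1, 2], [0, 1, 0, 0])
swapped = ndcg_from_labels([2, 1], [0, 1, 0, 0])  # order of y_true is irrelevant
```

Because y_true is only used for membership tests, reordering it cannot change the score.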
irmetrics.topk.precision(y_true, y_pred=None, k=None, relevance=<function multilabel>)

Compute Precision(s). Check which fraction of y_pred is in y_true. NB: When passing y_pred of shape [n_samples, n_outputs] the result is equivalent to recall(y_pred, y_true) / n_outputs.

Parameters
y_true : scalar, iterable or ndarray of shape (n_samples, n_labels)

True labels of entities to be ranked. In case of scalars y_pred should be of shape (1, n_labels).

y_pred : iterable, ndarray of shape (n_samples, n_labels)

Target labels sorted by relevance (as returned by an IR system).

k : int, default=None

Only consider the highest k scores in the ranking. If None, use all outputs.

relevance : callable, default=topk.relevance.multilabel

A function that calculates relevance judgements based on input y_pred and y_true.

Returns
precision : float in [0., 1.]

The precision scores for all samples.

References

Wikipedia entry for precision and recall

Examples

>>> from irmetrics.topk import precision

for ground-truth labels related to a query:

>>> y_true = 1

and the predicted labels by an IR system:

>>> y_pred = [0, 1, 4, 3]
>>> precision(y_true, y_pred)
0.25
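
The "fraction of y_pred that is in y_true" can be written out directly. This is a minimal pure-Python sketch for a single query; the helper name precision_at_k is hypothetical and is not the irmetrics implementation.

```python
def precision_at_k(y_true, y_pred, k=None):
    """Fraction of the (top-k) predictions that appear in y_true."""
    relevant = set(y_true)
    top = list(y_pred) if k is None else list(y_pred)[:k]
    if not top:
        return 0.0
    return sum(label in relevant for label in top) / len(top)


score = precision_at_k([1], [0, 1, 4, 3])          # one hit out of four
score_top2 = precision_at_k([1], [0, 1, 4, 3], k=2)  # one hit out of two
```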
irmetrics.topk.recall(y_true, y_pred=None, k=None, relevance=<function multilabel>)

Compute Recall(s). Check if at least one item proposed in y_pred is in y_true. This is a binary score: 0 if all predictions are irrelevant and 1 otherwise. This definition of recall is equivalent to accuracy@k.

Parameters
y_true : scalar, iterable or ndarray of shape (n_samples, n_labels)

True labels of entities to be ranked. In case of scalars y_pred should be of shape (1, n_labels).

y_pred : iterable, ndarray of shape (n_samples, n_labels)

Target labels sorted by relevance (as returned by an IR system).

k : int, default=None

Only consider the highest k scores in the ranking. If None, use all outputs.

relevance : callable, default=topk.relevance.multilabel

A function that calculates relevance judgements based on input y_pred and y_true.

Returns
recall : bool in [True, False]

The relevances for all samples.

References

Wikipedia entry for precision and recall

Examples

>>> from irmetrics.topk import recall

for ground-truth labels related to a query:

>>> y_true = 1

and the predicted labels by an IR system:

>>> y_pred = [0, 1, 4]
>>> recall(y_true, y_pred)
1.0
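
The binary definition above (1 if any of the top-k predictions is relevant, 0 otherwise) can be sketched in a few lines. The helper name binary_recall_at_k is hypothetical; it handles one query and is not the irmetrics implementation.

```python
def binary_recall_at_k(y_true, y_pred, k=None):
    """Return 1.0 if at least one of the (top-k) predictions is in y_true, else 0.0."""
    relevant = set(y_true)
    top = list(y_pred) if k is None else list(y_pred)[:k]
    return 1.0 if any(label in relevant for label in top) else 0.0


hit = binary_recall_at_k([1], [0, 1, 4])        # label 1 is found at rank 2
miss = binary_recall_at_k([1], [0, 1, 4], k=1)  # only rank 1 is inspected
```

Truncating to k=1 turns the hit into a miss, which is why this score behaves like accuracy@k.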
irmetrics.topk.rr(y_true, y_pred, k=None, relevance=<function multilabel>)

Compute Reciprocal Rank(s). Calculate the reciprocal of the index for the first matched item in y_pred. The score is between 0 and 1.

This ranking metric yields a high value if true labels are ranked high by y_pred.

Parameters
y_true : scalar, iterable or ndarray of shape (n_samples, n_labels)

True labels of entities to be ranked. In case of scalars y_pred should be of shape (1, n_labels).

y_pred : iterable, ndarray of shape (n_samples, n_labels)

Target labels sorted by relevance (as returned by an IR system).

k : int, default=None

Only consider the highest k scores in the ranking. If None, use all outputs.

relevance : callable, default=topk.relevance.multilabel

A function that calculates relevance judgements based on input y_pred and y_true.

Returns
rr : float in [0., 1.]

The reciprocal ranks for all samples.

References

Wikipedia entry for Mean reciprocal rank

Examples

>>> from irmetrics.topk import rr
>>> y_true = 1

and the predicted labels by an IR system:

>>> y_pred = [0, 1, 4]
>>> rr(y_true, y_pred)
0.5
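
The "reciprocal of the index for the first matched item" can be spelled out as follows. This is a minimal pure-Python sketch for a single query, with positions counted from 1; the helper name reciprocal_rank is hypothetical and is not the irmetrics implementation.

```python
def reciprocal_rank(y_true, y_pred, k=None):
    """1 / rank of the first relevant prediction, or 0.0 if none is found."""
    relevant = set(y_true)
    top = list(y_pred) if k is None else list(y_pred)[:k]
    for position, label in enumerate(top, start=1):
        if label in relevant:
            return 1.0 / position
    return 0.0


score = reciprocal_rank([1], [0, 1, 4])            # first match at rank 2
score_top1 = reciprocal_rank([1], [0, 1, 4], k=1)  # no match within the top 1
```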
irmetrics.coverage.coverage(y_pred, padding=None)

Compute Coverage(s). Check if y_pred contains any nontrivial results.

Parameters
y_pred : iterable, ndarray of shape (n_samples, n_labels)

Target labels sorted by relevance (as returned by an IR system).

padding : scalar, str, default=None

The value that was used to pad the predictions to get the same length.

Returns
coverage : int in [0, 1]

The coverage is 1 if y_pred contains any results different from padding and 0 otherwise.

Examples

>>> from irmetrics.coverage import coverage

for the predicted labels by an IR system:

>>> y_pred = [0, 1, 4]
>>> coverage(y_pred)
1
>>> y_pred = [0, None]
>>> coverage(y_pred)
1
>>> coverage([-1], padding=-1)
0
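
The rule in the Returns section (1 if anything other than padding was returned, 0 otherwise) amounts to a single any() check. A minimal sketch for one query, using a hypothetical helper name has_results:

```python
def has_results(y_pred, padding=None):
    """Return 1 if y_pred holds at least one value different from padding, else 0."""
    return int(any(label != padding for label in y_pred))


full = has_results([0, 1, 4])               # every entry is a real result
partly_padded = has_results([0, None])      # 0 is a real result, None is padding
only_padding = has_results([-1], padding=-1)
```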
irmetrics.coverage.iou(y_true, y_pred, k=None, relevance=<function multilabel>, n_uniq=<function relevant_counts>)

Compute the approximate version of Intersection over Union. The approximation lies in the assumption that y_true and y_pred contain only unique values.

Parameters
y_true : scalar, iterable or ndarray of shape (n_samples, n_labels)

True labels of entities to be ranked. In case of scalars y_pred should be of shape (1, n_labels).

y_pred : iterable, ndarray of shape (n_samples, n_labels)

Target labels sorted by relevance (as returned by an IR system).

k : int, default=None

Has no effect; provided only for API compatibility.

relevance : callable, default=topk.relevance.multilabel

A function that calculates relevance judgements based on input y_pred and y_true.

n_uniq : callable, default=topk.relevance.relevant_counts

A function that calculates the number of unique labels per query.

Returns
iou : float in [0., 1.]

The ratio of relevant retrieved entries to the union of relevant and retrieved entries.

References

Wikipedia entry for Jaccard Index

Examples

>>> from irmetrics.coverage import iou

for ground-truth labels related to a query:

>>> y_true = 1

and the predicted labels by an IR system:

>>> y_pred = [0, 1, 4]
>>> iou(y_true, y_pred)
0.3333333333333333
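
For comparison, the exact set-based Jaccard index can be computed with Python sets. When y_true and y_pred contain only unique values, as the approximation assumes, the two should agree. The helper name jaccard_index is hypothetical and is not the irmetrics implementation.

```python
def jaccard_index(y_true, y_pred):
    """Exact intersection over union of two label collections."""
    true_labels, pred_labels = set(y_true), set(y_pred)
    union = true_labels | pred_labels
    return len(true_labels & pred_labels) / len(union) if union else 0.0


score = jaccard_index([1], [0, 1, 4])  # intersection {1}, union {0, 1, 4}
```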

Utilities

irmetrics.relevance.multilabel(y_true, y_pred)

Compute relevance(s) of predicted labels.

Parameters
y_true : ndarray of shape (n_samples, n_true), where n_samples >= 1

Ground truth labels for a given query (as returned by an IR system).

y_pred : ndarray of shape (n_samples, n_labels), where n_samples >= 1

Target labels sorted by relevance (as returned by an IR system). n_labels and n_true need not be the same.

Returns
relevance : boolean ndarray

The relevance judgements for y_pred of shape (n_samples, n_labels).

Examples

>>> import numpy as np
>>> from irmetrics.relevance import multilabel
>>> # ground-truth label of some answers to a query:
>>> y_true = np.array([[1]]) # (1, 1)

and the predicted labels by an IR system:

>>> y_pred = np.array([[0, 1, 4]]) # (1, 3)
>>> multilabel(y_true, y_pred)
array([[False,  True, False]])
>>> y_true = np.array([[1], [2]]) # (2, 1)
>>> y_pred = np.array([[0, 1, 4], [5, 6, 7]]) # (2, 3)
>>> multilabel(y_true, y_pred)
array([[False,  True, False],
       [False, False, False]])
>>> # Now the multilabel case:
>>> y_true = np.array([[1, 4]]) # (1, 2)
>>> y_pred = np.array([[0, 1, 4]]) # (1, 3)
>>> multilabel(y_true, y_pred)
array([[False,  True,  True]])
irmetrics.relevance.relevant_counts(y_pred, y_true)

Calculate the total number of relevant items.

Parameters
y_true : ndarray of shape (n_samples, n_true), where n_samples >= 1

Ground truth labels for a given query (as returned by an IR system).

y_pred : ndarray of shape (n_samples, n_labels), where n_samples >= 1

Target labels sorted by relevance (as returned by an IR system). n_labels and n_true need not be the same.

Returns
relevance_counts : ndarray

The number of true relevance judgements for y_pred.

Examples

>>> import numpy as np
>>> from irmetrics.relevance import relevant_counts
>>> # ground-truth label of some answers to a query:
>>> y_true = np.array([[1]]) # (1, 1)

and the predicted labels by an IR system:

>>> y_pred = np.array([[0, 1, 4]]) # (1, 3)
>>> relevant_counts(y_true, y_pred)
array([[1]])
>>> y_true = np.array([[1], [2]]) # (2, 1)
>>> y_pred = np.array([[0, 1, 4], [5, 6, 7]]) # (2, 3)
>>> relevant_counts(y_true, y_pred)
array([[1],
       [1]])
>>> # Now the multilabel case:
>>> y_true = np.array([[1, 4]]) # (1, 2)
>>> y_pred = np.array([[0, 1, 4]]) # (1, 3)
>>> relevant_counts(y_true, y_pred)
array([[1, 1]])
irmetrics.relevance.unilabel(y_true, y_pred)

Compute relevance(s) of predicted labels. This version of the relevance function works only for queries (problems) with a single ground truth label.

It is provided mainly for two reasons: there is a slight speedup (on the order of seconds for large n_samples) and it adds expressivity if needed.

Parameters
y_true : ndarray of shape (n_samples, 1), where n_samples >= 1

Ground truth labels for a given query (as returned by an IR system).

y_pred : ndarray of shape (n_samples, n_labels), where n_samples >= 1

Target labels sorted by relevance (as returned by an IR system).

Returns
relevance : boolean ndarray

The relevance judgements for y_pred of shape (n_samples, n_labels).

Raises
ValueError

If y_true has last dimension larger than 1 (multilabel case).

Examples

>>> import numpy as np
>>> from irmetrics.relevance import unilabel
>>> # ground-truth label of some answers to a query:
>>> y_true = np.array([[1]]) # (1, 1)

and the predicted labels by an IR system:

>>> y_pred = np.array([[0, 1, 4]]) # (1, 3)
>>> unilabel(y_true, y_pred)
array([[False,  True, False]])
>>> y_true = np.array([[1], [2]]) # (2, 1)
>>> y_pred = np.array([[0, 1, 4], [5, 6, 7]]) # (2, 3)
>>> unilabel(y_true, y_pred)
array([[False,  True, False],
       [False, False, False]])
irmetrics.flat.flat(df, query_col, relevance_col, measure, k=None)

Calculate the corresponding measure for the data in flat format, with precalculated relevance judgements:

query_col  relevance_col  weights_col
1          0              1.0
1          1              2.0
1          0              3.0
2          0              4.0
2          1              5.0
2          1              6.0
2          1              7.0

Parameters
df : pandas.DataFrame

Dataset in the flat form: each row corresponds to a sample with the given query_id and relevance judgement (higher is better).

query_col : str

The column that corresponds to the query identifier.

relevance_col : str

The column that corresponds to relevance judgements.

measure : callable

The desired measure to be calculated (one from irmetrics.topk). Currently, only topk.ndcg and topk.rr are supported.

k : int, default=None

Only consider the highest k scores in the ranking. If None, use all outputs.

Returns
measures : pandas.core.series.Series

The values of the corresponding measure calculated for each query.

Examples

>>> import pandas as pd
>>> from irmetrics.topk import rr
>>> from irmetrics.flat import flat
>>> df = pd.DataFrame({"quid": [1, 1, 2, 2], "rel": [1, 0, 0, 1]})
>>> flat(df, query_col="quid", relevance_col="rel", measure=rr)
quid
1    1.0
2    0.5
Name: rel, dtype: float64
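
The per-query grouping that the flat format implies can be sketched without pandas. The sketch below assumes that rows appear in ranked order within each query and that the measure receives the list of precalculated relevance judgements; the helper names rr_from_relevance and flat_measure are hypothetical and are not part of irmetrics.

```python
from collections import defaultdict


def rr_from_relevance(relevance):
    """Reciprocal rank computed from precalculated relevance judgements."""
    for position, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / position
    return 0.0


def flat_measure(rows, measure):
    """Group (query_id, relevance) rows by query and apply measure to each group."""
    grouped = defaultdict(list)
    for query_id, rel in rows:
        grouped[query_id].append(rel)  # keeps the original row order per query
    return {query_id: measure(rels) for query_id, rels in grouped.items()}


# Same data as the DataFrame above: (quid, rel) pairs.
rows = [(1, 1), (1, 0), (2, 0), (2, 1)]
per_query = flat_measure(rows, rr_from_relevance)
```

With these rows, per_query maps query 1 to 1.0 and query 2 to 0.5, in line with the output shown above.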