API

This is the full API reference of all public methods.

Metrics

irmetrics.topk.ap(y_true, y_pred, k=None, relevance=<function multilabel>)

Compute Average Precision score(s). AP is an approximation of the integral over the precision-recall (PR) curve.

Parameters

y_true (scalar, iterable or ndarray of shape (n_samples, n_labels)): True labels of entities to be ranked. In case of scalars y_pred should be of shape (1, n_labels).
y_pred (iterable, ndarray of shape (n_samples, n_labels)): Target labels sorted by relevance (as returned by an IR system).
k (int, default=None): Only consider the highest k scores in the ranking. If None, use all outputs. The minimum of the number of correct answers and k will be used to compute the score.
relevance (callable, default=topk.relevance.multilabel): A function that calculates relevance judgements based on input y_pred and y_true.

Returns

ap (float): The average precision for a given sample.

References

Wikipedia entry for Mean Average Precision

Examples

>>> from irmetrics.topk import ap

for ground-truth labels related to a query:

>>> y_true = 1

and the predicted labels by an IR system:

>>> y_pred = [1, 0, 0]
>>> ap(y_true, y_pred)
0.3333333333333333
>>> # This should be fixed
>>> y_true = [1, 4, 5]

and the predicted labels by an IR system:

>>> y_pred = [1, 2, 3, 4, 5]
>>> ap(y_true, y_pred)
array([0.2, 0. , 0. ])
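To make the idea behind the score concrete, here is a minimal pure-Python sketch of the textbook definition of average precision over one ranked list of binary relevance judgements. The function name `average_precision` and its `n_relevant` argument are hypothetical and not part of irmetrics. The normalisation used by the library may differ (the first example above returns 0.333... for a hit at rank one, where the textbook definition gives 1.0), so treat this as an illustration rather than a reimplementation.

```python
def average_precision(relevance, n_relevant=None):
    """Textbook AP for one ranked list of binary relevance judgements."""
    hits = 0
    precision_sum = 0.0
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this cut-off: relevant items seen so far / position.
            precision_sum += hits / position
    # Normalise by the number of relevant items (defaults to those retrieved).
    total = hits if n_relevant is None else n_relevant
    return precision_sum / total if total else 0.0


# Relevant items at ranks 2 and 4: (1/2 + 2/4) / 2
print(average_precision([0, 1, 0, 1]))
```

Passing `n_relevant` lets the caller account for relevant items that were never retrieved, which lowers the score accordingly.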

irmetrics.topk.dcg_score(relevance, k=None, weights=1.0)

Compute Discounted Cumulative Gain score(s) based on relevance judgements provided.

This is provided as the internal implementation for ndcg. For this reason the API of this function differs slightly: it always accepts and outputs np.arrays, unlike other methods in this module.

Parameters

relevance (iterable or ndarray of shape (n_samples, n_labels), or simply (n_labels,)): The relevance judgements provided by experts. The last dimension of the parameter is used as position.
weights (scalar, iterable or ndarray of shape (n_samples,), default=1.0): Takes into account the importance of each sample, if relevant.
k (int, default=None): Only consider the highest k scores in the ranking. If None, use all outputs.

Returns

dcg (np.array): The discounted cumulative gains for samples (or a single sample).

References

Wikipedia entry for Discounted cumulative gain

Examples

>>> import numpy as np
>>> from irmetrics.topk import dcg_score

for ground-truth labels related to a query:

>>> relevance_judgements = np.array([[1, 0, 0, 0]])
>>> dcg_score(relevance_judgements)
array([1.])
>>> relevance_judgements = np.array([[True, False, False, False]])
>>> dcg_score(relevance_judgements)
array([1.])
>>> relevance_judgements = np.array([[False, True, False, False]])
>>> dcg_score(relevance_judgements)
array([0.63092975])
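The outputs above are consistent with a linear gain and a log2 position discount. A minimal pure-Python sketch under that assumption follows; the helper name `dcg` is hypothetical, it handles a single sample only, and it ignores the weights parameter.

```python
import math


def dcg(relevance, k=None):
    """DCG for one ranked list, assuming linear gain and a log2 discount."""
    ranked = list(relevance)[:k]  # k=None keeps the whole list
    return sum(
        float(rel) / math.log2(position + 1)
        for position, rel in enumerate(ranked, start=1)
    )


print(dcg([1, 0, 0, 0]))  # hit at rank 1: 1 / log2(2)
print(dcg([0, 1, 0, 0]))  # hit at rank 2: 1 / log2(3)
```

A relevant item at rank 2 contributes 1 / log2(3), which matches the 0.63092975 shown in the last example.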

irmetrics.topk.ndcg(y_true, y_pred, k=None, relevance=<function multilabel>, weights=1.0)

Compute Normalized Discounted Cumulative Gain score(s) based on relevance judgements provided.

Parameters

y_true (iterable or ndarray of shape (n_samples, n_labels), or simply (n_labels,)): True labels of entities to be ranked. The last dimension of the parameter is used as position.
y_pred (iterable, ndarray of shape (n_samples, n_labels)): Target labels sorted by relevance (as returned by an IR system).
k (int, default=None): Only consider the highest k scores in the ranking. If None, use all outputs.
weights (float, iterable, ndarray, default=1.0): Represents the weights of each sample.
relevance (callable, default=topk.relevance.multilabel): A function that calculates relevance judgements based on input y_pred and y_true.

Returns

ndcg (np.array): The normalized discounted cumulative gains for samples (or a single sample).

References

Wikipedia entry for normalized discounted cumulative gain

Examples

>>> from irmetrics.topk import ndcg

for ground-truth labels related to a query:

>>> y_true = [1, 2]
>>> y_pred = [0, 1, 0, 0]
>>> ndcg(y_true, y_pred)
0.6309297535714575
>>> # the order of y_true labels doesn't matter
>>> y_true = [2, 1]
>>> y_pred = [0, 1, 0, 0]
>>> ndcg(y_true, y_pred)
0.6309297535714575
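The normalization step can be sketched as DCG divided by the DCG of an ideal ordering. The sketch below assumes the ideal ordering is obtained by sorting the retrieved relevance judgements in descending order, which is consistent with the output above but is not confirmed by this reference. The names `dcg` and `ndcg_from_relevance` are hypothetical.

```python
import math


def dcg(relevance):
    """DCG for one ranked list, assuming linear gain and a log2 discount."""
    return sum(
        float(rel) / math.log2(position + 1)
        for position, rel in enumerate(relevance, start=1)
    )


def ndcg_from_relevance(relevance):
    """nDCG: DCG of the ranking divided by DCG of the best possible reordering."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0


# y_true = [1, 2] and y_pred = [0, 1, 0, 0] give these judgements:
print(ndcg_from_relevance([False, True, False, False]))
```

Because only membership in y_true matters when building the judgements, reordering y_true leaves the result unchanged, as the second example above shows.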

irmetrics.topk.precision(y_true, y_pred=None, k=None, relevance=<function multilabel>)

Compute Precision(s). Check which fraction of y_pred is in y_true. NB: When passing y_pred of shape [n_samples, n_outputs] the result is equivalent to recall(y_pred, y_true) / n_outputs.

Parameters

y_true (scalar, iterable or ndarray of shape (n_samples, n_labels)): True labels of entities to be ranked. In case of scalars y_pred should be of shape (1, n_labels).
y_pred (iterable, ndarray of shape (n_samples, n_labels)): Target labels sorted by relevance (as returned by an IR system).
k (int, default=None): Only consider the highest k scores in the ranking. If None, use all outputs.
relevance (callable, default=topk.relevance.multilabel): A function that calculates relevance judgements based on input y_pred and y_true.

Returns

precision (float in [0., 1.]): The fraction of y_pred entries that are in y_true, for all samples.

References

Wikipedia entry for precision and recall

Examples

>>> from irmetrics.topk import precision

for ground-truth labels related to a query:

>>> y_true = 1

and the predicted labels by an IR system:

>>> y_pred = [0, 1, 4, 3]
>>> precision(y_true, y_pred)
0.25
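The "fraction of y_pred that is in y_true" reading can be sketched for a single query in pure Python. The function name `precision_at_k` is hypothetical and the sketch does not go through the library's relevance callable.

```python
def precision_at_k(y_true, y_pred, k=None):
    """Fraction of the top-k predicted labels that appear among the true labels."""
    # Accept either a single true label or a collection of them.
    truth = set(y_true) if isinstance(y_true, (list, tuple, set)) else {y_true}
    ranked = list(y_pred)[:k]  # k=None keeps every prediction
    if not ranked:
        return 0.0
    return sum(label in truth for label in ranked) / len(ranked)


print(precision_at_k(1, [0, 1, 4, 3]))  # one hit out of four predictions
```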

irmetrics.topk.recall(y_true, y_pred=None, k=None, relevance=<function multilabel>)

Compute Recall(s). Check if at least one label proposed in y_pred is in y_true. This is a binary score: 0 if all predictions are irrelevant and 1 otherwise. This definition of recall is equivalent to accuracy@k.

Parameters

y_true (scalar, iterable or ndarray of shape (n_samples, n_labels)): True labels of entities to be ranked. In case of scalars y_pred should be of shape (1, n_labels).
y_pred (iterable, ndarray of shape (n_samples, n_labels)): Target labels sorted by relevance (as returned by an IR system).
k (int, default=None): Only consider the highest k scores in the ranking. If None, use all outputs.
relevance (callable, default=topk.relevance.multilabel): A function that calculates relevance judgements based on input y_pred and y_true.

Returns

recall (float, either 0. or 1.): The binary recall scores for all samples.

References

Wikipedia entry for precision and recall

Examples

>>> from irmetrics.topk import recall

for ground-truth labels related to a query:

>>> y_true = 1

and the predicted labels by an IR system:

>>> y_pred = [0, 1, 4]
>>> recall(y_true, y_pred)
1.0

irmetrics.topk.rr(y_true, y_pred, k=None, relevance=<function multilabel>)

Compute Reciprocal Rank(s). Calculate the reciprocal of the index of the first matched item in y_pred. The score is between 0 and 1.

This ranking metric yields a high value if true labels are ranked high by y_pred.

Parameters

y_true (scalar, iterable or ndarray of shape (n_samples, n_labels)): True labels of entities to be ranked. In case of scalars y_pred should be of shape (1, n_labels).
y_pred (iterable, ndarray of shape (n_samples, n_labels)): Target labels sorted by relevance (as returned by an IR system).
k (int, default=None): Only consider the highest k scores in the ranking. If None, use all outputs.
relevance (callable, default=topk.relevance.multilabel): A function that calculates relevance judgements based on input y_pred and y_true.

Returns

rr (float in [0., 1.]): The reciprocal ranks for all samples.

References

Wikipedia entry for Mean reciprocal rank

Examples

>>> from irmetrics.topk import rr
>>> y_true = 1

and the predicted labels by an IR system:

>>> y_pred = [0, 1, 4]
>>> rr(y_true, y_pred)
0.5
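For a single query, the reciprocal rank can be sketched as one pass over the ranking that stops at the first hit. The function name `reciprocal_rank` is hypothetical, and positions are assumed to be counted from 1, which agrees with the 0.5 above for a match in second place.

```python
def reciprocal_rank(y_true, y_pred, k=None):
    """1 / (1-based position of the first relevant label), or 0.0 if none is found."""
    # Accept either a single true label or a collection of them.
    truth = set(y_true) if isinstance(y_true, (list, tuple, set)) else {y_true}
    for position, label in enumerate(list(y_pred)[:k], start=1):
        if label in truth:
            return 1.0 / position
    return 0.0


print(reciprocal_rank(1, [0, 1, 4]))       # first match at position 2
print(reciprocal_rank(1, [0, 1, 4], k=1))  # the match falls outside the top 1
```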

irmetrics.coverage.coverage(y_pred, padding=None)

Compute Coverage(s). Check if y_pred contains any nontrivial results.

Parameters

y_pred (iterable, ndarray of shape (n_samples, n_labels)): Target labels sorted by relevance (as returned by an IR system).
padding (scalar, str, default=None): The value that was used to pad the predictions to get the same length.

Returns

coverage (int in [0, 1]): The coverage is 1 if y_pred contains any results different from padding and 0 otherwise.

Examples

>>> from irmetrics.coverage import coverage

for the predicted labels returned by an IR system:

>>> y_pred = [0, 1, 4]
>>> coverage(y_pred)
1
>>> y_pred = [0, None]
>>> coverage(y_pred)
1
>>> coverage([-1], padding=-1)
0
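The rule stated under Returns can be sketched for a single list of predictions as follows. The function name `coverage_sketch` is hypothetical, and the sketch covers only one query at a time.

```python
def coverage_sketch(y_pred, padding=None):
    """1 if any prediction differs from the padding value, else 0."""
    return int(any(label != padding for label in y_pred))


print(coverage_sketch([0, None]))         # 0 is a real result, so the query is covered
print(coverage_sketch([-1], padding=-1))  # only padding was returned
```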

irmetrics.coverage.iou(y_true, y_pred, k=None, relevance=<function multilabel>, n_uniq=<function relevant_counts>)

Compute the approximate version of Intersection over Union. The approximation relies on the assumption that y_true and y_pred contain only unique values.

Parameters

y_true (scalar, iterable or ndarray of shape (n_samples, n_labels)): True labels of entities to be ranked. In case of scalars y_pred should be of shape (1, n_labels).
y_pred (iterable, ndarray of shape (n_samples, n_labels)): Target labels sorted by relevance (as returned by an IR system).
k (int, default=None): Has no effect; provided only for API compatibility.
relevance (callable, default=topk.relevance.multilabel): A function that calculates relevance judgements based on input y_pred and y_true.
n_uniq (callable, default=topk.relevance.relevant_counts): A function that calculates the number of unique labels per query.

Returns

iou (float in [0., 1.]): The ratio of relevant retrieved entries to the union of relevant and retrieved entries.

References

Wikipedia entry for Jaccard Index

Examples

>>> from irmetrics.coverage import iou

for ground-truth labels related to a query:

>>> y_true = 1

and the predicted labels by an IR system:

>>> y_pred = [0, 1, 4]
>>> iou(y_true, y_pred)
0.3333333333333333
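For comparison, the exact Jaccard index on sets can be written in a few lines of pure Python. This is not the library's approximate computation; the function name `jaccard` is hypothetical. When both inputs contain only unique values, the two are expected to agree, as they do for the example above.

```python
def jaccard(y_true, y_pred):
    """Exact intersection over union of two label collections, treated as sets."""
    truth = set(y_true) if isinstance(y_true, (list, tuple, set)) else {y_true}
    predicted = set(y_pred)
    union = truth | predicted
    if not union:
        return 0.0
    return len(truth & predicted) / len(union)


print(jaccard(1, [0, 1, 4]))  # one shared label out of three distinct labels
```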

Utilities

irmetrics.relevance.multilabel(y_true, y_pred)

Compute relevance(s) of predicted labels.

Parameters

y_true (ndarray of shape (n_samples, n_true), where n_samples >= 1): Ground-truth labels for a given query (as returned by an IR system).
y_pred (ndarray of shape (n_samples, n_labels), where n_samples >= 1): Target labels sorted by relevance (as returned by an IR system). The n_labels and n_true may not be the same.

Returns

relevance (boolean ndarray): The relevance judgements for y_pred of shape (n_samples, n_labels).

Examples

>>> import numpy as np
>>> from irmetrics.relevance import multilabel
>>> # ground-truth label of some answers to a query:
>>> y_true = np.array([[1]]) # (1, 1)

and the predicted labels by an IR system:

>>> y_pred = np.array([[0, 1, 4]]) # (1, 3)
>>> multilabel(y_true, y_pred)
array([[False,  True, False]])
>>> y_true = np.array([[1], [2]]) # (2, 1)
>>> y_pred = np.array([[0, 1, 4], [5, 6, 7]]) # (2, 3)
>>> multilabel(y_true, y_pred)
array([[False,  True, False],
       [False, False, False]])
>>> # Now the multilabel case:
>>> y_true = np.array([[1, 4]]) # (1, 2)
>>> y_pred = np.array([[0, 1, 4]]) # (1, 3)
>>> multilabel(y_true, y_pred)
array([[False,  True,  True]])
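The membership test behind these judgements can be sketched with nested lists instead of arrays. The function name `multilabel_relevance` is hypothetical, and the library itself operates on ndarrays as documented above.

```python
def multilabel_relevance(y_true, y_pred):
    """Row by row, mark each predicted label that appears among the true labels."""
    judgements = []
    for truth, predictions in zip(y_true, y_pred):
        truth_set = set(truth)  # build the lookup once per query
        judgements.append([label in truth_set for label in predictions])
    return judgements


print(multilabel_relevance([[1, 4]], [[0, 1, 4]]))
```

The result has one row per query and one boolean per predicted label, mirroring the (n_samples, n_labels) shape described under Returns.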

irmetrics.relevance.relevant_counts(y_pred, y_true)

Calculate the total number of relevant items.

Parameters

y_true (ndarray of shape (n_samples, n_true), where n_samples >= 1): Ground-truth labels for a given query (as returned by an IR system).
y_pred (ndarray of shape (n_samples, n_labels), where n_samples >= 1): Target labels sorted by relevance (as returned by an IR system). The n_labels and n_true may not be the same.

Returns

relevance_counts (ndarray): The number of true relevance judgements for y_pred.

Examples

>>> import numpy as np
>>> from irmetrics.relevance import relevant_counts
>>> # ground-truth label of some answers to a query:
>>> y_true = np.array([[1]]) # (1, 1)

and the predicted labels by an IR system:

>>> y_pred = np.array([[0, 1, 4]]) # (1, 3)
>>> relevant_counts(y_true, y_pred)
array([[1]])
>>> y_true = np.array([[1], [2]]) # (2, 1)
>>> y_pred = np.array([[0, 1, 4], [5, 6, 7]]) # (2, 3)
>>> relevant_counts(y_true, y_pred)
array([[1],
       [1]])
>>> # Now the multilabel case:
>>> y_true = np.array([[1, 4]]) # (1, 2)
>>> y_pred = np.array([[0, 1, 4]]) # (1, 3)
>>> relevant_counts(y_true, y_pred)
array([[1, 1]])

irmetrics.relevance.unilabel(y_true, y_pred)

Compute relevance(s) of predicted labels. This version of the relevance function works only for queries (problems) with a single ground-truth label.

It is provided mainly for two reasons: there is a slight speedup (on the order of seconds for large n_samples) and it adds expressivity if needed.

Parameters

y_true (ndarray of shape (n_samples, 1), where n_samples >= 1): Ground-truth labels for a given query (as returned by an IR system).
y_pred (ndarray of shape (n_samples, n_labels), where n_samples >= 1): Target labels sorted by relevance (as returned by an IR system).

Returns

relevance (boolean ndarray): The relevance judgements for y_pred of shape (n_samples, n_labels).

Raises

ValueError: If y_true has last dimension larger than 1 (multilabel case).

Examples

>>> import numpy as np
>>> from irmetrics.relevance import unilabel
>>> # ground-truth label of some answers to a query:
>>> y_true = np.array([[1]]) # (1, 1)

and the predicted labels by an IR system:

>>> y_pred = np.array([[0, 1, 4]]) # (1, 3)
>>> unilabel(y_true, y_pred)
array([[False,  True, False]])
>>> y_true = np.array([[1], [2]]) # (2, 1)
>>> y_pred = np.array([[0, 1, 4], [5, 6, 7]]) # (2, 3)
>>> unilabel(y_true, y_pred)
array([[False,  True, False],
       [False, False, False]])

irmetrics.flat.flat(df, query_col, relevance_col, measure, k=None)

Calculate the corresponding measure for the data in flat format, with precalculated relevance judgements:

query_col	relevance_col	weights_col
1	0	1.0
1	1	2.0
1	0	3.0
2	0	4.0
2	1	5.0
2	1	6.0
2	1	7.0

Parameters

df (pandas.DataFrame): Dataset in the flat form: each row corresponds to a sample with the given query_id and relevance judgement (higher is better).
query_col (str): The column that corresponds to the query identifier.
relevance_col (str): The column that corresponds to relevance judgements.
measure (callable): The desired measure to be calculated (one from irmetrics.topk). Currently, only topk.ndcg and topk.rr are supported.
k (int, default=None): Only consider the highest k scores in the ranking. If None, use all outputs.

Returns

measures (pandas.core.series.Series): The values of the corresponding measure calculated for each query.

Examples

>>> import pandas as pd
>>> from irmetrics.topk import rr
>>> from irmetrics.flat import flat
>>> df = pd.DataFrame({"quid": [1, 1, 2, 2], "rel": [1, 0, 0, 1]})
>>> flat(df, query_col="quid", relevance_col="rel", measure=rr)
quid
1    1.0
2    0.5
Name: rel, dtype: float64
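The per-query grouping can be sketched without pandas as "collect the judgements of each query, then apply the measure to each group". The names `per_query` and `reciprocal_rank` are hypothetical, and the sketch assumes rows already appear in rank order within each query.

```python
from collections import defaultdict


def reciprocal_rank(relevance):
    """1 / (1-based position of the first relevant judgement), or 0.0 if none."""
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


def per_query(rows, measure):
    """Group (query_id, judgement) rows by query and score each group."""
    grouped = defaultdict(list)
    for query_id, judgement in rows:
        grouped[query_id].append(judgement)
    return {query_id: measure(judgements) for query_id, judgements in grouped.items()}


# Same data as the DataFrame above: (quid, rel) pairs.
rows = [(1, 1), (1, 0), (2, 0), (2, 1)]
print(per_query(rows, reciprocal_rank))
```

Query 1 has its relevant row first and query 2 has it second, which gives the same 1.0 and 0.5 as the Series shown above.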
