Bilingual Evaluation Understudy (BLEU) measures the similarity between a candidate translation and reference translations via n-gram precision:
$$ \textrm{BLEU} = \exp\left(\min\left(0, 1 - \frac{\text{len}_\text{label}}{\text{len}_\text{pred}}\right)\right)\prod_{n=1}^k p_n^{w_n} $$
The first factor penalizes translations that are too short: $\text{len}_\text{pred}$ is the length of the predicted translation and $\text{len}_\text{label}$ is the length of the shortest reference translation. In the second factor, $p_n$ is the n-gram precision; typically $k = 4$ and $w_n = \frac14$ (the first implementation below instead uses $w_n = \frac{1}{2^n}$).
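As a quick numeric sanity check of the brevity-penalty factor (the helper name `brevity_penalty` and the toy lengths are invented for illustration):

```python
import math

def brevity_penalty(len_label, len_pred):
    # exp(min(0, 1 - len_label / len_pred)): equals 1 when the prediction
    # is at least as long as the shortest reference, < 1 otherwise.
    return math.exp(min(0, 1 - len_label / len_pred))

print(brevity_penalty(6, 6))  # 1.0 -- no penalty
print(brevity_penalty(3, 6))  # 1.0 -- overlong predictions are not penalized here
print(brevity_penalty(6, 3))  # exp(-1), prediction half the reference length
```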
For a machine translation $y^*$ and reference translations $y_1, \dots , y_m$,
$$ p_n = \frac{\sum_{\text{ngram}\in y^*}\text{COUNT}_\text{match}(\text{ngram})}{\sum_{\text{ngram}\in y^*}\text{COUNT}(\text{ngram})} $$
With n-gram precision alone, shorter translations tend to score higher, and the metric is insensitive to words that were left untranslated.
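A toy computation of the clipped n-gram precision $p_n$ (the function name `ngram_precision` and the sentences are invented for illustration; $\text{COUNT}_\text{match}$ clips each predicted n-gram to at most its count in the reference):

```python
import collections

def ngram_precision(pred_tokens, label_tokens, n):
    """Clipped n-gram precision of a prediction against one reference."""
    pred = collections.Counter(
        tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1))
    label = collections.Counter(
        tuple(label_tokens[i:i + n]) for i in range(len(label_tokens) - n + 1))
    # Counter & takes the elementwise minimum, i.e. the clipped match counts
    matches = sum((pred & label).values())
    return matches / sum(pred.values())

pred = 'the the the cat'.split()
label = 'the cat sat'.split()
# Unigrams: 'the' appears 3 times in pred but is clipped to 1 match,
# plus 1 match for 'cat' -> 2 / 4
print(ngram_precision(pred, label, 1))  # 0.5
```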
Drawback: BLEU cannot measure translation quality at the semantic level.
A straightforward sentence-level implementation of the formula above, using $w_n = \frac{1}{2^n}$:
```python
import collections
import math

def bleu(pred_seq, label_seq, k):
    """Compute BLEU with weights w_n = 1 / 2**n."""
    pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    # Brevity penalty
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, k + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        # Count every n-gram of the reference
        for i in range(len_label - n + 1):
            label_subs[' '.join(label_tokens[i: i + n])] += 1
        # Clipped matches: each reference n-gram can be matched at most
        # as many times as it occurs in the reference
        for i in range(len_pred - n + 1):
            if label_subs[' '.join(pred_tokens[i: i + n])] > 0:
                num_matches += 1
                label_subs[' '.join(pred_tokens[i: i + n])] -= 1
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score
```
A corpus-level implementation that supports multiple references per translation and optional smoothing:

```python
import collections
import math

def _get_ngrams(segment, max_order):
    """Extracts all n-grams up to a given maximum order from an input segment.

    Args:
        segment: text segment from which n-grams will be extracted.
        max_order: maximum length in tokens of the n-grams returned by this
            method.

    Returns:
        The Counter containing all n-grams up to max_order in segment
        with a count of how many times each n-gram occurred.
    """
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(len(segment) - order + 1):
            ngram = tuple(segment[i:i + order])
            ngram_counts[ngram] += 1
    return ngram_counts

def compute_bleu(reference_corpus, translation_corpus, max_order=4,
                 smooth=False):
    """Computes BLEU score of translated segments against one or more references.

    Args:
        reference_corpus: list of lists of references for each translation. Each
            reference should be tokenized into a list of tokens.
        translation_corpus: list of translations to score. Each translation
            should be tokenized into a list of tokens.
        max_order: Maximum n-gram order to use when computing BLEU score.
        smooth: Whether or not to apply Lin et al. 2004 smoothing.

    Returns:
        6-Tuple with the BLEU score, n-gram precisions, brevity penalty,
        length ratio, translation length and reference length.
    """
    matches_by_order = [0] * max_order
    possible_matches_by_order = [0] * max_order
    reference_length = 0
    translation_length = 0
    for (references, translation) in zip(reference_corpus,
                                         translation_corpus):
        # The brevity penalty uses the shortest reference for each translation
        reference_length += min(len(r) for r in references)
        translation_length += len(translation)

        # Merge references: |= keeps the maximum count of each n-gram
        merged_ref_ngram_counts = collections.Counter()
        for reference in references:
            merged_ref_ngram_counts |= _get_ngrams(reference, max_order)
        translation_ngram_counts = _get_ngrams(translation, max_order)
        # Clipped matches: & takes the elementwise minimum
        overlap = translation_ngram_counts & merged_ref_ngram_counts
        for ngram in overlap:
            matches_by_order[len(ngram) - 1] += overlap[ngram]
        for order in range(1, max_order + 1):
            possible_matches = len(translation) - order + 1
            if possible_matches > 0:
                possible_matches_by_order[order - 1] += possible_matches

    precisions = [0] * max_order
    for i in range(max_order):
        if smooth:
            # Add-one smoothing avoids zero precisions on short corpora
            precisions[i] = ((matches_by_order[i] + 1.) /
                             (possible_matches_by_order[i] + 1.))
        else:
            if possible_matches_by_order[i] > 0:
                precisions[i] = (float(matches_by_order[i]) /
                                 possible_matches_by_order[i])
            else:
                precisions[i] = 0.0

    # Geometric mean of the n-gram precisions (0 if any precision is 0)
    if min(precisions) > 0:
        p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions)
        geo_mean = math.exp(p_log_sum)
    else:
        geo_mean = 0

    ratio = float(translation_length) / reference_length
    if ratio > 1.0:
        bp = 1.
    else:
        bp = math.exp(1 - 1. / ratio)
    bleu = geo_mean * bp
    return (bleu, precisions, bp, ratio, translation_length, reference_length)
```
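The multi-reference clipping above relies on `collections.Counter` operator semantics: `|=` keeps the maximum count of each n-gram across references, and `&` takes the elementwise minimum against the translation's counts. A small sketch of those semantics (the counts are invented for illustration):

```python
import collections

ref1 = collections.Counter({('the',): 2, ('cat',): 1})
ref2 = collections.Counter({('the',): 1, ('sat',): 1})

merged = collections.Counter()
merged |= ref1  # union: elementwise maximum
merged |= ref2  # now ('the',): 2, ('cat',): 1, ('sat',): 1

trans = collections.Counter({('the',): 3, ('sat',): 1})
overlap = trans & merged  # intersection: elementwise minimum
# ('the',) is clipped from 3 down to 2; ('sat',) matches once
print(sum(overlap.values()))  # 3 clipped matches in total
```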
Perplexity measures language-model quality rather than translation quality.
Given a ground-truth sequence, a language model is evaluated by the average cross-entropy loss over all $n$ tokens of the sequence:
$$ \frac1n\sum_{t=1}^n -\log P(x_t\mid x_{t-1},\dots ,x_1) $$
Perplexity is defined as the exponential of this quantity:
$$ \textrm{ppl}(x) = \exp\left(\frac1n\sum_{t=1}^n -\log P(x_t\mid x_{t-1},\dots ,x_1)\right) = \prod_{t=1}^n P(x_t\mid x_{t-1},\dots ,x_1)^{-\frac1n} $$
Lower perplexity is better: the best case is $1$ (the model assigns probability 1 to every token) and the worst case is $+\infty$.
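As a sanity check that the two forms of the definition agree (the per-token probabilities are invented for illustration):

```python
import math

# Hypothetical per-token probabilities P(x_t | x_{t-1}, ..., x_1)
probs = [0.2, 0.5, 0.9, 0.4]
n = len(probs)

# Form 1: exponential of the average negative log-likelihood
ppl_exp = math.exp(sum(-math.log(p) for p in probs) / n)
# Form 2: product of P(x_t | ...) ** (-1/n)
ppl_prod = math.prod(p ** (-1 / n) for p in probs)

assert abs(ppl_exp - ppl_prod) < 1e-9
print(ppl_exp)  # ~2.30: on average the model is as uncertain as a
                # uniform choice over ~2.3 tokens at each step
```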