Bilingual Evaluation Understudy (BLEU) measures the similarity between a candidate translation and reference translations via n-gram precision:
$$ \textrm{BLEU} = \exp\left(\min\left(0, 1 - \frac{\text{len}_\text{label}}{\text{len}_\text{pred}}\right)\right)\prod_{n=1}^k p_n^{w_n} $$
The first factor penalizes translations that are too short: $\text{len}_\text{pred}$ is the length of the predicted translation and $\text{len}_\text{label}$ is the length of the shortest reference translation. In the second factor, $p_n$ is the n-gram precision; typically $k = 4$ and $w_n = \frac14$ (the first implementation below instead uses $w_n = \frac{1}{2^n}$).
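As a quick numeric sanity check of the brevity-penalty factor (the helper name `brevity_penalty` and the toy lengths are invented for illustration):

```python
import math

def brevity_penalty(len_label, len_pred):
    # exp(min(0, 1 - len_label / len_pred)): equals 1 when the prediction
    # is at least as long as the shortest reference, < 1 otherwise.
    return math.exp(min(0, 1 - len_label / len_pred))

print(brevity_penalty(6, 6))  # 1.0 -- no penalty
print(brevity_penalty(3, 6))  # 1.0 -- overlong predictions are not penalized here
print(brevity_penalty(6, 3))  # exp(-1), prediction half the reference length
```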
For a machine translation $y^*$ and reference translations $y_1, \dots , y_m$,
$$ p_n = \frac{\sum_{\text{ngram}\in y^*}\text{COUNT}_\text{match}(\text{ngram})}{\sum_{\text{ngram}\in y^*}\text{COUNT}(\text{ngram})} $$
With n-gram precision alone, shorter translations tend to score higher, and the metric is insensitive to words that were left untranslated.
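A toy computation of the clipped n-gram precision $p_n$ (the function name `ngram_precision` and the sentences are invented for illustration; $\text{COUNT}_\text{match}$ clips each predicted n-gram to at most its count in the reference):

```python
import collections

def ngram_precision(pred_tokens, label_tokens, n):
    """Clipped n-gram precision of a prediction against one reference."""
    pred = collections.Counter(
        tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1))
    label = collections.Counter(
        tuple(label_tokens[i:i + n]) for i in range(len(label_tokens) - n + 1))
    # Counter & takes the elementwise minimum, i.e. the clipped match counts
    matches = sum((pred & label).values())
    return matches / sum(pred.values())

pred = 'the the the cat'.split()
label = 'the cat sat'.split()
# Unigrams: 'the' appears 3 times in pred but is clipped to 1 match,
# plus 1 match for 'cat' -> 2 / 4
print(ngram_precision(pred, label, 1))  # 0.5
```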
Drawback: BLEU cannot measure translation quality at the semantic level.
A straightforward sentence-level implementation of the formula above, using $w_n = \frac{1}{2^n}$:
```python
import collections
import math

def bleu(pred_seq, label_seq, k):
    """Compute BLEU with weights w_n = 1 / 2**n."""
    pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    # Brevity penalty
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, k + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        # Count every n-gram of the reference
        for i in range(len_label - n + 1):
            label_subs[' '.join(label_tokens[i: i + n])] += 1
        # Clipped matches: each reference n-gram can be matched at most
        # as many times as it occurs in the reference
        for i in range(len_pred - n + 1):
            if label_subs[' '.join(pred_tokens[i: i + n])] > 0:
                num_matches += 1
                label_subs[' '.join(pred_tokens[i: i + n])] -= 1
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score
```
A corpus-level implementation that supports multiple references per translation and optional smoothing:

```python
import collections
import math

def _get_ngrams(segment, max_order):
    """Extracts all n-grams up to a given maximum order from an input segment.

    Args:
        segment: text segment from which n-grams will be extracted.
        max_order: maximum length in tokens of the n-grams returned by this
            method.

    Returns:
        The Counter containing all n-grams up to max_order in segment
        with a count of how many times each n-gram occurred.
    """
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(len(segment) - order + 1):
            ngram = tuple(segment[i:i + order])
            ngram_counts[ngram] += 1
    return ngram_counts

def compute_bleu(reference_corpus, translation_corpus, max_order=4,
                 smooth=False):
    """Computes BLEU score of translated segments against one or more references.

    Args:
        reference_corpus: list of lists of references for each translation. Each
            reference should be tokenized into a list of tokens.
        translation_corpus: list of translations to score. Each translation
            should be tokenized into a list of tokens.
        max_order: Maximum n-gram order to use when computing BLEU score.
        smooth: Whether or not to apply Lin et al. 2004 smoothing.

    Returns:
        6-Tuple with the BLEU score, n-gram precisions, brevity penalty,
        length ratio, translation length and reference length.
    """
    matches_by_order = [0] * max_order
    possible_matches_by_order = [0] * max_order
    reference_length = 0
    translation_length = 0
    for (references, translation) in zip(reference_corpus,
                                         translation_corpus):
        # The brevity penalty uses the shortest reference for each translation
        reference_length += min(len(r) for r in references)
        translation_length += len(translation)

        # Merge references: |= keeps the maximum count of each n-gram
        merged_ref_ngram_counts = collections.Counter()
        for reference in references:
            merged_ref_ngram_counts |= _get_ngrams(reference, max_order)
        translation_ngram_counts = _get_ngrams(translation, max_order)
        # Clipped matches: & takes the elementwise minimum
        overlap = translation_ngram_counts & merged_ref_ngram_counts
        for ngram in overlap:
            matches_by_order[len(ngram) - 1] += overlap[ngram]
        for order in range(1, max_order + 1):
            possible_matches = len(translation) - order + 1
            if possible_matches > 0:
                possible_matches_by_order[order - 1] += possible_matches

    precisions = [0] * max_order
    for i in range(max_order):
        if smooth:
            # Add-one smoothing avoids zero precisions on short corpora
            precisions[i] = ((matches_by_order[i] + 1.) /
                             (possible_matches_by_order[i] + 1.))
        else:
            if possible_matches_by_order[i] > 0:
                precisions[i] = (float(matches_by_order[i]) /
                                 possible_matches_by_order[i])
            else:
                precisions[i] = 0.0

    # Geometric mean of the n-gram precisions (0 if any precision is 0)
    if min(precisions) > 0:
        p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions)
        geo_mean = math.exp(p_log_sum)
    else:
        geo_mean = 0

    ratio = float(translation_length) / reference_length
    if ratio > 1.0:
        bp = 1.
    else:
        bp = math.exp(1 - 1. / ratio)
    bleu = geo_mean * bp
    return (bleu, precisions, bp, ratio, translation_length, reference_length)
```
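The multi-reference clipping above relies on `collections.Counter` operator semantics: `|=` keeps the maximum count of each n-gram across references, and `&` takes the elementwise minimum against the translation's counts. A small sketch of those semantics (the counts are invented for illustration):

```python
import collections

ref1 = collections.Counter({('the',): 2, ('cat',): 1})
ref2 = collections.Counter({('the',): 1, ('sat',): 1})

merged = collections.Counter()
merged |= ref1  # union: elementwise maximum
merged |= ref2  # now ('the',): 2, ('cat',): 1, ('sat',): 1

trans = collections.Counter({('the',): 3, ('sat',): 1})
overlap = trans & merged  # intersection: elementwise minimum
# ('the',) is clipped from 3 down to 2; ('sat',) matches once
print(sum(overlap.values()))  # 3 clipped matches in total
```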
Perplexity measures language-model quality rather than translation quality.
Given a ground-truth sequence, a language model is evaluated by the average cross-entropy loss over all $n$ tokens of the sequence:
$$ \frac1n\sum_{t=1}^n -\log P(x_t\mid x_{t-1},\dots ,x_1) $$
Perplexity is defined as the exponential of this quantity:
$$ \textrm{ppl}(x) = \exp\left(\frac1n\sum_{t=1}^n -\log P(x_t\mid x_{t-1},\dots ,x_1)\right) = \prod_{t=1}^n P(x_t\mid x_{t-1},\dots ,x_1)^{-\frac1n} $$
Lower perplexity is better: the best case is $1$ (the model assigns probability 1 to every token) and the worst case is $+\infty$.
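As a sanity check that the two forms of the definition agree (the per-token probabilities are invented for illustration):

```python
import math

# Hypothetical per-token probabilities P(x_t | x_{t-1}, ..., x_1)
probs = [0.2, 0.5, 0.9, 0.4]
n = len(probs)

# Form 1: exponential of the average negative log-likelihood
ppl_exp = math.exp(sum(-math.log(p) for p in probs) / n)
# Form 2: product of P(x_t | ...) ** (-1/n)
ppl_prod = math.prod(p ** (-1 / n) for p in probs)

assert abs(ppl_exp - ppl_prod) < 1e-9
print(ppl_exp)  # ~2.30: on average the model is as uncertain as a
                # uniform choice over ~2.3 tokens at each step
```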