# Machine Translation

# seq2seq

# Bert

BERT uses exactly the same architecture as the Transformer (proposed in "Attention Is All You Need"), but is pre-trained in a novel way to facilitate bidirectional learning.

Besides the positional embedding and the word embedding, BERT adds a token_type embedding (segment IDs) so the input can encode sentence pairs and extend to other tasks.
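
A minimal sketch of where the segment IDs show up in practice, assuming the huggingface transformers library and the bert-base-uncased checkpoint (both are illustrative choices, not mentioned above):

```python
# Inspect the three signals BERT sums into its input embedding:
# word-piece IDs, token_type (segment) IDs, and implicit positions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
enc = tokenizer("How old are you?", "I am six.")                # a sentence pair

print(enc["input_ids"])       # word-piece token IDs
print(enc["token_type_ids"])  # 0 for the first sentence, 1 for the second
# position IDs are implicit: 0, 1, 2, ... over the whole sequence
```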

# Word embedding

# latent semantic analysis

SVD: a matrix-factorization method based on global corpus statistics.
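
A minimal LSA sketch, assuming scikit-learn and a toy corpus (both illustrative): build a TF-IDF term-document matrix from global counts, then factorize it with truncated SVD.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
    "investors sold stocks as markets fell",
]

X = TfidfVectorizer().fit_transform(docs)       # documents x terms (global statistics)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = lsa.fit_transform(X)                 # dense low-dimensional document vectors
print(doc_vecs.shape)                           # (4, 2)
```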

# word2vec

Local context (sliding window).

Center and context words. Skip-gram: use the center word to predict its context words with a shallow neural network.

  • [Reading](https://people.eng.unimelb.edu.au/mbouadjenek/papers/wordembed.pdf)
  • Negative sampling
    • the softmax over a very large vocabulary of context words is time-consuming
    • noise contrastive estimation converts it into binary classification
    • negative sampling simplifies this further (see the sketch after this list)
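
A minimal numpy sketch of one skip-gram-with-negative-sampling update (toy sizes and random IDs are illustrative, not from the notes): the positive (center, context) pair is pushed toward label 1 and the sampled negatives toward 0, i.e. binary classification instead of a full softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, lr, k = 10, 8, 0.05, 3

W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # center-word vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives):
    v = W_in[center]
    ids = np.array([context] + list(negatives))
    labels = np.array([1.0] + [0.0] * len(negatives))   # 1 = real pair, 0 = noise
    u = W_out[ids]                                      # (k+1, dim)
    grad = (sigmoid(u @ v) - labels)[:, None]           # logistic-loss gradient
    W_in[center] -= lr * (grad * u).sum(axis=0)
    W_out[ids] -= lr * grad * v

# word2vec draws negatives from a unigram^0.75 distribution; uniform here for brevity
sgns_step(center=2, context=5, negatives=rng.integers(0, vocab_size, size=k))
```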

# GloVe

Uses both global corpus statistics (overall co-occurrence counts) and local context features (i.e., a sliding window).
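
For reference, the standard GloVe objective makes this combination explicit: word-vector dot products are fit to the logs of global co-occurrence counts $X_{ij}$, which are themselves collected with a local sliding window,

$$
J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2,
$$

where $f$ is a weighting function that down-weights very frequent co-occurrences.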

# Bert embedding

Context-aware: unlike word2vec/GloVe, the same word gets a different embedding depending on the sentence it appears in.
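
A sketch of the difference, assuming the transformers and torch packages and the bert-base-uncased checkpoint (illustrative choices): the vector for "bank" changes with the surrounding sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]              # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]                           # vector of that token

v1 = word_vector("I sat by the river bank.", "bank")
v2 = word_vector("I deposited cash at the bank.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # < 1: the context changes the vector
```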

# Tokenizer

# Byte Pair Encoding

Tokens are formed from character n-grams, e.g., ana, app. Since Unicode has 149k+ characters, a very large vocabulary is needed to cover the most frequently used n-grams, and it is still common to see [UNK] tokens in the model's input.

Details in huggingface tokenizer summary
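
A minimal sketch of the BPE merge loop (toy word frequencies, not from the notes): repeatedly count adjacent symbol pairs and merge the most frequent one.

```python
from collections import Counter

# toy word-frequency dictionary; "</w>" marks the end of a word
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

def pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    merged_vocab = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])   # merge the pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_vocab[" ".join(out)] = freq
    return merged_vocab

for step in range(5):
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(step, best)
```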

# Byte Level Encoding

A variation of BPE that uses bytes as the base unit to form n-grams. There are only 256 possible base units, so a vocabulary of around 50k already covers the most frequently used n-grams and never needs an [UNK] token. (Reference: 聚沙成塔:关于tokenization(词元化)的解疑释惑)
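
A quick illustration in plain Python: any Unicode string decomposes into bytes in the range 0-255, so a byte-level tokenizer always has a valid base symbol to fall back on.

```python
text = "héllo 世界"
print(list(text.encode("utf-8")))  # every value is < 256, even for non-Latin characters
```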

# WordPiece

Given a training corpus and an expected vocabulary size D, find the optimal vocabulary of size D that produces the minimal number of tokens when the corpus is tokenized with it.

# Bottom up

Bottom-up: described (poorly) in the paper "Japanese and Korean Voice Search". Similar to Byte Pair Encoding, except that it merges the pair of tokens that increases the likelihood of the training data the most, rather than the most frequent pair.

In the paper, it is confusing:

they need to "maximize the likelihood" of each pair, by building a new Language Model (LM) each step - they don't say what constitutes a LM

In huggingface, it is interpreted as:

So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is equivalent to finding the symbol pair, whose probability divided by the probabilities of its first symbol followed by its second symbol is the greatest among all symbol pairs. E.g. "u", followed by "g" would have only been merged if the probability of "ug" divided by "u", "g" would have been greater than for any other symbol pair. Intuitively, WordPiece is slightly different to BPE in that it evaluates what it loses by merging two symbols to ensure it’s worth it.

In zhihu, it is interpreted as:

The merge decision is analogous to evaluating information gain at a decision-tree split: before splitting a node on some feature x, the entropy is $-\log p(y)$; after the split it is $-\log p(y_1) - \log p(y_2)$; so the gain of the split is $\log p(y) - \log p(y_1) - \log p(y_2)$, which simplifies to $\log\left(\frac{p(y)}{p(y_1)\,p(y_2)}\right)$. In other words, merging "e" and "s" is judged by the improvement in probability (likelihood), not by raw frequency alone.
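
A minimal sketch of that score on toy counts (corpus and numbers are illustrative, not from the notes): pick the pair whose frequency divided by the product of its parts' frequencies is largest, which has the same argmax as the likelihood gain $\log p(y) - \log p(y_1) - \log p(y_2)$ up to constants.

```python
from collections import Counter

corpus = ["h u g s", "p u g", "p u n", "b u n", "h u g"]  # words as space-separated symbols

symbol_counts, pair_counts = Counter(), Counter()
for word in corpus:
    symbols = word.split()
    symbol_counts.update(symbols)
    pair_counts.update(zip(symbols, symbols[1:]))

def merge_score(pair):
    # freq(pair) / (freq(first) * freq(second)) ~ p(merged) / (p(first) * p(second))
    return pair_counts[pair] / (symbol_counts[pair[0]] * symbol_counts[pair[1]])

best = max(pair_counts, key=merge_score)
print(best, merge_score(best))   # ('g', 's') wins despite not being the most frequent pair
```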

# Top down

Used by BERT. Start with whole words and break them down into smaller components until each piece meets the frequency threshold or can't be broken down further. For Japanese, Chinese and Korean this top-down approach doesn't work, since there are no explicit word boundaries to start from.

Details in tensorflow text library
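
For the tokenization side (how BERT applies an already-built WordPiece vocabulary), the rule is greedy longest-match-first with a ## prefix on non-initial pieces; a minimal sketch with a made-up toy vocabulary:

```python
# Greedy longest-match-first WordPiece tokenization. Toy vocabulary, for illustration only.
VOCAB = {"un", "##aff", "##able", "##ly", "aff", "able", "[UNK]"}

def wordpiece_tokenize(word, vocab=VOCAB):
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:                      # try the longest remaining substring first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece            # non-initial pieces get the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]                    # no piece matched: whole word is unknown
        tokens.append(cur)
        start = end
    return tokens

print(wordpiece_tokenize("unaffable"))  # ['un', '##aff', '##able']
```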

# Evaluation

# Similarity

Way 1: sklearn.feature_extraction.text.CountVectorizer. Tokenize the corpus first, then build a count matrix for the predictions and the ground truth respectively; row-wise cosine similarity between the two matrices gives the score.

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import cosine
import pandas as pd

prediction = ['sentence 1', 'sentence two']   # model outputs
truth = ['sentence 1', 'sentence 2']          # references

# supports both char and word n-grams via `analyzer`
vec = CountVectorizer(ngram_range=(1, 2), analyzer='char')
vec.fit(prediction + truth)

m1 = vec.transform(prediction).toarray()
m2 = vec.transform(truth).toarray()


def score_calc(v1, v2):
    # scipy returns cosine *distance* (and NaN for all-zero vectors);
    # convert to similarity and treat NaN as 0
    score = cosine(v1, v2)
    return 0 if pd.isna(score) else 1 - score


# one similarity score per (prediction, truth) row pair
tmp_df = pd.DataFrame({'pred': list(m1), 'truth': list(m2)})
scores = tmp_df.apply(lambda row: score_calc(row['pred'], row['truth']), axis=1)
```

Way 2: Levenshtein distance (number of insertions, deletions, substitutions, etc.)

```python
from fuzzywuzzy import fuzz  # or the maintained fork: thefuzz

str1 = "Apple Inc."
str2 = "apple Inc"
ratio = fuzz.ratio(str1.lower(), str2.lower())  # 0-100 similarity based on edit distance
print(ratio)
```