# Machine Translation
# seq2seq
# Bert
Bert uses exactly the same structure as the Transformer (proposed in Attention Is All You Need), but is pre-trained in a novel way to facilitate bidirectional learning
Besides the positional embedding and the word embedding, bert adds a token_type embedding (i.e., segment ids) so it can be extended to other tasks such as sentence pairs
- Transformer1
- multi-head self-attention, relation to other tokens: ([batch x from_seq x heads x embedding_head_size] * [batch x to_seq x heads x embedding_head_size] => [batch x heads x from_seq x to_seq]); see the masked-attention sketch at the end of this section
- Given `A [PAD]` as input, the attention score from `[PAD]` to `A` is 0, but the score from `A` to `[PAD]` is 0.5. Therefore `[PAD]` attending to `A` does not update the embedding weights for `[PAD]`, but `A` attending to `[PAD]` does update the embedding weights for `[PAD]`
- positional embedding can be fixed (sinusoidal) or a learnable variable
- what is sinusoidal? (see the sinusoidal position-embedding sketch at the end of this section)
- Transformer2
- Layer Norm
- GeLU
- Pre-training
- Given `[MASK]`, the model is aware that it needs to predict that token
- 15% of input tokens are selected for prediction: 80% are replaced with `[MASK]`, 10% with a random different token, and 10% keep the original token, which reduces the mismatch between pre-training and fine-tuning where `[MASK]` never appears (see the masking sketch at the end of this section)
- The output of the `[CLS]` token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases)
- bert is pre-trained on next-sentence classification, so segment ids are needed for bert pre-training
- WordPiece tokenizer, ['hacking'] => ['hack', '##ing']
- embedding table => w
- segment, position => b
- learned position embedding, used by bert
- sinusoidal position embedding, which is fixed rather than learned, is not used by bert
- segment ids [0, 0, 1] are extended to [seq_len(3), embedding_size] with randomly initialized weights; all positions with id 0 share the same vector, which is distinct from the vector for id 1
- more info
- the embedding process is like a fc layer whose input is a one-hot encoding (inefficient, hence the embedding lookup)
- only the attention mask is needed to prevent `[PAD]` tokens from affecting which weight variables get updated in back-propagation
- Given
- Usage
- Code
- Parameter calculation 1, Parameter calculation 2
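A minimal numpy sketch of the masked attention scores referenced above (shapes follow the bullet; the batch/head sizes and the toy padding mask are my own assumptions, not from the note): masked key positions are pushed to a large negative value before the softmax, so no token attends to `[PAD]`.

```python
import numpy as np

batch, heads, from_seq, to_seq, head_size = 2, 4, 5, 5, 8
q = np.random.randn(batch, heads, from_seq, head_size)   # queries
k = np.random.randn(batch, heads, to_seq, head_size)     # keys

# raw scores: [batch, heads, from_seq, to_seq]
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_size)

# attention mask: 1 = real token, 0 = [PAD]; here the last two positions are padding
mask = np.ones((batch, to_seq))
mask[:, -2:] = 0

# masked key positions get a large negative score, so their softmax weight is ~0
scores = np.where(mask[:, None, None, :] == 1, scores, -1e9)
scores = scores - scores.max(-1, keepdims=True)            # numerically stable softmax
probs = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
```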
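To answer "what is sinusoidal" above: a minimal sketch of the fixed sinusoidal position embedding from Attention Is All You Need, where PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)); the seq_len and d_model values here are arbitrary.

```python
import numpy as np

def sinusoidal_position_embedding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]              # [seq_len, 1]
    i = np.arange(0, d_model, 2)[None, :]          # even dimensions: [1, d_model/2]
    angles = pos / np.power(10000, i / d_model)    # [seq_len, d_model/2]
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims get sin
    pe[:, 1::2] = np.cos(angles)                   # odd dims get cos
    return pe                                      # fixed, no learned parameters

pe = sinusoidal_position_embedding(seq_len=128, d_model=64)   # [128, 64]
```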
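A minimal sketch of the 15% / 80-10-10 masking rule above; the token ids, vocab size, and the -100 "ignored label" convention are assumptions borrowed from common implementations, not from the note.

```python
import random

MASK_ID, VOCAB_SIZE = 103, 30522   # assumed ids (bert-base-uncased convention)

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = not predicted
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:      # select ~15% of tokens for prediction
            labels[i] = tok
            r = random.random()
            if r < 0.8:                      # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                    # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token
    return inputs, labels

inputs, labels = mask_tokens([7592, 1010, 2088, 999])   # toy ids, assumed
```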
# Word embedding
# latent semantic analysis
SVD: a matrix-factorization method based on global (corpus-wide) features.
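A minimal sketch of LSA with scikit-learn (the toy corpus is assumed): build a global term-document count matrix and factorize it with truncated SVD.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are pets",
          "stocks fell on the news"]          # toy corpus, assumed

X = CountVectorizer().fit_transform(corpus)   # [n_docs, n_terms] global count matrix
lsa = TruncatedSVD(n_components=2).fit(X)     # truncated SVD over the whole matrix
doc_vectors = lsa.transform(X)                # low-dimensional "latent semantic" vectors
```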
# word2vec
Uses local context only.
Center and context words. Skip-gram: use the center word to predict its context words with a neural network.
- [Reading](https://people.eng.unimelb.edu.au/mbouadjenek/papers/wordembed.pdf)
- Negative sampling
- softmax on very large context set is time-consuming
- noise contrastive estimation converts to binary classification
- negative sampling further simplifies this (see the sketch below)
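A minimal numpy sketch of the negative-sampling objective (the toy sizes and uniform negative sampling are my own simplifications, not from the reading above): each true (center, context) pair becomes a binary classification against k sampled noise words, avoiding a softmax over the whole vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 10000, 100, 5
W_in = rng.normal(size=(vocab_size, dim)) * 0.01    # center-word embeddings
W_out = rng.normal(size=(vocab_size, dim)) * 0.01   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center_id, context_id):
    v = W_in[center_id]
    pos = sigmoid(W_out[context_id] @ v)            # true pair -> label 1
    # word2vec samples negatives from a smoothed unigram distribution; uniform here for brevity
    neg_ids = rng.integers(0, vocab_size, size=k)   # k noise words -> label 0
    neg = sigmoid(-W_out[neg_ids] @ v)
    return -np.log(pos) - np.log(neg).sum()

loss = neg_sampling_loss(center_id=42, context_id=7)
```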
# GloVe
Uses both global corpus statistics (overall co-occurrence counts) and local context features (i.e., a sliding window).
# Bert embedding
Context-aware: the same word gets different embeddings depending on the surrounding sentence (see the sketch below).
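A minimal sketch, assuming the huggingface `transformers` and `torch` packages (not referenced in the note), showing the same word "bank" getting different embeddings in different sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

with torch.no_grad():
    e1 = model(**tok("he sat by the river bank", return_tensors="pt")).last_hidden_state
    e2 = model(**tok("he put money in the bank", return_tensors="pt")).last_hidden_state

# assuming every word is a single WordPiece, "bank" is token 6 (after [CLS]) in both inputs
v1, v2 = e1[0, 6], e2[0, 6]
print(torch.cosine_similarity(v1, v2, dim=0))   # < 1.0: the context changes the embedding
```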
# Tokenizer
# Byte Pair Encoding
Formed from character n-grams, e.g., `ana`, `app`. Since Unicode has 149k+ characters, a large vocabulary is required to cover the most commonly used n-grams, and it is still common to see `[UNKNOWN]` tokens in the model's input. See the merge-loop sketch below.
Details in huggingface tokenizer summary
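A minimal sketch of the BPE training loop on a toy corpus (the words and counts are made up): count adjacent symbol pairs and repeatedly merge the most frequent one.

```python
from collections import Counter

# word frequencies, each word as a tuple of symbols (characters to start with)
vocab = {("b", "a", "n", "a", "n", "a"): 5, ("b", "a", "n", "d"): 3, ("a", "p", "p"): 4}

def best_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]       # most frequent adjacent pair

def merge(vocab, pair):
    new_vocab = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2   # fuse the pair into one symbol
            else:
                out.append(word[i]); i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

for _ in range(3):                          # run 3 merges; the first is ('a', 'n') -> 'an'
    vocab = merge(vocab, best_pair(vocab))
```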
# Byte Level Encoding
A variation of BPE that uses bytes as the base unit to form n-grams. There are only 256 possible base units, so a vocabulary of only ~50k covers the most commonly used n-grams. See 聚沙成塔:关于tokenization(词元化)的解疑释惑.
# WordPiece
Given a training corpus and an expected vocabulary size D, find the optimal vocabulary of size D that produces the minimal number of tokens when the corpus is tokenized with it.
# Bottom up
Bottom up: described (poorly) in Japanese and Korean Voice Search. Similar to Byte-Pair Encoding, except that it merges the pair of small tokens that increases the likelihood of the training data the most.
In the paper, it is confusing:
they need to "maximize the likelihood" of each pair by building a new Language Model (LM) at each step, but they don't say what constitutes an LM.
In huggingface, it is interpreted as:
So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is equivalent to finding the symbol pair, whose probability divided by the probabilities of its first symbol followed by its second symbol is the greatest among all symbol pairs. E.g. "u", followed by "g" would have only been merged if the probability of "ug" divided by "u", "g" would have been greater than for any other symbol pair. Intuitively, WordPiece is slightly different to BPE in that it evaluates what it loses by merging two symbols to ensure it’s worth it.
In zhihu, it is interpreted as:
The merge rule is like a decision tree: before splitting at a node, we compare the information before and after the split. If we choose feature x to split on, the entropy before the split is H(D), after the split it is the weighted entropy of the children H(D|x), so the information gain is H(D) - H(D|x). Integrated over the pair, the merge score becomes P("es") / (P("e") × P("s")); in other words, merging "e" and "s" is judged by the improvement in likelihood, not by raw frequency alone. See the sketch below.
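A minimal sketch (toy corpus, my own) contrasting the two merge criteria described above: BPE picks the most frequent symbol pair, while bottom-up WordPiece picks the pair with the highest freq(pair) / (freq(first) × freq(second)).

```python
from collections import Counter

corpus = ["hug", "pug", "pun", "bun", "hugs"]   # toy corpus, assumed
words = [tuple(w) for w in corpus]

sym_freq, pair_freq = Counter(), Counter()
for w in words:
    sym_freq.update(w)                          # single-symbol counts
    pair_freq.update(zip(w, w[1:]))             # adjacent-pair counts

bpe_best = max(pair_freq, key=pair_freq.get)    # BPE: highest raw pair frequency
wp_best = max(pair_freq,
              key=lambda p: pair_freq[p] / (sym_freq[p[0]] * sym_freq[p[1]]))
print(bpe_best, wp_best)                        # ('u', 'g') vs ('g', 's'): they can disagree
```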
# Top down
Used by bert. Starting with words and breaking them down into smaller components until they hit the frequency threshold, or can't be broken down further. For Japanese, Chinese and Korean this top-down approach doesn't work since there are no explicit word units to start with.
Details in tensorflow text library
# Evaluation
# Similarity
Way 1: `sklearn.feature_extraction.text.CountVectorizer`. Tokenize the corpus first, then get a count matrix for the prediction and the ground truth respectively; the two matrices can then be compared row by row with cosine similarity.
```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import cosine
import pandas as pd

prediction = ['sentence 1']   # predicted texts
truth = ['sentence 2']        # ground-truth texts

# supports both char and word n-grams
vec = CountVectorizer(ngram_range=(1, 2), analyzer='char')
vec.fit(prediction + truth)
m1 = vec.transform(prediction).toarray()
m2 = vec.transform(truth).toarray()

def score_calc(v1, v2):
    score = cosine(v1, v2)                        # cosine distance
    return 0 if pd.isna(score) else 1 - score     # convert to similarity

scores = [score_calc(v1, v2) for v1, v2 in zip(m1, m2)]
```
Way 2: Levenshtein distance (number of insertions, deletions, substitutions, etc.).
```python
from fuzzywuzzy import fuzz

str1 = "Apple Inc."
str2 = "apple Inc"
ratio = fuzz.ratio(str1.lower(), str2.lower())   # 0-100 similarity score
```
references