Daily summary:



A Survey on Contextual Embeddings:

Link: https://arxiv.org/abs/2003.07278

Pretraining can be divided into 2 types:

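Can ELMo be reduced to a plain BiLSTM?

No. The architecture is a BiLSTM, but ELMo's weights come from language-model pretraining rather than from training on the downstream task alone.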

ELMo is essentially a Bi-LSTM; the representation at each layer is the concatenation of the forward and backward hidden states.

ELMo has multiple layers, and the final embedding is a weighted sum of the layer representations. The weights are learned per task, so the representation is task-dependent.
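The weighted sum over layers can be sketched roughly as below. This is a minimal NumPy illustration, not the actual ELMo code: the names `scalar_mix`, `layer_logits` and `gamma` are my own, and the toy arrays are random stand-ins for real BiLSTM states. The layer weights are softmax-normalized and a task-specific scalar rescales the result.

```python
import numpy as np

def scalar_mix(layer_reps, layer_logits, gamma=1.0):
    """Combine per-layer representations into one vector per token.

    layer_reps:   array of shape (num_layers, seq_len, dim)
    layer_logits: array of shape (num_layers,), learned per task
    gamma:        task-specific scalar that rescales the mixed vector
    """
    layer_reps = np.asarray(layer_reps, dtype=float)
    logits = np.asarray(layer_logits, dtype=float)
    # Softmax so the layer weights are positive and sum to 1.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Weighted sum over the layer axis -> (seq_len, dim).
    return gamma * np.tensordot(weights, layer_reps, axes=1)

# Toy input: 3 layers, 4 tokens; each layer concatenates forward and backward states.
rng = np.random.default_rng(0)
forward = rng.normal(size=(3, 4, 3))
backward = rng.normal(size=(3, 4, 3))
layers = np.concatenate([forward, backward], axis=-1)  # (3, 4, 6)

# Equal logits give a plain average of the layers.
mixed = scalar_mix(layers, layer_logits=[0.0, 0.0, 0.0])
```

Changing `layer_logits` per task is what makes the same pretrained layers yield different embeddings for different tasks.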

GPT model:

Transformer based.

Task-specific fine-tuning: GPT applies task-specific input adaptations motivated by traversal-style approaches, preprocessing each structured text input into a single contiguous sequence of tokens.
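A rough sketch of what "a single contiguous sequence" means for a few task types. The special-token strings (`<s>`, `$`, `<e>`) and the function names are placeholders I chose for illustration; real implementations work on token ids from the model's vocabulary.

```python
# Placeholder special tokens; the actual strings/ids are an assumption here.
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def entailment_input(premise, hypothesis):
    """Premise and hypothesis joined into one sequence with a delimiter."""
    return [START, *premise, DELIM, *hypothesis, EXTRACT]

def similarity_inputs(text_a, text_b):
    """Similarity has no natural order, so both orderings are produced."""
    return [entailment_input(text_a, text_b), entailment_input(text_b, text_a)]

def multiple_choice_inputs(context, answers):
    """One sequence per candidate answer; each is scored independently."""
    return [[START, *context, DELIM, *answer, EXTRACT] for answer in answers]

example = entailment_input(["a", "man", "sleeps"], ["a", "person", "rests"])
```

The point is that the pretrained model itself is left unchanged; only the input is reshaped to fit it.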

GPT-2: the authors argue that, when trained on very large datasets, the model starts to learn some common supervised NLP tasks without explicit supervision.


In BERT, the final hidden state of [CLS] is used for sentence-level tasks, and the final hidden state of each token is used for token-level tasks.
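A small NumPy sketch of how the two kinds of heads read from the final layer. The encoder output is replaced by random numbers, and the head matrices `W_sentence` and `W_token` are hypothetical (learned during fine-tuning in practice); the only thing taken from the note is that [CLS] sits at position 0 and feeds the sentence-level head.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden_dim = 2, 5, 8
num_sentence_classes, num_token_tags = 3, 4

# Stand-in for the encoder's final-layer output; position 0 holds [CLS].
final_hidden = rng.normal(size=(batch, seq_len, hidden_dim))

# Hypothetical task heads (random here, learned during fine-tuning).
W_sentence = rng.normal(size=(hidden_dim, num_sentence_classes))
W_token = rng.normal(size=(hidden_dim, num_token_tags))

# Sentence-level task: classify from the [CLS] state only.
cls_state = final_hidden[:, 0, :]
sentence_logits = cls_state @ W_sentence   # (batch, classes)

# Token-level task: apply the same head at every position.
token_logits = final_hidden @ W_token      # (batch, seq_len, tags)
```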

TODO: the survey refers readers to Raffel et al. (2019) for a controlled comparison between unidirectional and bidirectional models; read it for partial insight on this.

Ways to use contextual embeddings for downstream tasks:

Catastrophic forgetting:

Knowledge distillation:

Analyzing Contextual Embeddings:

Probe classifiers. A large body of work studies contextual embeddings using probes. These are constrained classifiers designed to explore whether syntactic and semantic information is encoded in these representations or not.
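To make the idea concrete, here is a toy probe, assuming the simplest possible setup: frozen embeddings, a linear classifier fit by least squares, and a synthetic property that is (by construction) linearly readable from the first dimension. The data and the names `fit_linear_probe` / `probe_accuracy` are made up for illustration; the probes in the literature vary in form and training procedure.

```python
import numpy as np

def fit_linear_probe(features, labels):
    """Least-squares linear probe; the embeddings themselves stay frozen."""
    X = np.hstack([features, np.ones((len(features), 1))])  # add bias column
    weights, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return weights

def probe_accuracy(weights, features, labels):
    X = np.hstack([features, np.ones((len(features), 1))])
    predictions = np.sign(X @ weights)
    return float(np.mean(predictions == labels))

# Synthetic "embeddings": the property is the sign of dimension 0.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(400, 8))
labels = np.where(embeddings[:, 0] > 0, 1.0, -1.0)

train, test = slice(0, 300), slice(300, 400)
w = fit_linear_probe(embeddings[train], labels[train])
accuracy = probe_accuracy(w, embeddings[test], labels[test])
```

High held-out accuracy from such a restricted classifier is read as evidence that the property is encoded in the representation rather than learned by the probe.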

A Structural Probe for Finding Syntax in Word Representations

Link: https://nlp.stanford.edu/pubs/hewitt2019structural.pdf

Takeaway for our C project: we can use the syntactic tree as a probing task, e.g. syntactic tree distance, syntactic tree patterns …
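A sketch of the tree-distance idea, under my reading of the structural probe paper: gold distances are path lengths in the dependency tree, and the probe predicts squared distances after a linear map B. The head-index encoding (`-1` for the root), the function names, and the random `H` and `B` are assumptions for illustration only; in practice B is trained to minimize the loss.

```python
from collections import deque
import numpy as np

def tree_distances(heads):
    """Pairwise path lengths in a dependency tree.

    heads[i] is the index of token i's parent, or -1 for the root.
    """
    n = len(heads)
    adj = [[] for _ in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child].append(head)
            adj[head].append(child)
    dist = np.zeros((n, n), dtype=int)
    for src in range(n):
        # Breadth-first search from each token.
        seen, queue = {src}, deque([(src, 0)])
        while queue:
            node, d = queue.popleft()
            dist[src, node] = d
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
    return dist

def probe_sq_distances(H, B):
    """Squared distance ||B(h_i - h_j)||^2 for all token pairs."""
    diffs = H[:, None, :] - H[None, :, :]   # (n, n, dim)
    projected = diffs @ B.T                 # (n, n, rank)
    return (projected ** 2).sum(axis=-1)

def probe_loss(H, B, gold):
    """Mean absolute gap between gold and predicted distances."""
    n = len(gold)
    return float(np.abs(gold - probe_sq_distances(H, B)).sum() / n ** 2)

# "The chef ran": The -> chef -> ran (root).
gold = tree_distances([1, 2, -1])
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 8))   # stand-in contextual embeddings
B = rng.normal(size=(4, 8))   # untrained probe matrix
loss = probe_loss(H, B, gold)
```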