Health:
Workout: 4 km
Research:
Pushing our C project.
Revising CV and PS for NUS PhD application due 15th June.
Link: https://arxiv.org/abs/2003.07278
Pretraining can be divided into two types:
Can ELMo be reduced to a plain BiLSTM?
No: architecturally it is a BiLSTM, but what makes ELMo useful is that it is pretrained as a bidirectional language model.
ELMo is essentially a bi-directional LSTM; the representation at each layer is the concatenation of the forward and backward hidden states.
ELMo has multiple layers, and the final representation is a weighted sum over them; the weights are learned per task, so the representation is task-dependent (see the sketch below).
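A minimal sketch of that layer-weighting, assuming PyTorch (the class name and exact formulation here are illustrative, not ELMo's official implementation):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style task-specific weighted sum over layer representations:
    softmax-normalized per-layer weights plus a global scale, all learned
    jointly with the downstream task. Illustrative sketch only."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one scalar per layer
        self.gamma = nn.Parameter(torch.ones(1))               # overall scale

    def forward(self, layer_reps):
        # layer_reps: list of (batch, seq_len, dim) tensors, one per layer,
        # where dim is the concatenation of forward and backward states.
        s = torch.softmax(self.weights, dim=0)
        mixed = sum(w * h for w, h in zip(s, layer_reps))
        return self.gamma * mixed
```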
GPT model:
Transformer-based.
For task-specific fine-tuning, GPT applies task-specific input adaptations motivated by traversal-style approaches: each structured input is preprocessed into a single contiguous sequence of tokens, with delimiter tokens between its parts (see the sketch after the GPT notes).
GPT-2: the authors argue that, when trained on a very large and diverse dataset, the model starts to learn some common supervised NLP tasks without explicit supervision.
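A toy sketch of that traversal-style input adaptation; the token strings (<start>, <delim>, <extract>) follow the GPT paper's scheme, but this helper and the whitespace tokenizer are only illustrative:

```python
def build_gpt_input(premise: str, hypothesis: str) -> list:
    """Convert a structured (premise, hypothesis) pair into one contiguous
    token sequence, in the spirit of GPT's traversal-style adaptation.
    A real setup would use the model's BPE tokenizer instead of str.split."""
    tokenize = str.split  # placeholder tokenizer
    return (["<start>"]
            + tokenize(premise)
            + ["<delim>"]
            + tokenize(hypothesis)
            + ["<extract>"])  # the final token's hidden state feeds the classifier head

# Example:
# build_gpt_input("A man is eating.", "Someone is eating food.")
```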
BERT:
The final hidden state of the [CLS] token is used for sentence-level tasks, and the final hidden state of each token is used for token-level tasks (a usage sketch follows these BERT notes).
TODO: for partial insight on this, see Raffel et al. (2019), which gives a controlled comparison between unidirectional and bidirectional models.
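A minimal sketch of pulling out those two kinds of outputs, assuming the Hugging Face transformers library (not code from the survey):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Contextual embeddings are useful.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state   # (batch, seq_len, hidden_dim)
sentence_repr = hidden[:, 0, :]      # final hidden state of [CLS] -> sentence-level tasks
token_reprs = hidden[:, 1:-1, :]     # per-token states (dropping [CLS]/[SEP]) -> token-level tasks
```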
Ways to use contextual embeddings for downstream tasks:
Catastrophic forgetting:
Knowledge Distillation
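As a reminder of how the standard soft-label distillation objective looks (Hinton-style; a generic sketch assuming PyTorch, not the specific recipe from the survey):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between temperature-softened teacher and student
    distributions, so the student mimics the teacher's soft labels."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # batchmean matches the mathematical KL definition; the t**2 factor keeps
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)
```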
Analyzing Contextual Embeddings:
Probe classifiers: a large body of work studies contextual embeddings using probes, i.e., constrained classifiers designed to test whether syntactic and semantic information is encoded in these representations.
Link: https://nlp.stanford.edu/pubs/hewitt2019structural.pdf
The takeaway for our C project: we can use the syntactic tree as a probing task! E.g. syntactic tree distance, syntactic tree patterns … (a sketch of the structural-probe distance follows).
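A minimal sketch of the structural-probe distance from the linked paper (squared L2 distance under a learned linear map approximating parse-tree distance), assuming PyTorch; shapes and the training setup are simplified:

```python
import torch
import torch.nn as nn

class StructuralProbe(nn.Module):
    """Learn a linear map B so that squared L2 distances between projected
    word vectors approximate pairwise distances in the parse tree."""
    def __init__(self, hidden_dim: int, probe_rank: int = 64):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, probe_rank, bias=False)  # the matrix B

    def forward(self, word_reprs):
        # word_reprs: (seq_len, hidden_dim) contextual embeddings of one sentence
        transformed = self.proj(word_reprs)                        # (seq_len, rank)
        diffs = transformed.unsqueeze(1) - transformed.unsqueeze(0)
        return (diffs ** 2).sum(-1)                                # predicted squared tree distances

# Training: regress these against gold pairwise tree distances (e.g. L1 loss),
# keeping the underlying encoder frozen.
```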