Workout: 4 km
Pushing our C project.
Revising CV and PS for NUS PhD application due 15th June.
Pretraining approaches can be divided into 2 types: feature-based (e.g., ELMo) and fine-tuning (e.g., GPT, BERT).
Can ELMo be degraded into a plain BiLSTM?
No: architecturally it already is one; what makes it ELMo is the pretraining.
ELMo is essentially a Bi-LSTM; the representation at each layer is the concatenation of the forward and backward hidden states.
ELMo has many layers, and the output is their weighted sum; the representation is task-dependent through these learned weights.
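The weighted sum over layers can be sketched as follows; shapes, the softmax normalization of the scalar weights, and the scale factor follow the ELMo formulation, but the values here are random stand-ins rather than trained parameters:

```python
import numpy as np

# Hypothetical setup: K layer representations (embedding layer + biLSTM
# layers), each of shape [seq_len, 2*hidden] (forward ++ backward concat).
rng = np.random.default_rng(0)
num_layers, seq_len, dim = 3, 5, 8
layer_reps = rng.normal(size=(num_layers, seq_len, dim))

# Task-specific scalar weights s_k (softmax-normalized) and a scale gamma;
# in the real model these are learned jointly with the downstream task.
s = rng.normal(size=num_layers)
weights = np.exp(s) / np.exp(s).sum()   # softmax over layers
gamma = 1.0

# ELMo_t = gamma * sum_k w_k * h_{t,k}
elmo = gamma * np.einsum("k,ktd->td", weights, layer_reps)
```

Because the weights are the only task-specific part, different downstream tasks can emphasize different layers (e.g., lower layers for syntax, higher for semantics).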
Task-specific fine-tuning: GPT applies task-specific input adaptations motivated by the traversal-style approach, preprocessing each structured input as a single contiguous sequence of tokens.
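A minimal sketch of such input transformations; the concrete token strings (`START`, `DELIM`, `EXTRACT`) are hypothetical placeholders, since GPT uses learned special embeddings rather than literal strings:

```python
# Hypothetical special tokens standing in for GPT's learned start,
# delimiter, and extraction embeddings.
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def format_entailment(premise: str, hypothesis: str) -> str:
    # Entailment: concatenate the two sentences with a delimiter,
    # yielding one contiguous token sequence.
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def format_choice(context: str, answers: list[str]) -> list[str]:
    # Multiple choice: one traversal-style sequence per candidate answer;
    # each is scored independently by the model.
    return [f"{START} {context} {DELIM} {a} {EXTRACT}" for a in answers]

seq = format_entailment("A man is sleeping.", "A person rests.")
```

The point of the traversal approach is that no task-specific architecture is needed: every structured task is flattened into the same left-to-right sequence format the pretrained model already consumes.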
GPT-2: the authors argue that, when trained on very large datasets, the model starts to implicitly learn some common supervised NLP tasks.
The final hidden state of [CLS] is used for sentence-level tasks and the final hidden state of each token is used for token-level tasks.
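A toy sketch of the two readouts, using random arrays in place of a real BERT encoder; the only assumption from the source is that [CLS] sits at position 0 and feeds sentence-level heads, while all positions feed token-level heads:

```python
import numpy as np

# Stand-in for the encoder's final hidden states: [seq_len, hidden].
# Position 0 is the [CLS] token by convention.
rng = np.random.default_rng(1)
seq_len, hidden = 6, 4
last_hidden = rng.normal(size=(seq_len, hidden))

cls_vec = last_hidden[0]      # sentence-level tasks read only [CLS]
token_vecs = last_hidden      # token-level tasks read every position

# Hypothetical linear heads (untrained): 2-way sentence classifier,
# 5-way per-token tagger.
sentence_logits = cls_vec @ rng.normal(size=(hidden, 2))
tag_logits = token_vecs @ rng.normal(size=(hidden, 5))
```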
TODO: for partial insight on this, see (Raffel et al., 2019) for a controlled comparison between unidirectional and bidirectional models.
Ways to use contextual embeddings for downstream tasks:
Analyzing Contextual Embeddings:
Probe classifiers. A large body of work studies contextual embeddings using probes: constrained classifiers designed to test whether syntactic and semantic information is encoded in these representations.
What I can learn for our C project: we could use the syntactic tree as a probing task! e.g., syntactic-tree distance, syntactic-tree patterns …
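The tree-distance idea can be sketched in the style of a structural probe (Hewitt & Manning, 2019): learn a linear map B so that the squared norm of B(h_i - h_j) approximates the parse-tree distance between words i and j. Here B is random, so this only shows the shape of the computation, not a trained probe:

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, hidden, rank = 5, 16, 8
H = rng.normal(size=(seq_len, hidden))   # contextual embeddings (toy)
B = rng.normal(size=(hidden, rank))      # probe parameters (untrained here)

# Pairwise differences, then predicted squared distance under the probe:
# d(i, j) = || B (h_i - h_j) ||^2
diffs = H[:, None, :] - H[None, :, :]        # [seq, seq, hidden]
pred_dist = np.square(diffs @ B).sum(-1)     # [seq, seq]

# Training would fit B to minimize |pred_dist[i, j] - tree_dist[i, j]|
# over a treebank; high probe accuracy suggests the embeddings encode
# syntactic-tree structure.
```

The same recipe extends to other syntactic probes (e.g., predicting tree depth from the norm of B h_i).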