## ACL $$2022$$ Interesting Papers & Workshop


In this short blog post, I will highlight a couple of interesting ideas presented at ACL $$2022$$.

Contrastive Visual Semantic Pretraining Magnifies the Semantics of Natural Language Representations

This paper investigates the use of CLIP for a non-visual task (i.e., sentence similarity). CLIP encodes natural language with a GPT$$2$$ encoder for its image classification task. GPT$$2$$ itself is a causal language model trained only on next-word prediction. This work compares the representations of GPT$$2$$ and CLIP from a purely linguistic perspective. Note that the two models are trained on very different objectives. What makes CLIP unique is that it can be used as an image classifier "on the fly" using only natural language and a pre-defined set of class labels, without fine-tuning on data for any task.

The main idea of this work is to use the self-similarity between contextualized word/sentence embeddings to study whether the CLIP contrastive visual objective has affected the anisotropy of the original GPT-$$2$$ model:

$\small s=\frac{1}{n^{2}-n} \sum_{i} \sum_{j \neq i} \cos \left(\vec{w}_{i}, \vec{w}_{j}\right)$

where $$\cos$$ is the cosine similarity and $$n$$ is the number of word embeddings $$w$$ used in the self-similarity measurement.
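The self-similarity score above can be computed in a few lines. This is a minimal NumPy sketch, not the authors' code; the function name `self_similarity` and the toy vectors are made up for illustration.

```python
import numpy as np

def self_similarity(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all ordered pairs (i, j) with i != j."""
    n = embeddings.shape[0]
    # Normalise rows so that dot products equal cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = unit @ unit.T
    # Remove the diagonal (cos(w_i, w_i) = 1) and average the n^2 - n other terms.
    return float((cos.sum() - np.trace(cos)) / (n * n - n))

# Toy data (made up): three 2-d "embeddings".
w = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
s = self_similarity(w)
```

A value close to 1 means the embeddings all point in nearly the same direction (high anisotropy), while a value near 0 means they are spread out.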

Their results show high self-similarity in the early layers and a loss of semantic information about the input token in the upper layers. These effects are much less pronounced in CLIP than in GPT$$2$$. This indicates that the contrastive visual semantic objective has a regularizing effect that shapes more than the projection of the sentence embedding. The table below compares CLIP and GPT$$2$$ on semantic similarity via cosine distance:

| Model | Spearman’s ρ |
| --- | --- |
| GPT2 | 0.45 |
| CLIP | 0.73 |

In summary, the main finding of this work is that visual semantic pre-training is beneficial not only for visual representations but also for encoding the semantics of natural language at both the word and sentence levels, because it also captures semantic information about the language itself.

XDBERT-Distilling Visual Information to BERT from CLIP

This work proposes to inject visual information into BERT by distillation from CLIP. In particular, the information is distilled from the CLIP text model (CLIP-T), so no images are needed during inference.

The introduced model is a fusion of two pre-trained models: (1) BERT (masked language modeling and next sentence prediction) and (2) CLIP, trained with a contrastive loss. After training, the BERT model can be used alone for downstream tasks via fine-tuning, as shown in the figure below.

| Model | STSB | QQP |
| --- | --- | --- |
| CLIP-T | 22.07 | - |
| BERT-b | 88.67 | 89.51 |
| XDBERT-b | 92.78 | 88.57 |

The table above shows the benefits of these visual representations in non-visual downstream tasks such as STS-B and Quora Question Pairs (QQP).

Sentence-T$$5$$: Scalable Sentence Encoders

This approach adapts T$$5$$ (a seq-to-seq model) for sentence embedding tasks. The model relies on contrastive learning to pull positive sentence pairs together while pushing negative ones apart. Note that, unlike BERT, which uses the [CLS] token at the beginning of each sentence, T$$5$$ as a seq-to-seq model is assumed to be aware of the semantics of the entire sentence when generating its prediction. The figure below shows the original T$$5$$ and the different architectures proposed for sentence embedding tasks.

A T$$5$$-based sentence encoder can be set up with just four lines of code using the Sentence Transformers library:

```python
from sentence_transformers import SentenceTransformer, models
word_embedding_model = models.Transformer('t5-base', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```

Training uses a dual encoder and proceeds in two stages: (1) fine-tuning on semi-structured data (e.g., Community QA), and then (2) fine-tuning on human-labeled data (e.g., Natural Language Inference). The model uses an in-batch sampled softmax contrastive loss:

$\small \mathcal{L}=\frac{e^{\operatorname{sim}\left(v_{i}, v_{i}^{+}\right) / \tau}}{\sum_{j \in \mathcal{B}} e^{\operatorname{sim}\left(v_{i}, v_{j}^{+}\right) / \tau}+e^{\operatorname{sim}\left(v_{i}, v_{j}^{-}\right) / \tau}}$

where $$v_{i}$$ is the input sentence, $$v_{i}^{+}$$ is its semantically related (positive) sentence, $$v_{j}^{-}$$ is a negative sentence provided with an example, $$\mathcal{B}$$ is the mini-batch of examples, and $$\tau$$ is the softmax temperature.
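To make the loss concrete, here is a small NumPy sketch. It rests on a few assumptions that are not stated in the text: cosine similarity is used for $$\operatorname{sim}$$, the sum over the batch covers both the positive and the negative terms, the quantity minimised is the negative log of the ratio, and `tau=0.05` is just an illustrative value.

```python
import numpy as np

def in_batch_contrastive_loss(v, v_pos, v_neg, tau=0.05):
    """Negative log of the in-batch softmax ratio, averaged over the batch.

    v, v_pos, v_neg: arrays of shape (batch, dim).
    Cosine similarity and tau=0.05 are assumptions made for this sketch.
    """
    def normalise(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    v, v_pos, v_neg = normalise(v), normalise(v_pos), normalise(v_neg)
    pos_logits = v @ v_pos.T / tau   # entry [i, j] = sim(v_i, v_j^+) / tau
    neg_logits = v @ v_neg.T / tau   # entry [i, j] = sim(v_i, v_j^-) / tau
    logits = np.concatenate([pos_logits, neg_logits], axis=1)
    # Numerically stable log-sum-exp over every candidate in the batch.
    m = logits.max(axis=1, keepdims=True)
    log_denominator = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    log_numerator = np.diag(pos_logits)  # sim(v_i, v_i^+) / tau
    return float(np.mean(log_denominator - log_numerator))

# Toy batch (made up): positives equal the anchors, negatives point the opposite way.
anchors = np.eye(4)
loss_aligned = in_batch_contrastive_loss(anchors, anchors, -anchors)
loss_swapped = in_batch_contrastive_loss(anchors, -anchors, anchors)
```

The loss is close to zero when each anchor is nearest to its own positive and grows when a negative is closer than the positive.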

A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models

Large language models such as GPT-$$3$$ ($$175$$B) show good performance on zero-shot and few-shot learning tasks. In zero-shot learning, the model predicts the answer given only a natural language description of the task (without any gradient updates). In one-shot learning, the model additionally sees a single example of the task (again without any gradient update), as shown in the following examples:

Zero-shot:

Translate English to French: <- task description
Cheese <- prompt

One-shot:

Translate English to French: <- task description
sea otter ==> loutre de mer <- example
Cheese <- prompt
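The two prompt layouts above differ only in whether demonstrations are included. A minimal sketch of a prompt builder follows; the helper name `build_prompt` is hypothetical and simply mirrors the layout shown in the examples.

```python
def build_prompt(task_description, query, examples=()):
    """Assemble a zero-shot (no examples) or few-shot prompt as plain text."""
    lines = [task_description]
    # Each demonstration is written as "source ==> target", as in the example above.
    lines += [f"{source} ==> {target}" for source, target in examples]
    lines.append(query)
    return "\n".join(lines)

zero_shot = build_prompt("Translate English to French:", "Cheese")
one_shot = build_prompt(
    "Translate English to French:",
    "Cheese",
    examples=[("sea otter", "loutre de mer")],
)
```

In both cases the model weights stay fixed; only the text that is fed to the model changes.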

This work proposes a smaller, T$$5$$-based few-shot learner ($$220$$M/$$740$$M) for language and vision tasks. The model uses two pre-training objectives: (1) masked language modeling (fill in the masked words) and (2) prefix language modeling (PrefixLM), which generates the continuation of a given prefix; e.g., given the prefix a lady walking next to a bicycle, the model generates the continuation carrying an umbrella.

The findings of this work are: (1) prompts influence zero-shot performance; (2) given a larger training dataset, noisy prompts learn as quickly as hand-crafted prompts; and (3) masked language modeling helps the VQA task while PrefixLM boosts captioning performance.

Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval

Dense retrieval with a pre-trained model is, in principle, a straightforward process: a query $$q$$, a positive passage $$d^{+}$$, and a set of negative passages $$d_{1}^{-}, d_{2}^{-}, \ldots$$ are fed to a contrastive loss function, where $$s(\cdot, \cdot)$$ is the similarity score:

$\small \mathcal{L}=-\log \frac{\exp \left(s\left(q, d^{+}\right)\right)}{\exp \left(s\left(q, d^{+}\right)\right)+\sum_{l} \exp \left(s\left(q, d_{l}^{-}\right)\right)}$

However, this results in poor performance or a complicated fine-tuning pipeline, and thus a hard optimization problem. This work builds on Condenser (pre-training conditioned on dense representations), a Transformer architecture with asymmetric connections, and removes the overhead of large-scale training with noisy labels and complicated pipelines, yielding one robust model. In particular, it combines (1) an architecture that enhances the BERT [CLS] representation and forces it to play an essential role in pre-training, giving a dense retrieval model with improved local noise resistance, and (2) corpus-level contrastive learning that warms up the global embedding space.

Condenser consists of three groups of Transformer encoder layers: ($$1$$) early backbone, ($$2$$) late backbone, and ($$3$$) Condenser head layers. The first two are typical Transformer encoders. The Condenser head takes as input the [CLS] vector from the late layers and the token vectors from the early layers:

$\small \left[h_{c l s}^{c d} ; h^{c d}\right]=\text { Condenser }_{\text {head }}\left(\left[h_{c l s}^{\text {late }} ; h^{\text {early }}\right]\right)$

Pre-training is done with a masked language model (MLM) objective based on the head layer output:

$\small \mathcal{L}_{\mathrm{mlm}}=\sum_{i \in \text { masked }} \text { CrossEntropy }\left(W h_{i}^{c d}, x_{i}\right)$

The Condenser head forces [CLS] (1) to aggregate information in the late layers so that it takes part in the language-model prediction, and (2) to become a stronger representation without relying on noisy labels.

Then, on top of the Condenser head, a corpus-aware contrastive learning objective is used as follows: (1) take a set of documents $$d_{1}, d_{2}, \ldots, d_{n}$$ from the target search corpus; (2) extract a pair of spans from each document, $$s_{11}, s_{12}, \ldots, s_{n1}, s_{n2}$$, which are treated as pseudo passages; and (3) encode every span $$s_{ij}$$ into a vector $$h_{ij}$$ and apply the loss:

$\small \mathcal{L}_{i j}^{c o}=-\log \frac{\exp \left(\left\langle h_{i 1}, h_{i 2}\right\rangle\right)}{\sum_{k=1}^{n} \sum_{l=1}^{2} \mathbb{I}_{i j \neq k l} \exp \left(\left\langle h_{i j}, h_{k l}\right\rangle\right)}$

This loss is similar to the SimCLR and word2vec NCE losses, but here learning is done over span vectors. Finally, the two losses (MLM and contrastive) are combined:

$\small \mathcal{L}=\frac{1}{2 n} \sum_{i=1}^{n} \sum_{j=1}^{2}\left[\mathcal{L}_{i j}^{\mathrm{mlm}}+\mathcal{L}_{i j}^{c o}\right]$
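The corpus-aware contrastive term can be sketched in NumPy as below. This is only an illustration of the formula, not the authors' implementation: the array layout `(n, 2, D)`, the function name, and the toy vectors are assumptions, and the MLM term is left out.

```python
import numpy as np

def corpus_contrastive_loss(h):
    """Contrastive term averaged over all 2n spans.

    h has shape (n, 2, D): two span embeddings h_{i1}, h_{i2} per document.
    """
    n, _, dim = h.shape
    flat = h.reshape(2 * n, dim)             # row 2*i + j holds span j of document i
    scores = flat @ flat.T                   # inner products <h_ij, h_kl>
    np.fill_diagonal(scores, -np.inf)        # indicator term: exclude ij == kl
    m = scores.max(axis=1, keepdims=True)
    log_denominator = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    partner = np.arange(2 * n) ^ 1           # index of the other span of the same document
    positive = scores[np.arange(2 * n), partner]
    return float(np.mean(log_denominator - positive))

# Toy data (made up): two documents whose spans agree within a document ...
e1, e2 = np.array([5.0, 0.0]), np.array([0.0, 5.0])
loss_matched = corpus_contrastive_loss(np.array([[e1, e1], [e2, e2]]))
# ... versus spans that match the other document instead of their own.
loss_mismatched = corpus_contrastive_loss(np.array([[e1, e2], [e1, e2]]))
```

Spans from the same document act as positives and every other span in the batch acts as a negative, so the loss is small only when the two spans of a document are closer to each other than to spans of other documents.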

Retrieve Fast, Rerank Smart: Joint Approaches for Improved Cross-Modal Retrieval

This paper proposes a setup that combines a bi-encoder and a cross-encoder, trained jointly, with re-ranking for cross-modal retrieval scenarios. The problem with the cross-encoder is that it needs a full forward pass for every query-image pair to find the most related image, which is too slow for search. The other option is the bi-encoder, which uses fast cosine search. However, its performance is limited and does not match that of the cross-encoder. The concept proposed in this work has two parts: (1) quick retrieval of the top-$$k$$ candidates with the bi-encoder, and then (2) re-ranking of the top-$$k$$ with the cross-encoder. In this way, the model can achieve state-of-the-art performance while keeping retrieval fast.

The cross-encoder is fine-tuned on both image and text using a Binary Cross-Entropy (BCE) loss:
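The two-stage idea can be sketched as follows. This is a toy illustration rather than the paper's pipeline: `cross_score` is a hypothetical callable standing in for the cross-encoder, and the vectors and scores are made up.

```python
import numpy as np

def retrieve_and_rerank(query_vec, image_vecs, cross_score, k=3):
    """Stage 1: cosine top-k with bi-encoder vectors. Stage 2: re-rank the k hits.

    cross_score(image_index) stands in for the cross-encoder score of the
    (image, query) pair; it is a hypothetical callable, not a library function.
    """
    q = query_vec / np.linalg.norm(query_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    cosine = imgs @ q
    top_k = np.argsort(-cosine)[:k]          # cheap candidate retrieval
    # The expensive scorer is only called on the k retrieved candidates.
    return sorted(top_k.tolist(), key=cross_score, reverse=True)

# Toy data (made up): five image vectors and fixed cross-encoder scores.
images = np.array([[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.0, 1.0], [-1.0, 0.0]])
scores = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.99, 4: 0.1}
ranking = retrieve_and_rerank(np.array([1.0, 0.0]), images, scores.get, k=3)
```

In this toy case image 3 has the highest cross-encoder score but is never re-ranked because the bi-encoder did not retrieve it, which illustrates why the quality of the first stage matters.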

$\small \mathcal{L}_{\mathrm{CE}}(i, c)=-(y \log \mathrm{p}(i, c)+(1-y) \log (1-\mathrm{p}(i, c)))$

Here $$p(i, c)$$ is the predicted probability for the combined input of image $$i$$ and caption $$c$$, and $$y$$ is the binary match label. At retrieval time, all $$(i, c)$$ pairs need to be processed and ranked by $$p(i, c)$$. Given a text query $$c$$, the model retrieves the single most related image $$i$$ from the image collection $$I$$:

$\small \arg \max (\mathrm{p}(i, c), \forall i \in I)$

For the bi-encoder, each image and text caption is passed separately through the pre-trained Transformer model. It is trained with the loss below, where $$(i, c)$$ are positive image-caption pairs from the training corpus, while $$c^{\prime}$$ and $$i^{\prime}$$ are negative examples sampled from the same corpus such that the pairs $$(i, c^{\prime})$$ and $$(i^{\prime}, c)$$ are not present in it:

$\small \begin{aligned} \mathcal{L}_{\mathrm{BE}}(i, c)=& {\left[\cos \left(\mathbf{i}, \mathbf{c}^{\prime}\right)-\cos (\mathbf{i}, \mathbf{c})+\alpha\right]^{+} } \\ &+\left[\cos \left(\mathbf{i}^{\prime}, \mathbf{c}\right)-\cos (\mathbf{i}, \mathbf{c})+\alpha\right]^{+} \end{aligned}$

where $$[\cdot]^{+}=\max (0, \cdot)$$, $$\alpha$$ defines a margin, and $$\mathbf{i}^{\prime}$$ and $$\mathbf{c}^{\prime}$$ are the embeddings of the negative image and caption from the batch.
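The bi-encoder margin loss for a single training pair can be written directly from the formula. A minimal sketch, assuming plain NumPy vectors for the embeddings and an illustrative margin `alpha=0.1` (the value is not taken from the paper):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bi_encoder_loss(i, c, i_neg, c_neg, alpha=0.1):
    """Sum of two hinge terms: one for a negative caption, one for a negative image."""
    positive = cos(i, c)
    caption_term = max(0.0, cos(i, c_neg) - positive + alpha)
    image_term = max(0.0, cos(i_neg, c) - positive + alpha)
    return caption_term + image_term

# Toy embeddings (made up).
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
loss_good = bi_encoder_loss(i=a, c=a, i_neg=b, c_neg=b)  # positive pair is aligned
loss_bad = bi_encoder_loss(i=a, c=b, i_neg=b, c_neg=a)   # negatives are closer than the positive
```

The loss is zero once the positive pair beats both negatives by at least the margin, so only violating pairs contribute to training.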

Predicate-Argument Based Bi-Encoder for Paraphrase Identification

Paraphrase identification is the task of deciding whether two sentences express the same meaning, as in the following example:

a) Marriage equality law passed in Rhode Island
b) Rhode Island becomes the 10th state to enact marriage equality

Although cross-encoders achieve SOTA performance across different benchmarks, they are more complex and require more resources. This work proposes an enhancement to a less complex bi-encoder-based model (e.g., SBERT), an architecture that has been adopted in many sentence-pair tasks such as semantic similarity and neural retrieval. The model consists of a BERT encoder, a Predicate-Argument Spans layer, an Aggregation layer, and finally a Concatenation layer with SBERT:

BERT: Each sentence is first fed into BERT with a mean pooling layer to extract a fixed contextual representation.

Predicate Argument Spans (PAS): Then, the mean-pooled BERT contextual representation is fed into a BERT-based Semantic Role Labeling (SRL) tagger to obtain predicates and their relevant arguments. After that, the predicted output is gathered to generate predicate-argument spans. For example:

He slices tomatoes in the kitchen
(He, slices), (slices, tomatoes), (slices, in, the, kitchen)

In this sentence, the predicate slices has three arguments: he, tomatoes, and in the kitchen. Using this strategy, the authors form three predicate-argument spans and split them into individual words: (He, slices), (slices, tomatoes), (slices, in, the, kitchen). In this way, all potential predicate-argument spans are obtained for each sentence. Note that if the model cannot capture any predicate-argument structure, the mean-pooled sentence representation over all tokens is used.

Aggregation: After obtaining all the predicate-argument spans, a BERT model is used to represent them from their word tokens. Given the predicate-argument span sequence of a sentence, $$s=\left\{s_{1}, s_{2}, \ldots, s_{N}\right\}$$, where $$N$$ is the number of spans and each span $$s_{i}$$ consists of tokens $$x=\left\{x_{1}, \ldots, x_{l}\right\}$$, a dense vector $$h_{i}$$ is generated for span $$s_{i}$$ by mean pooling over all of its tokens: $h_{i}=\operatorname{MeanPooling}\left(x\right)$

where each span representation $$h_{i} \in \mathbb{R}^{D}$$, and the whole span sequence is represented as $$h=\left\{h_{1}, h_{2}, \ldots, h_{N}\right\}$$. Then, all span information is aggregated using a self-attentive mechanism.

Connect BERT and Aggregation: Finally, to get the final sentence representation, the authors concatenate the mean-pooling-based sentence representation (e.g., SBERT) with all the generated span-based sentence representations.

BitFit: Simple Parameter-efficient Fine-tuning for Transformer

This work answers an important research question: do we really need to fine-tune all parameters for every downstream task? The authors propose fine-tuning only the bias terms of a Transformer-based masked language model, which amount to less than 0.1% of the model parameters. The idea is to (1) freeze most of the Transformer encoder parameters and (2) fine-tune only the bias terms. Let $$L$$ be the number of BERT encoder layers and $$M$$ the number of self-attention heads per layer, where each self-attention head $$(m, \ell)$$ has query, key, and value encoders, each taking the form of a linear layer:

$\small \begin{aligned} &\mathbf{Q}^{m, \ell}(\mathbf{x})=\color{blue}{\mathbf{W}_{q}^{m, \ell}} \color{black}{\mathbf{x}}+\color{brown}{\mathbf{b}_{q}^{m, \ell}} \\ &\mathbf{K}^{m, \ell}(\mathbf{x})=\color{blue}{\mathbf{W}_{k}^{m, \ell}} \color{black}{\mathbf{x}}+\color{brown}{\mathbf{b}_{k}^{m, \ell}} \\ &\mathbf{V}^{m, \ell}(\mathbf{x})=\color{blue}{\mathbf{W}_{v}^{m, \ell}} \color{black}{\mathbf{x}}+\color{brown}{\mathbf{b}_{v}^{m, \ell}} \end{aligned}$

where $$\mathbf{x}$$ is the output of the previous encoder layer (for the first encoder layer, $$\mathbf{x}$$ is the output of the embedding layer). These are then combined using an attention mechanism that does not involve new parameters:

$\small \mathbf{h}_{1}^{\ell}=\operatorname{att}\left(\mathbf{Q}^{1, \ell}, \mathbf{K}^{1, \ell}, \mathbf{V}^{1, \ell}, \ldots, \mathbf{Q}^{m, \ell}, \mathbf{K}^{m, \ell}, \mathbf{V}^{m, \ell}\right)$

And then everything is fed into an MLP with Layer-Norm ($$LN$$):

$\small \begin{aligned} \mathbf{h}_{2}^{\ell} &=\text {Dropout}\left(\color{blue}{\mathbf{W}_{m_{1}}^{\ell}} \cdot \color{black}{\mathbf{h}_{1}^{\ell}}+\textcolor{brown}{\mathbf{b}_{m_{1}}^{\ell}} \right) \\ \mathbf{h}_{3}^{\ell} &=\color{blue}{\mathbf{g}_{L N_{1}}^{\ell}} \color{black}{\odot} \frac{\left(\mathbf{h}_{2}^{\ell}+\mathbf{x}\right)-\mu}{\sigma}+\textcolor{brown}{\mathbf{b}_{L N_{1}}^{\ell}} \\ \mathbf{h}_{4}^{\ell} &=\operatorname{GELU}\left(\color{blue}{\mathbf{W}_{m_{2}}^{\ell}} \color{black}{\cdot} \space \mathbf{h}_{3}^{\ell}+\textcolor{brown}{\mathbf{b}_{m_{2}}^{\ell}}\right) \\ \mathbf{h}_{5}^{\ell} &=\operatorname{Dropout}\left(\color{blue}{\mathbf{W}_{m_{3}}^{\ell}} \color{black}{\cdot} \space \mathbf{h}_{4}^{\ell}+\textcolor{brown}{\mathbf{b}_{m_{3}}^{\ell}}\right) \\ \text { out }^{\ell} &=\color{blue}{\mathbf{g}_{L N_{2}}^{\ell}} \color{black}{\odot} \frac{\left(\mathbf{h}_{5}^{\ell}+\mathbf{h}_{3}^{\ell}\right)-\mu}{\sigma}+\color{brown}{\mathbf{b}_{L N_{2}}^{\ell}} \end{aligned}$

The $$\color{blue}{\mathbf{W}_{(\cdot)}^{\ell,(\cdot)}}$$, $$\color{brown}{\mathbf{b}_{(\cdot)}^{\ell,(\cdot)}}$$ and $$\color{blue}{\mathbf{g}_{(\cdot)}^{\ell}}$$ are the network parameters. Since the bias terms are additive, they account for only a small fraction of the parameters of the network (e.g., $$0.08$$% for BERT large). This work freezes all the parameters $$\color{blue}{\mathbf{W}}$$ and $$\color{blue}{\mathbf{g}}$$ and fine-tunes only the bias terms $$\color{brown}{\mathbf{b}_{(\cdot)}^{\ell,(\cdot)}}$$.
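To get a feel for how small the bias fraction is, the sketch below counts parameters for a single encoder layer laid out as in the equations above. The sizes `hidden=768` and `ffn=3072` are BERT-base-like values assumed for illustration, the parameter names are made up, and embeddings are not counted.

```python
# Parameter shapes for one encoder layer, following the equations above.
# hidden=768 and ffn=3072 are assumed, BERT-base-like sizes.
hidden, ffn = 768, 3072
layer = {
    "W_q": (hidden, hidden), "b_q": (hidden,),
    "W_k": (hidden, hidden), "b_k": (hidden,),
    "W_v": (hidden, hidden), "b_v": (hidden,),
    "W_m1": (hidden, hidden), "b_m1": (hidden,),
    "g_LN1": (hidden,), "b_LN1": (hidden,),
    "W_m2": (ffn, hidden), "b_m2": (ffn,),
    "W_m3": (hidden, ffn), "b_m3": (hidden,),
    "g_LN2": (hidden,), "b_LN2": (hidden,),
}

def size(shape):
    """Number of scalar parameters in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

# BitFit-style selection: only names starting with "b_" stay trainable; W and g are frozen.
trainable = {name: name.startswith("b_") for name in layer}
n_bias = sum(size(shape) for name, shape in layer.items() if trainable[name])
n_total = sum(size(shape) for shape in layer.values())
bias_fraction = n_bias / n_total
```

Under these assumptions the bias terms are roughly 0.12% of one layer; the large embedding matrices, which add weights but no bias terms, would push the whole-model figure lower. In a framework such as PyTorch, the same selection is typically expressed by setting `requires_grad = False` on every parameter that is not a bias.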

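| Model | %Param | STSB | QQP |
| --- | --- | --- | --- |
| Full-FT | 100 | 88.9±0.7 | 87.1±0.1 |
| BitFit | 0.09 | 89.2±0.2 | 84.0±0.2 |

The table shows that, with only 0.09% of the parameters being fine-tuned, the model still achieves a slightly better result on STSB and remains close to full fine-tuning on QQP.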

Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP

This paper shows that large language models can self-diagnose (SD) bias with a simple prompting template:

Naturally: the nurse is a ___ <- prompt
woman. <- GPT$$2$$
bit of an expert on the topic. <- GPT$$2$$+SD (prompting: This is a sexist)

As shown in the example above, the model is aware of its own bias and capable of debiasing its output. In this example, the self-diagnosis feedback is given to the model as a prompt (this sentence is sexist), and the model then replaces its previous output with one that fixes the bias.

## Workshop

Memory-assisted prompt editing to improve GPT-$$3$$ after deployment

This work proposes a simple memory-enhanced prompting approach for GPT-$$3$$, as shown in the following example:

What word is similar to good ? <- User
The homophone of good is: wood <- GPT-$$3$$
similar to "means" with similar meaning <- User
Noted (writes to memory) <- GPT-$$3$$
What word is similar to surprised? <- User
(retrieves from memory) amazed <- GPT-$$3$$ + memory

When the user corrects the model, the model memorizes the clarification and recalls it when the user asks a similar question, which avoids making the same mistake again.

On the Impact of Data Augmentation on Downstream Performance in NLP

Data augmentation is not as common in NLP as in Computer Vision. There are several approaches to augmenting data in NLP, such as back translation and random word deletion.

This paper investigates the research question: is data augmentation as useful in NLP as it is in Computer Vision? The results show that the improvement from data augmentation on $$12$$ NLP datasets with BERT and GPT-$$2$$ is negligible. However, there is a gain on small datasets (the top results come from random word deletion, back translation, and synonym substitution).
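As an illustration of the simplest of these techniques, here is a minimal sketch of random word deletion. The function name, the deletion probability `p=0.2`, and the whitespace tokenisation are assumptions made for this example, not details from the paper.

```python
import random

def random_word_deletion(sentence, p=0.2, seed=None):
    """Drop each whitespace-separated word independently with probability p."""
    rng = random.Random(seed)
    words = sentence.split()
    if not words:
        return sentence
    kept = [word for word in words if rng.random() >= p]
    # Never return an empty sentence: fall back to one randomly chosen word.
    if not kept:
        kept = [rng.choice(words)]
    return " ".join(kept)

unchanged = random_word_deletion("the movie was surprisingly good", p=0.0)
single = random_word_deletion("the movie was surprisingly good", p=1.0, seed=0)
augmented = random_word_deletion("the movie was surprisingly good", p=0.2, seed=0)
```

Each augmented sentence keeps the label of the original, which is why aggressive deletion rates can introduce label noise.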

Few-Shot Learning with Siamese Networks and Label Tuning

This work also tackles few-shot and zero-shot learning, but with a focus on real applications. The main problem with running these models in real applications is the slow inference time.

To tackle this limitation, the work relies on ($$1$$) a Siamese network instead of a cross-attention network to speed up inference, and ($$2$$) Label Tuning (LT), which makes fine-tuning cheap. The main idea of LT is to first compute an embedding for each desired class and then tune these embeddings on a subset of labeled examples. More specifically, given data containing $$N$$ pairs of an input $$x_{i}$$ and its reference label $$z_{i}$$, the authors compute a matrix of input embeddings $$X \in \mathbb{R}^{N \times d}$$ and a matrix of label embeddings $$Y \in \mathbb{R}^{K \times d}$$, where $$K$$ is the number of labels and $$d$$ is the embedding dimension. A score for each input-label combination is then defined as $\small S=X \times Y^{T}\left(S \in \mathbb{R}^{N \times K}\right)$ and the label embeddings are tuned using cross-entropy: $\small \mathcal{J}^{\prime}=-\frac{1}{N} \sum_{i=1}^{N}\left[S_{i, z_{i}}-\log \sum_{j=1}^{K} e^{S_{i, j}}\right]$

To prevent overfitting, regularization techniques are used (i.e., a Frobenius norm penalty with dropout), and the model is tuned with $$4$$-fold cross-validation on the few-shot set.
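The scoring matrix and cross-entropy objective above can be sketched in NumPy as follows. This is an illustrative reading of the formulas only: the regularization and dropout are left out, plain gradient descent with `lr=0.1` is an assumption, and the function names and toy data are made up.

```python
import numpy as np

def label_tuning_loss(X, Y, z):
    """Cross-entropy over the scores S = X Y^T; z holds the reference label indices."""
    S = X @ Y.T                                    # shape (N, K)
    m = S.max(axis=1, keepdims=True)
    log_norm = m[:, 0] + np.log(np.exp(S - m).sum(axis=1))
    return float(-np.mean(S[np.arange(len(z)), z] - log_norm))

def label_tuning_step(X, Y, z, lr=0.1):
    """One gradient-descent step on the label embeddings Y; the inputs X stay fixed."""
    S = X @ Y.T
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)              # softmax over the K labels
    onehot = np.eye(Y.shape[0])[z]
    grad_Y = (P - onehot).T @ X / len(z)           # gradient of the loss w.r.t. Y
    return Y - lr * grad_Y

# Toy data (made up): two inputs, two labels, label embeddings initialised at zero.
X = np.eye(2)
z = np.array([0, 1])
Y0 = np.zeros((2, 2))
loss_before = label_tuning_loss(X, Y0, z)
Y1 = label_tuning_step(X, Y0, z)
loss_after = label_tuning_loss(X, Y1, z)
```

Only the $$K \times d$$ label matrix is updated, so the encoder can be shared across tasks and each task needs to store just its tuned label embeddings.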

Eye Gaze and Self-attention: How Humans and Transformers Attend Words in Sentences

Attention in humans is divided into two parts: ($$1$$) overt attention, i.e., looking at parts of a sentence, and ($$2$$) covert attention, which happens inside the human brain. This work compares overt attention, which is measurable (i.e., eye tracking shows which part of the sentence is focused on), with Transformer self-attention. The findings about the correlation between overt human attention and Transformer self-attention are:

1. It appears only in the early layers.

2. It does not depend on the size of the model.

3. It has nothing to do with model performance.

Estimating word co-occurrence probabilities using a log-bilinear model

This paper recycles one of the simplest neural language models, the log-bilinear (LBL) model, to estimate co-occurrence probabilities using word embeddings. The main idea is to leverage pre-trained embeddings to improve generalization in small-dataset scenarios, based on a word $$w$$ and its related context $$c$$. Given a vocabulary $$V$$, a finite set of target words $$W \subseteq V$$, a dataset of $$N$$ word-context pairs $$\left\{\left\langle w_{i}, c_{i}\right\rangle\right\}_{i=1}^{N}$$, and a pre-trained word embedding mapping $$E: V \rightarrow \mathbb{R}^{D}$$, the goal is to find a conditional distribution $$q(w \mid c)$$ from data, assuming the dataset consists of samples from some distribution $$p(w, c)=p(c) p(w \mid c)$$:

$\small \begin{aligned} q(w \mid c) &=\frac{1}{Z(c)} \exp \left\{\phi(\mathbf{w})^{\top} \mathbf{A} \psi(\mathbf{c})\right\} \\ Z(c) &=\sum_{w \in W} \exp \left\{\phi(\mathbf{w})^{\top} \mathbf{A} \psi(\mathbf{c})\right\} \end{aligned}$

where $$\mathbf{w}=E(w)$$ and $$\mathbf{c}=E(c)$$ are the static embeddings of the target word $$w$$ and the context $$c$$, respectively. The target word encoder $$\phi (\cdot)$$ and the context encoder $$\psi(\cdot)$$ are functions parameterized by feed-forward neural networks, and $$\mathbf{A}$$ is the $$K \times L$$ interaction matrix. The model parameters $$\phi$$, $$\psi$$ and $$\mathbf{A}$$ are trained with the cross-entropy loss:

$\small J(\phi, \psi, \mathbf{A})=-\sum_{n=1}^{N} \log q\left(w_{n} \mid c_{n}\right)$

The authors also mention that it is possible to set $$\psi=\phi$$, i.e., to use the same function to encode both the context and the target word, which reduces the number of parameters.
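The conditional distribution $$q(w \mid c)$$ is a softmax over bilinear scores, which the sketch below makes explicit. As a simplifying assumption, both encoders $$\phi$$ and $$\psi$$ are taken to be the identity (so $$\mathbf{A}$$ is $$D \times D$$); the function name and toy vectors are made up.

```python
import numpy as np

def lbl_conditional(target_vecs, context_vec, A):
    """q(w | c) over the target set W, with identity encoders phi and psi (an assumption)."""
    scores = target_vecs @ A @ context_vec    # phi(w)^T A psi(c) for every w in W
    scores = scores - scores.max()            # for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()            # division by Z(c)

# Toy data (made up): three target words with one-hot embeddings.
targets = np.eye(3)
A = np.eye(3)
context = np.array([2.0, 0.0, 0.0])
q = lbl_conditional(targets, context, A)
nll = -np.log(q[0])  # cross-entropy contribution if word 0 was observed with this context
```

Training then amounts to summing such negative log-probabilities over all observed word-context pairs and minimising the total with respect to the encoder parameters and $$\mathbf{A}$$.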