Meta Releases the First “Non-Parametric” Masked Language Model NPM!
It outperforms GPT-3, which has 500 times more parameters!

Although the performance of large-scale language models in NLP is impressive, the costs they bring are serious: they are expensive to train, difficult to update, and struggle with long-tail knowledge. In addition, these models usually make predictions through a softmax layer over a limited vocabulary, which almost never outputs rare words or phrases and greatly limits their expressiveness. To address the long-tail problem, researchers from the University of Washington, Meta AI and the Allen Institute for Artificial Intelligence have jointly proposed the first NonParametric Masked Language Model (NPM), which replaces the softmax output with a nonparametric distribution over every phrase in a reference corpus. NPM can be trained efficiently with a contrastive objective and an in-batch approximation of full-corpus retrieval.

The researchers performed zero-shot evaluations on nine closed-ended tasks and seven open-ended tasks, including spatiotemporal-shift and word-level translation tasks that emphasize predicting new facts or rare phrases. They found that NPM is significantly better than much larger parametric models, whether or not those models use retrieve-and-generate methods, including GPT-3, which has 500 times more parameters, and OPT 13B, which has 37 times more. NPM is especially good at rare patterns (word senses or facts) and at predicting rare or barely seen words, such as those written in non-Latin scripts.
The first non-parametric language model
Although this problem can be alleviated by combining existing retrieve-and-generate work, the final prediction in those models still goes through a softmax layer over tokens, which does not fundamentally solve the long-tail problem. NPM consists of an encoder and a reference corpus. The encoder maps text into fixed-size vectors, from which NPM retrieves a phrase to fill in [MASK]. In other words, NPM outputs a nonparametric distribution over phrases instead of a softmax over a fixed output vocabulary.
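To make this concrete, here is a minimal sketch of the retrieve-to-fill idea in Python. The encoder, the pooling, and all names are illustrative stand-ins rather than the paper's implementation; the point is only that the prediction is a nearest-neighbour lookup over corpus phrases instead of a softmax over a fixed vocabulary.

```python
# Minimal sketch of the retrieve-to-fill flow described above (hypothetical names).
import numpy as np

def encode(text: str) -> np.ndarray:
    """Placeholder for the encoder: returns one dense vector per token."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal((len(text.split()), 768))

def fill_mask(query: str, corpus_phrases: list[str]) -> str:
    """Fill [MASK] in `query` with the closest phrase from the reference corpus."""
    q_vec = encode(query).mean(axis=0)                          # stand-in query vector
    phrase_vecs = np.stack([encode(p).mean(axis=0) for p in corpus_phrases])
    sims = phrase_vecs @ q_vec                                  # dot-product similarity
    return corpus_phrases[int(sims.argmax())]

print(fill_mask("The capital of Greece is [MASK].", ["Thessaloniki", "Athens", "Seoul"]))
```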
But training non-parametric models also poses two key problems:
1. Retrieving from the complete corpus during training is very time-consuming and labour-intensive; the researchers solve this with an in-batch approximation of full-corpus retrieval;
2. Learning to predict phrases of arbitrary length without a decoder is difficult; the researchers address this by extending span masking with a phrase-level contrastive objective. In summary, NPM completely removes the softmax over the output vocabulary, enabling an effectively unbounded output space by predicting an arbitrary number of n-grams. The resulting model can predict extremely rare or even totally unseen words (such as Korean words) and can efficiently support unlimited vocabularies, which no existing model can do.
NPM method
The key idea of NPM is to use an encoder to map every phrase in a corpus into a dense vector space. At inference time, given a query containing [MASK], the encoder is used to find the closest phrase in the corpus and fill in [MASK]. Encoder-only models are competitive representation models, but existing ones cannot make predictions when the number of tokens to fill is unknown, which limits their use without fine-tuning. NPM solves this problem by retrieving a phrase that fills an arbitrary number of tokens at the [MASK] position.
Inference
The encoder maps every distinct phrase in a reference corpus C into a dense vector space. At test time, it maps the masked query into the same vector space and retrieves phrases from C to fill in [MASK]. Note that C does not have to be the same as the training corpus; it can be replaced or extended at test time without retraining the encoder. In practice, a corpus contains an enormous number of phrases, and indexing all of them is expensive: if we consider phrases of at most l tokens (l ≈ 20), we would need to index l×|C| vectors, which quickly becomes prohibitive. Instead, the researchers index each distinct token in C, reducing the index size from l×|C| to |C|, and at test time run a k-nearest-neighbour search separately over the start and end tokens of phrases.
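A rough sketch of this token-level indexing follows, using brute-force cosine similarity for clarity; a real system would use an approximate nearest-neighbour index, and all names here are illustrative assumptions.

```python
# Index one vector per token occurrence in C (|C| vectors rather than l*|C| phrase
# vectors), then run separate k-NN searches for plausible phrase starts and ends.
import numpy as np

def build_token_index(token_vectors: np.ndarray) -> np.ndarray:
    # token_vectors: (|C|, d), one row per token occurrence in the corpus.
    # Normalising rows lets a plain dot product act as cosine similarity.
    return token_vectors / np.linalg.norm(token_vectors, axis=1, keepdims=True)

def knn(index: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    # Brute-force k-nearest-neighbour search by similarity score.
    scores = index @ (query / np.linalg.norm(query))
    return np.argsort(-scores)[:k]

# q_start / q_end would come from the [MASKs] / [MASKe] positions of the masked
# query (see the training section below); starts and ends are retrieved
# independently and later paired into candidate phrases.
d = 16
index = build_token_index(np.random.default_rng(0).standard_normal((1000, d)))
q_start, q_end = np.random.default_rng(1).standard_normal((2, d))
print(knn(index, q_start, k=4), knn(index, q_end, k=4))
```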
Approximate nonparametric distribution
For example, the phrase Thessaloniki is composed of four BPE tokens and is represented by the concatenation of c1 and c4, the vectors of its first token (The) and its last token (iki). A query is likewise represented by two vectors, q_start and q_end, in the same vector space, and each is used to retrieve the start and the end of plausible phrases before aggregation. The premise is that the start and end representations are good enough, i.e. that q_start is close enough to c1 and q_end is close enough to c4, which is ensured during training.
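The aggregation step might look roughly like the sketch below, assuming a simple additive combination of the start and end similarities; the exact scoring function used in the paper is not reproduced here.

```python
# Score a candidate phrase spanning corpus positions i..j by how well its first
# token matches q_start and its last token matches q_end (illustrative only).
import numpy as np

def score_phrase(q_start: np.ndarray, q_end: np.ndarray,
                 token_index: np.ndarray, i: int, j: int) -> float:
    return float(token_index[i] @ q_start + token_index[j] @ q_end)

def best_phrase(q_start, q_end, token_index, candidate_spans):
    # candidate_spans: (i, j) pairs proposed by the separate start/end k-NN searches.
    return max(candidate_spans,
               key=lambda span: score_phrase(q_start, q_end, token_index, *span))
```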
Training
NPM is trained on unlabeled text data to ensure that the encoder maps text into a well-behaved dense vector space. Training NPM raises two main difficulties:
1.) full-corpus retrieval makes training very time-consuming;
2.) filling [MASK] with phrases of arbitrary length, rather than with single tokens, is hard.
1. Span Masking
Span masking masks contiguous tokens whose length is sampled from a geometric distribution. The researchers extend this in two ways:
1.) Segments are masked only if they co-occur in other sequences in the batch, which guarantees in-batch positives during training. For example, the masked segments ‘2010’, ‘the Seattle Seahawks’ and ‘to the’ all co-occur in another sequence. The bigram containing ‘game’, however, cannot be masked: although its tokens appear in both sequences, they do not co-occur there as a bigram.
2.) Instead of replacing each token in a segment with [MASK], the entire segment is replaced with two special tokens, [MASKs][MASKe]. In the example above, regardless of the length of the masked segment, it is replaced by [MASKs][MASKe], so that a start vector and an end vector are obtained for every segment, which makes inference more convenient (see the sketch below).
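As a rough illustration of both extensions, the sketch below samples a span length from a geometric distribution, keeps only spans that also occur in another in-batch sequence, and replaces the chosen span with [MASKs][MASKe]. The co-occurrence check is deliberately simplified to an exact string match, and all names are illustrative.

```python
# Simplified span masking with in-batch co-occurrence and [MASKs][MASKe] replacement.
import numpy as np

rng = np.random.default_rng(0)

def mask_one_span(tokens: list[str], other_sequences: list[list[str]], p: float = 0.2):
    length = min(int(rng.geometric(p)), len(tokens))        # span length ~ Geometric(p)
    for start in rng.permutation(len(tokens) - length + 1):
        span = tokens[start:start + length]
        # Keep the span only if it also occurs contiguously elsewhere in the batch.
        if any(" ".join(span) in " ".join(seq) for seq in other_sequences):
            return tokens[:start] + ["[MASKs]", "[MASKe]"] + tokens[start + length:], span
    return tokens, None                                     # no in-batch positive found

masked, target = mask_one_span(
    "the Seattle Seahawks won the game in 2010".split(),
    ["the Seattle Seahawks went on to win again in 2010".split()],
)
print(masked, target)
```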
2. Training objectives
Suppose the masked segment is ‘the Seattle Seahawks’. At test time the model should retrieve the phrase ‘the Seattle Seahawks’ from other sequences in the reference corpus: during inference the model takes the vectors at [MASKs] and [MASKe] and uses them to retrieve the start and the end of phrases from the corpus, respectively. The training objective should therefore encourage the [MASKs] vector to be close to the ‘the’ in ‘the Seattle Seahawks’ and far from other tokens, including occurrences of ‘the’ in other phrases, such as the ‘the’ in ‘become the first’. This is achieved by approximating full-corpus retrieval with the other sequences in the batch: the model is trained to retrieve the start and end of the ‘the Seattle Seahawks’ segment from other sequences in the same batch. Note that the masking strategy above guarantees that every masked span has a co-occurring segment somewhere in the batch.
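A simplified version of this in-batch contrastive objective for the start side might look as follows; this is a sketch in PyTorch with illustrative names, and the paper's actual loss additionally covers the end side and handles multiple positives.

```python
# In-batch contrastive loss for the [MASKs] vector: the start token of the
# co-occurring span is the positive, every other in-batch token is a negative.
import torch
import torch.nn.functional as F

def start_contrastive_loss(mask_s_vec: torch.Tensor,
                           batch_token_vecs: torch.Tensor,
                           positive_idx: int,
                           temperature: float = 0.07) -> torch.Tensor:
    # mask_s_vec: (d,) vector at the [MASKs] position of the masked sequence.
    # batch_token_vecs: (N, d) vectors of tokens from the other in-batch sequences.
    # positive_idx: index of 'the' in 'the Seattle Seahawks' within batch_token_vecs.
    logits = batch_token_vecs @ mask_s_vec / temperature    # (N,) similarities
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([positive_idx]))
```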
Experiments
From the results, NPM outperforms the other baseline models in the zero-shot setting. Among the parametric models, RoBERTa achieves the best performance, unexpectedly surpassing models including GPT-3, probably because the bidirectionality of the encoder-only model plays a crucial role; this also suggests that causal language models may not be an appropriate choice for classification. The kNN-LM approach, which adds a nonparametric component to a parametric model, outperforms all the other baselines. Nonetheless, relying on retrieval alone (kNN over GPT-2) performs poorly, suggesting that using kNN only at inference time is of limited benefit.
Both NPM SINGLE and NPM significantly outperform all baselines, achieving consistently superior performance on all datasets. This shows that nonparametric models are very competitive even on tasks that do not explicitly require external knowledge.

A qualitative analysis compares the predictions of RoBERTa and NPM on a sentiment analysis task. The first example uses cheap to mean not expensive, and the second uses cheap to mean poor quality. RoBERTa predicts positive for both examples, while NPM makes the correct predictions by retrieving contexts that use cheap in the same sense as the input. The representations output by NPM also lead to better word sense disambiguation: RoBERTa assigns a high similarity score to cheap (not expensive) and cheap (very poor quality), whereas NPM assigns the two senses a low similarity score. This shows that nonparametric training with a contrastive objective is effective and improves representation learning, something that kNN inference applied to a conventionally trained model cannot do at all.
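The word-sense comparison above boils down to measuring similarity between contextual vectors of the same surface word in different contexts. The probe below is a hedged illustration of that idea; the model choice (roberta-base) and the exact recipe are assumptions for demonstration, not the paper's evaluation setup.

```python
# Compare contextual vectors of the word "cheap" in two different senses.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()

def word_vector(sentence: str, word: str = " cheap") -> torch.Tensor:
    enc = tok(sentence, return_tensors="pt")
    target_id = tok.encode(word, add_special_tokens=False)[0]  # id of the word's first subtoken
    pos = (enc["input_ids"][0] == target_id).nonzero()[0, 0]   # its position in the sentence
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]              # (seq_len, hidden_dim)
    return hidden[pos]

v1 = word_vector("The flight was cheap, so we booked it immediately.")   # cheap = not expensive
v2 = word_vector("The build quality feels cheap and flimsy.")            # cheap = poor quality
print(torch.cosine_similarity(v1, v2, dim=0).item())
```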
References:
https://arxiv.org/abs/2212.01349
This article originally appeared on the WeChat public account Xinzhiyuan (ID: AI_era).