A Contrastive Framework for Neural Text Generation

Authors: Yixuan Su, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, and Nigel Collier

This repository contains code, models, and other related resources of our paper "A Contrastive Framework for Neural Text Generation".

🌟 Check out this great [blog] and this awesome [demo], both generously supported by Huggingface (@huggingface 🤗), which compare contrastive search with other popular decoding methods. Many thanks to Huggingface 🤗!

[Use contrastive search in Huggingface transformers] In this tutorial, we demonstrate how to use contrastive search in Huggingface transformers.

If you find our paper and resources useful, please kindly leave a star and cite our papers. Thanks!

@inproceedings{su2022a,
  title={A Contrastive Framework for Neural Text Generation},
  author={Yixuan Su and Tian Lan and Yan Wang and Dani Yogatama and Lingpeng Kong and Nigel Collier},
  booktitle={Advances in Neural Information Processing Systems},
  editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
  year={2022},
  url={https://openreview.net/forum?id=V88BafmH9Pj}
}

@article{su2023contrastive,
  title={Contrastive Search Is What You Need For Neural Text Generation},
  author={Yixuan Su and Nigel Collier},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=GbkWw3jwL9}
}

News:

[2022/11/22] Released a technical report that compares Contrastive Search with another recently proposed method, i.e. Contrastive Decoding. Check out our [paper] and [code].
[2022/11/08] 🔥 Contrastive Search is now officially supported by HuggingFace for both PyTorch and TensorFlow platforms! Check out this great [blog] and this awesome [demo], both generously supported by Huggingface (@huggingface 🤗).
[2022/10/26] 🔥 We have released a new manuscript "Contrastive Search Is What You Need For Neural Text Generation" which has two takeaways: (1) Autoregressive language models are naturally isotropic, therefore SimCTG training may not be necessary; (2) Contrastive search works exceptionally well on off-the-shelf language models across 16 languages. On 12 out of the 16 evaluated languages, it even performs comparably with human-written text! The paper and code are both released. Check it out!
[2022/10/13] We have added a concise explanation of the implementation of contrastive search. Please find it [here].
[2022/09/14] 😊 SimCTG is accepted to NeurIPS 2022!
[2022/09/06] 🔥 We have added support for the newly released OPT models (see "OPT: Open Pre-trained Transformer Language Models") by Meta. To see how to apply contrastive search to OPT models, check [here]!
[2022/06/03] 🔥 We have released an easy-to-use library (i.e., simctg) which allows you to use SimCTG with a simple pip install simctg and a few lines of code. Check the comprehensive and huggingface-style tutorials [here] and [here]!
[2022/05/06] ⭐ We have released MAGIC, a follow-up work to SimCTG, which is the SOTA method in zero-shot multi-modal text generation tasks (e.g., zero-shot image captioning and visually grounded story generation). Check it out! [paper] [code]
[2022/04/16] We have updated instructions on how to apply contrastive search to encoder-decoder models (e.g. BART and T5). More details can be found [here].
[2022/04/07] SimCTG has been publicly deployed in the AI sentence completion module of Tencent's Effidit platform (腾讯智能创作助手). Check it out and have fun!
[2022/04/02] Added support for another benchmark (ROCStories) for the task of open-ended story generation.
[2022/03/06] Released an example of how to adapt our approach to open-ended story generation.
[2022/02/15] SimCTG is publicly released!

Apply Contrastive Search in Huggingface Transformers:

Here, we demonstrate how to use contrastive search in Huggingface transformers.

(1) Install Environment:

First, to install the required packages, please run the following commands:

pip install torch
pip install "transformers>=4.24.0"
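As a quick sanity check after installation, the sketch below tests whether the installed transformers version meets the 4.24.0 minimum from the command above. The helper names (`parse_version`, `supports_contrastive_search`) are hypothetical and only for illustration; the comparison looks at the leading numeric components of the version string and ignores pre-release suffixes.

```python
import re
from importlib import metadata

MIN_VERSION = (4, 24, 0)  # minimum taken from the pip requirement above


def parse_version(text):
    """Return the leading numeric components of a version string, padded to 3 parts."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_contrastive_search(version_text):
    """True if the version string is at least MIN_VERSION (hypothetical helper)."""
    return parse_version(version_text) >= MIN_VERSION


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
        print(installed, supports_contrastive_search(installed))
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
```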

(2) Generate Text with Contrastive Search:

Next, we show how to reproduce the result in Section 2.2.

# load the LMs
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
model_name = 'gpt2-large'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

# prepare the prefix
prefix_text = r"DeepMind Company is"
input_ids = tokenizer(prefix_text, return_tensors='pt').input_ids

# generate the result with contrastive search
output = model.generate(input_ids, penalty_alpha=0.6, top_k=4, max_length=512)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("" + 100 * '-')

Model Output:

Output:
----------------------------------------------------------------------------------------------------  
DeepMind Company is a leader in artificial intelligence (AI). We have a long history of working
with companies such as Google, Facebook, Amazon, and Microsoft to build products that improve
people's lives, and today we are excited to announce that DeepMind's AlphaGo program has won the
game of Go, becoming the first program to defeat a professional Go player.

The victory is a testament to the power of deep learning, and to the incredible work of our
research team, which has been at the forefront of AI research for the past five years. AlphaGo
is one of the most advanced Go programs ever created, and its performance is an important step
towards the goal of human-level AI.

"This is the culmination of a decade of hard work," said Andy Ng, co-founder and CTO of DeepMind.
"We are thrilled to have achieved this milestone and look forward to continuing to develop AI that
can be used in a wide range of applications and to help people live better lives."

DeepMind's work on Go began in 2010, when it began to train a neural network to play Go using
millions of games played by top Go players around the world. Since then, the team has refined the
algorithm, adding more and more layers of reinforcement learning to make it better at recognizing
patterns and making decisions based on those patterns. In the past year and a half, the team has
made significant progress in the game, winning a record-tying 13 games in a row to move into the
top four of the world rankings.

"The game of Go is a complex game in which players have to be very careful not to overextend their
territory, and this is something that we have been able to improve over and over again," said
Dr. Demis Hassabis, co-founder and Chief Scientific Officer of DeepMind. "We are very proud of our
team's work, and we hope that it will inspire others to take the next step in their research and
apply the same techniques to other problems."

In addition to the win in Go, DeepMind has also developed an AI system that can learn to play a
number of different games, including poker, Go, and chess. This AI system, called Tarsier, was
developed in partnership with Carnegie Mellon University and the University of California, 
Berkeley, and is being used to teach computer vision and machine learning to identify objects in
images and recognize speech in natural language. Tarsier has been trained to play the game of Go
and other games on a number of different platforms...
----------------------------------------------------------------------------------------------------

(3) Huggingface Demo:

Also check out this awesome [demo] generously supported by Huggingface (@huggingface 🤗) which compares contrastive search with other popular decoding methods. Many thanks to Huggingface!

1. Introduction: [Back to Top]

Text generation is of great importance to many natural language processing applications. However, maximization-based decoding methods (e.g. beam search) of neural language models often lead to degenerate solutions---the generated text is unnatural and contains undesirable repetitions. Existing approaches introduce stochasticity via sampling or modify training objectives to decrease probabilities of certain tokens (e.g., unlikelihood training). However, they often lead to solutions that lack coherence. In this work, we show that an underlying reason for model degeneration is the anisotropic distribution of token representations. We present a contrastive solution: (i) SimCTG, a contrastive training objective to calibrate the model's representation space, and (ii) a decoding method---contrastive search---to encourage diversity while maintaining coherence in the generated text. Extensive experiments and analyses on three benchmarks from two languages demonstrate that our proposed approach outperforms state-of-the-art text generation methods as evaluated by both human and automatic metrics.
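To make the decoding idea concrete, here is a minimal sketch of a single contrastive search step, following the scoring rule described in the paper: each top-k candidate is scored by its model probability minus a degeneration penalty, where the penalty is the maximum cosine similarity between the candidate's representation and the representations of the tokens already in the context. The function names and the toy probabilities and hidden states are assumptions made up for illustration; this is not the package's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_step(candidates, context_states, alpha):
    """Pick one token among the top-k candidates.

    candidates: list of (token, probability, hidden_state) for the top-k tokens
    context_states: hidden states of the tokens already in the context
    alpha: weight of the degeneration penalty (alpha = 0 reduces to greedy search)
    """
    best_token, best_score = None, -math.inf
    for token, prob, state in candidates:
        # degeneration penalty: how close the candidate is to any previous token
        penalty = max(cosine(state, h) for h in context_states)
        score = (1.0 - alpha) * prob - alpha * penalty
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# toy, hand-made numbers (not produced by any model)
context_states = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("the", 0.50, [1.0, 0.0]),    # most likely, but identical to a context state
    ("a", 0.30, [0.6, 0.8]),
    ("new", 0.20, [-1.0, -1.0]),  # least likely, but far from the context
]

print(contrastive_step(candidates, context_states, alpha=0.6))
print(contrastive_step(candidates, context_states, alpha=0.0))
```

With these toy numbers, alpha = 0.6 selects the least probable but most dissimilar candidate, while alpha = 0.0 falls back to the most probable one.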

2. Contrastive Search with GPT-2 and OPT: [Back to Top]

In this section, we illustrate how to apply contrastive search on GPT-2 models and the OPT models released by Meta.

2.1. Environment Setup:

To install our simctg package, simply run the command below. More details of the simctg package can be found [here] and [here].

pip install simctg --upgrade

2.2. Contrastive Search with GPT-2:

Let's see how to produce text with contrastive search using GPT-2 models. More details can be found here.

(i) First, we load the GPT-2 model as

import torch
from simctg.simctggpt import SimCTGGPT
model_name = r'gpt2-large'
model = SimCTGGPT(model_name)
model.eval()
tokenizer = model.tokenizer
eos_token_id = tokenizer.eos_token_id

(ii) Then, we prepare the prefix text as

prefix_text = r"DeepMind Company is"
tokens = tokenizer.tokenize(prefix_text)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.LongTensor(input_ids).view(1,-1)

(iii) Last, we generate the text with contrastive search as

beam_width, alpha, decoding_len = 4, 0.6, 256
output = model.fast_contrastive_search(input_ids=input_ids, beam_width=beam_width,
                                       alpha=alpha, decoding_len=decoding_len,
                                       end_of_sequence_token_id=eos_token_id, early_stop=True)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output))
print("" + 100 * '-')

Model Output:

Output:
----------------------------------------------------------------------------------------------------
DeepMind Company is a leader in artificial intelligence (AI). We have a long history of working with 
companies such as Google, Facebook, Amazon, and Microsoft to build products that improve people's 
lives, and today we are excited to announce that DeepMind's AlphaGo program has won the game of Go,
becoming the first program to defeat a professional Go player.

The victory is a testament to the power of deep learning, and to the incredible work of our research
team, which has been at the forefront of AI research for the past five years. AlphaGo is one of the
most advanced Go programs ever created, and its performance is an important step towards the goal of
human-level AI.

"This is the culmination of a decade of hard work," said Andy Ng, co-founder and CTO of DeepMind. 
"We are thrilled to have achieved this milestone and look forward to continuing to develop AI that can
be used in a wide range of applications and to help people live better lives."

DeepMind's work on Go began in 2010, when it began to train a neural network to play Go using millions
of games played by top Go players around the world. Since then, the team has refined the algorithm,
adding more and more layers of reinforcement learning to make it...
----------------------------------------------------------------------------------------------------

For comparison, let's see the result generated by greedy search.

decoding_len = 256
output = model.greedy_search(input_ids=input_ids, decoding_len=decoding_len,
                             end_of_sequence_token_id=eos_token_id, early_stop=True)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output))
print("" + 100 * '-')

Model Output:

Output:
----------------------------------------------------------------------------------------------------
DeepMind Company is a leading AI research company, with a focus on deep learning and deep learning-based systems.

The company's research is focused on the development of deep learning-based systems that can learn from large 
amounts of data, and that can be used to solve real-world problems.

DeepMind's research is also used by the UK government to develop new technologies for the UK's National Health Service.

DeepMind's research is also used by the UK government to develop new technologies for the UK's National Health Service.

DeepMind's research is also used by the UK government to develop new technologies for the UK's National Health Service.

DeepMind's research is also used by the UK government to develop new technologies for the UK's National Health Service.

DeepMind's research is also used by the UK government to develop new technologies for the UK's National Health Service.

DeepMind's research is also used by the UK government to develop new technologies for the UK's National Health Service.

DeepMind's research is also used by the UK government to develop new technologies for the UK's National Health Service.

DeepMind's research is also used by the UK government to develop new technologies for the UK's National Health Service...
----------------------------------------------------------------------------------------------------
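One common way to quantify the repetition visible in the greedy output is an n-gram repetition rate: the percentage of n-grams in a text that duplicate an earlier n-gram. The sketch below is a simplified version that assumes whitespace tokenization; the function name `rep_n` and the sample strings are chosen for illustration only.

```python
def rep_n(text, n):
    """Percentage of duplicated n-grams in a whitespace-tokenized text."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * (1.0 - len(set(ngrams)) / len(ngrams))


# a looping passage in the style of the greedy output above
looping = " ".join(
    ["DeepMind's research is also used by the UK government."] * 8
)
# a passage without repeated phrases
varied = (
    "DeepMind Company is a leader in artificial intelligence. "
    "The victory is a testament to the power of deep learning."
)

for name, sample in [("looping", looping), ("varied", varied)]:
    print(name, [round(rep_n(sample, n), 1) for n in (2, 3, 4)])
```

A text that loops on the same sentence scores close to 100, while a text with no repeated phrases scores close to 0.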

We can also see the result generated by nucleus sampling (p=0.95).

decoding_len = 256
output = model.nucleus_sampling(input_ids=input_ids, decoding_len=decoding_len, nucleus_p=0.95,
                                end_of_sequence_token_id=eos_token_id, early_stop=True)
print("Output:\n" + 100 * '-')
# no BOS token was prepended to the GPT-2 prompt, so decode the full sequence
print(tokenizer.decode(output))
print("" + 100 * '-')

Model Output:

Output:
----------------------------------------------------------------------------------------------------
DeepMind Company is a Cardiff-based start-up with an exclusive mission to build the world’s largest ever
deep-learning system to analyse the world’s digital content and in particular, super-sized image content.
  
The system, the largest in the world with no previous expertise in image or digital content detection, 
will have previously relied on a mixture of machine learning, artificial neural networks, and storage,
processing and retrieval techniques.
  
The AI system, called ImageNet, will take new approach to our challenge of data science and machine
learning, significantly improving efficiency, natural language processing and full understanding of
complex, high-dimensional images, with an Eye of the Tiger framework for extracting techniques to
ensure correct detection of particular images in complex scenes.
  
Dr. Mark Ward, Dr. Alex Kudle, Dr. Ralph Pinchbeck and CTO, DeepMind Dr. Alex Kudle
  
Case Study: Derpy’s Most Wanted: Fighting Cybersecurity, building a robot-aided smuggling network
  
InfoSec News, 06/07/2017
  
Dimitrios Papadimitriou (left) and Chris Bardy (right) at G+ XE, July 2017
  
How to model an industrial malware botn...
----------------------------------------------------------------------------------------------------
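For readers unfamiliar with nucleus sampling, the following is a minimal sketch of the idea, not the package's implementation: keep the smallest set of most probable tokens whose cumulative probability reaches p, renormalize, and sample from that set. The function names and the toy distribution are assumptions for illustration.

```python
import random


def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    # renormalize so the kept probabilities sum to 1
    return {token: prob / total for token, prob in kept}


def nucleus_sample(probs, p, rng):
    """Draw one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# toy next-token distribution (made up for illustration)
toy_probs = {"a": 0.6, "b": 0.3, "c": 0.08, "d": 0.02}
rng = random.Random(0)
print(sorted(nucleus_filter(toy_probs, 0.95)))
print(nucleus_sample(toy_probs, 0.95, rng))
```

Because a token is drawn at random at every step, the output varies from run to run, which is one reason sampled text can drift off topic as in the example above.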

2.3. Contrastive Search with OPT:

Let's see how to produce text with contrastive search using OPT models. More details can be found here.

(i) First, we load the OPT model as

import torch
from simctg.simctgopt import SimCTGOPT
model_name = 'facebook/opt-6.7b'
model = SimCTGOPT(model_name)
tokenizer = model.tokenizer
model.eval()
bos_token_id = tokenizer.bos_token_id
eos_token_id = tokenizer.eos_token_id

(ii) Then, we use the same example from the original paper (see Figure 9 in Appendix E) to show how to generate text with contrastive search. The prefix text is provided as

prefix_text = r"""A chat between a curious human and the Statue of Liberty.

Human: What is your name?
Statue: I am the Statue of Liberty.
Human: Where do you live?
Statue: New York City.
Human: How long have you lived there?"""

(iii) We prepare the input ids as

[Important Tip] As the authors suggested in their [tutorial], in contrast to GPT-2, OPT adds the EOS token (</s>) to the beginning of every prompt. So make sure the special token is added at the front of the prompt.

tokens = tokenizer.tokenize(prefix_text)
input_ids = [bos_token_id] + tokenizer.convert_tokens_to_ids(tokens) # adds </s> to the beginning of every prompt
input_ids = torch.LongTensor(input_ids).view(1,-1)
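The prepend-then-strip pattern used for OPT (adding the special token before generation and dropping it again with `output[1:]` before decoding) can be wrapped in two small helpers. This is only a sketch on plain Python lists; `add_bos`, `strip_bos`, and the placeholder id are hypothetical names and values, not part of the simctg package.

```python
def add_bos(token_ids, bos_token_id):
    """Prepend the BOS id unless the sequence already starts with it."""
    if token_ids and token_ids[0] == bos_token_id:
        return list(token_ids)
    return [bos_token_id] + list(token_ids)


def strip_bos(token_ids, bos_token_id):
    """Drop a leading BOS id before decoding, if present."""
    if token_ids and token_ids[0] == bos_token_id:
        return list(token_ids[1:])
    return list(token_ids)


PLACEHOLDER_BOS = 0           # placeholder; use tokenizer.bos_token_id in practice
prompt_ids = [11, 12, 13]     # placeholder token ids

with_bos = add_bos(prompt_ids, PLACEHOLDER_BOS)
print(with_bos)
print(strip_bos(with_bos, PLACEHOLDER_BOS))
```

Making `add_bos` idempotent guards against prepending the token twice when the ids come from a tokenizer call that already inserts it.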

(iv) Last, we generate the text with contrastive search as

beam_width, alpha, decoding_len = 5, 0.6, 256
output = model.fast_contrastive_search(input_ids=input_ids, beam_width=beam_width, 
                                       alpha=alpha, decoding_len=decoding_len,
                                       end_of_sequence_token_id = eos_token_id, early_stop = True) 
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[1:]))
print("" + 100 * '-')

Model Output:

Output:
----------------------------------------------------------------------------------------------------
A chat between a curious human and the Statue of Liberty.

Human: What is your name?
Statue: I am the Statue of Liberty.
Human: Where do you live?
Statue: New York City.
Human: How long have you lived there?
Statue: Since 1884.
Human: Why did you come to America?
Statue: I was given to the United States by France as a gift for helping the French during the Franco-Prussian War.
Human: What do you think of America?
Statue: I love it. It is the greatest country in the world.
Human: What’s the weather like in New York?
Statue: It is cold.
Human: Is it safe to walk around at night?
Statue: Yes. There are policemen everywhere.
Human: Do you have any children?
Statue: Not yet. My pedestal is empty.
Human: What would you like to say to people who want to immigrate to America?
Statue: Come on over. You will be happy here. We have everything you need.
----------------------------------------------------------------------------------------------------

For comparison, let's see the result generated by greedy search.

decoding_len = 256
output = model.greedy_search(input_ids=input_ids, decoding_len=decoding_len,
                             end_of_sequence_token_id=eos_token_id, early_stop=True)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[1:]))
print("" + 100 * '-')

Model Output:

Output:
----------------------------------------------------------------------------------------------------
A chat between a curious human and the Statue of Liberty.

Human: What is your name?
Statue: I am the Statue of Liberty.
Human: Where do you live?
Statue: New York City.
Human: How long have you lived there?
Statue: I have lived here for over 100 years.
Human: What do you do?
Statue: I welcome people from all over the world to come to America.
Human: What do you think of America?
Statue: I love America.
Human: What do you think of immigrants?
Statue: I love immigrants.
Human: What do you think of America?
Statue: I love America.
Human: What do you think of immigrants?
Statue: I love immigrants.
Human: What do you think of America?
Statue: I love America.
Human: What do you think of immigrants?
Statue: I love immigrants.
Human: What do you think of America?
Statue: I love America.
Human: What do you think of immigrants?
Statue: I love immigrants.
Human: What do you think of America?
Statue: I love America.
Human: What do you think of immigrants?
Statue: I love immigrants.
Human: What do you think of America?
Statue: I love America.
Human: What do you think of immigrants?
Statue: I love immigrants.
Human...
----------------------------------------------------------------------------------------------------

We can also see the result generated by nucleus sampling (p=0.95).

decoding_len = 256
output = model.nucleus_sampling(input_ids=input_ids, decoding_len=decoding_len, nucleus_p=0.95,
                                end_of_sequence_token_id=eos_token_id, early_stop=True)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[1:]))
print("" + 100 * '-')

Model Output:

Output:
----------------------------------------------------------------------------------------------------
A chat between a curious human and the Statue of Liberty.

Human: What is your name?
Statue: I am the Statue of Liberty.
Human: Where do you live?
Statue: New York City.
Human: How long have you lived there?
Statue: Since 1876.
Human: Why is the Statue of Liberty guarded?
Statue: Because there are many people trying to steal her.

a comparison about an unexpressed thought

I would also share the story of “A Humble Fear.” At a conference in New York the Dalai Lama gave a 
speech to the International Thinkers Congress in New York. The whole thing was recorded, and the 
video is quite interesting. (on a side note, I love the fact that there were some people who laughed
when he described himself as a humble being… I think the video is hilarious, there is a reason why
I put up the video. Because if you cannot find the humor in this you’re sadly lacking…)

In the speech, the Dalai Lama compares the search for truth to searching for treasure. He says: 
“However there is a huge difference between being a thief and a collector. A thief simply takes things, 
whereas a collector looks for the beauty, even if it is just a single object.”

The above quote is perhaps the most cliched Buddhist philosophy of our times. However the comparison
between a collector and a thief is quite interesting. I like to think that the Buddha...
----------------------------------------------------------------------------------------------------

3. Huggingface Models: [Back to Top]

Model Name	Task	Language	Training Corpus (Size)	Model Size	Model Address
cambridgeltl/simctg_wikitext103	Document Generation	English	Wikitext-103 (529MB)	117M	[link]
cambridgeltl/simctg_lccc_dialogue	Open-domain Dialogue Generation	Chinese	LCCC (708MB)	117M	[link]
cambridgeltl/simctg_english_wikipedia	General Domain Pre-training	English	Wikipedia (14.11GB)	117M	[link]
cambridgeltl/simctg_writingprompts	Open-Ended Story Generation	English	WritingPrompts (865MB)	117M	[link]
cambridgeltl/simctg_rocstories	Open-Ended Story Generation	English	ROCStories (12MB)	117M	[link]

4. Huggingface-Style Tutorials: [Back to Top]

We have encapsulated our work as an easy-to-use library (i.e., package). In the following, we provide huggingface-style tutorials on how to use SimCTG and contrastive search with just a few lines of code!

⭐ [Documentation] We have provided detailed documentation of (i) the source code of the package and (ii) instructions on how to use it. Please see [here].

⭐ [Google Colab] We provide a Google Colab for easy reproducibility of our tutorial.

4.1. Install and Load SimCTG:

To use our package, we recommend using Python >= 3.6. SimCTG can be installed and loaded with the commands below.

(1) Install SimCTG with pip.

pip install simctg --upgrade

(2) Load SimCTG package with Python.

import torch
# load SimCTG language model
from simctg.simctggpt import SimCTGGPT
# load SimCTG loss class
from simctg.lossfunction import SimCTGLoss

4.2. Example of Training Language Model with SimCTG:

4.2.1. Initialize Language Model:

model_name = r'gpt2'
# initialize the language model with a vanilla GPT-2
model = SimCTGGPT(model_name)
tokenizer = model.tokenizer

🔔 The detailed description of SimCTGGPT can be found [here].

4.2.2. Initialize Loss Class:

margin = 0.5
vocab_size = len(tokenizer)
# GPT-2 defines no pad token, so the BOS token id is reused for padding
pad_token_id = tokenizer.bos_token_id
simctgloss = SimCTGLoss(margin=margin, vocab_size=vocab_size, pad_token_id=pad_token_id)

🔔 The detailed description of SimCTGLoss can be found [here].

[Note] If the margin is set as 0.0, then the SimCTG loss is equivalent to the MLE loss.

4.2.3. Create Example Training Data:

from torch.nn.utils import rnn
text_list = ['Pandas are so cute!', 'The weather in Cambridge today is very good!']
# transform batch of texts to batch of token ids
tokens_list = [tokenizer.tokenize(text) for text in text_list]
batch_id_list = [tokenizer.convert_tokens_to_ids(item) for item in tokens_list]
batch_id_list = [torch.LongTensor(item) for item in batch_id_list]
# pad the batch of token ids
batch_tensor = rnn.pad_sequence(batch_id_list, batch_first=True, padding_value=pad_token_id)
# get batch input ids and batch label ids
batch_inputs = batch_tensor[:, :-1].clone()
batch_labels = batch_tensor[:, 1:].clone()
# by setting pad token ids as -100, we stop the gradient update on these padded positions
batch_labels[batch_labels[:, :] == pad_token_id] = -100
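The padding, shifting, and label-masking steps above can be traced on plain Python lists. The sketch below mirrors the same three operations without torch; `pad_shift_mask`, `IGNORE_INDEX`, and the toy ids are illustrative names and values, not part of the package.

```python
IGNORE_INDEX = -100  # label value that is skipped when the loss is computed


def pad_shift_mask(sequences, pad_id):
    """Pad to equal length, then build next-token inputs and masked labels."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # inputs drop the last position, labels drop the first (next-token prediction)
    inputs = [row[:-1] for row in padded]
    labels = [
        [IGNORE_INDEX if tok == pad_id else tok for tok in row[1:]]
        for row in padded
    ]
    return inputs, labels


# toy token ids; 0 plays the role of the pad id
inputs, labels = pad_shift_mask([[5, 6, 7], [8, 9]], pad_id=0)
print(inputs)
print(labels)
```

One consequence of masking by value, which follows from the code itself: if the pad id is shared with a real token (here it is reused from the BOS token), genuine occurrences of that token in the labels are masked as well.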

4.2.4. Compute Loss:

# forward computation
last_hidden_states, logits = model(input_ids=batch_inputs, labels=batch_labels)
# loss computation
mle_loss, cl_loss = simctgloss(last_hidden_states=last_hidden_states, logits=logits, 
                               input_ids=batch_inputs, labels=batch_labels)
simctg_loss = mle_loss + cl_loss

[Note] If the margin in SimCTG loss is set as 0.0, the returned cl_loss will always be 0.0 and the SimCTG loss is equivalent to the MLE loss.
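To see why a zero margin removes the contrastive term, consider a sketch of the margin-based contrastive loss as defined in the paper: for every pair of distinct positions i and j it adds max(0, margin - s(h_i, h_i) + s(h_i, h_j)), where s is cosine similarity, and averages over all pairs. Since s(h_i, h_i) = 1 and s(h_i, h_j) <= 1, each term is zero when the margin is 0. The code below is an illustration on toy vectors and is not the SimCTGLoss implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(states, margin):
    """Average hinge term over all ordered pairs of distinct positions."""
    n = len(states)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            self_sim = cosine(states[i], states[i])    # equals 1 up to rounding
            cross_sim = cosine(states[i], states[j])
            total += max(0.0, margin - self_sim + cross_sim)
    return total / (n * (n - 1))


# toy token representations (made up for illustration)
toy_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(contrastive_loss(toy_states, margin=0.0))
print(contrastive_loss(toy_states, margin=0.5))
```

With a positive margin, pairs whose similarity exceeds 1 - margin contribute to the loss, which pushes the representations of distinct tokens apart.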

4.3. Examples of Performing Generation with Contrastive Search:

4.3.1. Open-Ended Document Generation:

We show how to reproduce our result in the case study (i.e., Table 4) of our paper.

import torch
# load SimCTG language model
from simctg.simctggpt import SimCTGGPT
model_name = r'cambridgeltl/simctg_wikitext103'
model = SimCTGGPT(model_name)
model.eval()
tokenizer = model.tokenizer

# prepare input
prefix_text = r"Butt criticized Donald 's controls in certain situations in the game , as well as the difficulty of some levels and puzzles . Buchanan also criticized the controls , calling"
print ('Prefix is: {}'.format(prefix_text))
tokens = tokenizer.tokenize(prefix_text)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.LongTensor(input_ids).view(1,-1)

# generate result
beam_width, alpha, decoding_len = 8, 0.6, 128
output = model.fast_contrastive_search(input_ids=input_ids, beam_width=beam_width, 
                                       alpha=alpha, decoding_len=decoding_len) 
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output))

Model Output:

Output:
----------------------------------------------------------------------------------------------------
Butt criticized Donald's controls in certain situations in the game, as well as the difficulty of some
levels and puzzles. Buchanan also criticized the controls, calling them " unimpressive " and a " nightmare "
of an experience to play with players unfamiliar with Tetris. On the other hand, his opinion was shared by
other reviewers, and some were critical of the game's technical design for the Wii version of Tetris.
In addition, Tintin's review included a quote from Roger Ebert, who said that Tetris was better than the
original game due to its simplicity and ease of play. Ebert's comments were included in the game's DVD
commentary, released on March 22, 2010. It is unclear if any of the video commentary was taken from the DVD...

4.3.2. Open-Domain Dialogue Generation:

We show how to reproduce our result in the case study (i.e., Table 7) of our paper.

import torch
# load SimCTG language model
from simctg.simctggpt import SimCTGGPT
model_name = r'cambridgeltl/simctg_lccc_dialogue'
model = SimCTGGPT(model_name)
model.eval()
tokenizer = model.tokenizer
eos_token = '[SEP]'
eos_token_id = tokenizer.convert_tokens_to_ids([eos_token])[0]

# prepare input
context_list = ['刺猬很可爱！以前别人送了只没养，味儿太大！', '是很可爱但是非常臭', '是啊，没办法养', '那个怎么养哦不会扎手吗']
prefix_text = eos_token.join(context_list).strip(eos_token) + eos_token
print ('Prefix is: {}'.format(prefix_text))
tokens = tokenizer.tokenize(prefix_text)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.LongTensor(input_ids).view(1,-1)

# generate result
beam_width, alpha, decoding_len = 5, 0.6, 64
output = model.fast_contrastive_search(input_ids=input_ids, beam_width=beam_width, alpha=alpha, 
                                       decoding_len=decoding_len, end_of_sequence_token_id=eos_token_id,
                                       early_stop=True) 
print("Output:\n" + 100 * '-')
print(''.join(tokenizer.decode(output).split()))

Model Output:

Output:
----------------------------------------------------------------------------------------------------
刺猬很可爱！以前别人送了只没养，味儿太大！[SEP]是很可爱但是非常臭[SEP]是啊，没办法养[SEP]那个怎么养哦不会扎手吗[SEP]我觉得还好，就是有点臭
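The prefix construction and the decoded output above can be handled with two small string helpers. The sketch below is an assumption-labeled illustration: `build_dialogue_prefix` and `extract_response` are hypothetical names, and the separator is the `[SEP]` token used in the example. Note that Python's `str.strip(chars)` treats its argument as a set of characters rather than a substring, so the helper joins the turns directly instead of stripping the separator.

```python
SEP = "[SEP]"


def build_dialogue_prefix(turns, sep=SEP):
    """Join dialogue turns with the separator and end with one trailing separator."""
    return sep.join(turn.strip() for turn in turns) + sep


def extract_response(decoded, prefix, sep=SEP):
    """Return the generated reply: the text after the prefix, up to the next separator."""
    if not decoded.startswith(prefix):
        raise ValueError("decoded text does not start with the prefix")
    return decoded[len(prefix):].split(sep)[0]


context_list = [
    "刺猬很可爱！以前别人送了只没养，味儿太大！",
    "是很可爱但是非常臭",
    "是啊，没办法养",
    "那个怎么养哦不会扎手吗",
]
prefix = build_dialogue_prefix(context_list)
# decoded string copied from the model output shown above
decoded = prefix + "我觉得还好，就是有点臭"
print(extract_response(decoded, prefix))
```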

4.4. Contrastive Search with Off-the-shelf Language Models from Different Languages:

In the following, we show how to apply contrastive search on off-the-shelf language models of different languages.

4.4.1. Chinese Language Model:

import torch
# load SimCTG language model
from simctg.simctggpt import SimCTGGPT
model_name = r'uer/gpt2-chinese-cluecorpussmall'
model = SimCTGGPT(model_name)
model.eval()
tokenizer = model.tokenizer
eos_token = '[SEP]'
eos_token_id = tokenizer.convert_tokens_to_ids([eos_token])[0]

# Example 1
prefix_text = '苹果公司'
print ('Prefix is: {}'.format(prefix_text))
tokens = tokenizer.tokenize(prefix_text)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.LongTensor(input_ids).view(1,-1)

beam_width, alpha, decoding_len = 3, 0.6, 128
output = model.fast_contrastive_search(input_ids=input_ids, beam_width=beam_width, alpha=alpha, 
                                       decoding_len=decoding_len, end_of_sequence_token_id=eos_token_id,
                                       early_stop=True) 
print("Output:\n" + 100 * '-')
print(''.join(tokenizer.decode(output).split()))

Model Output:

Output:
----------------------------------------------------------------------------------------------------
苹果公司在中国市场推出的iphone7，不仅在外观设计上有所改变，在配置上也进行了升级。苹果还宣布，新一代iphone将采用5.7英寸
屏幕，分辨率达到2560×1440像素，显示效果非常出色。此外，该机还支持指纹识别功能，可实现手指快速扫描、人脸识别等功能。

# Example 2
prefix_text = '百节年为首，春节是中华民族最隆重的传统佳节。它不仅集中体现了中华'
print ('Prefix is: {}'.format(prefix_text))
tokens = tokenizer.tokenize(prefix_text)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.LongTensor(input_ids).view(1,-1)

beam_width, alpha, decoding_len = 3, 0.6, 128
output = model.fast_contrastive_search(input_ids=input_ids, beam_width=beam_width, alpha=alpha, 
                                       decoding_len=decoding_len, end_of_sequence_token_id=eos_token_id,
                                       early_stop=True) 
print("Output:\n" + 100 * '-')
print(''.join(tokenizer.decode(output).split()))

Model Output:

Output:
----------------------------------------------------------------------------------------------------
百节年为首，春节是中华民族最隆重的传统佳节。它不仅集中体现了中华文化精髓，也表现了人民群众生活水平的提高和对美好生活的向往。

4.4.2. Japanese Language Model:

import torch
# load SimCTG language model
from simctg.simctggpt import SimCTGGPT
model_name = r'colorfulscoop/gpt2-small-ja'
model = SimCTGGPT(model_name)
model.eval()
tokenizer = model.tokenizer
eos_token = tokenizer.eos_token
eos_token_id = tokenizer.convert_tokens_to_ids([eos_token])[0]

# prepare input
prefix_text = r'臥龍桜（がりゅうざくら）は、岐阜県高山市一之宮町にある一本桜。龍が地'
print ('Prefix is: {}'.format(prefix_text))
tokens = tokenizer.tokenize(prefix_text)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.LongTensor(input_ids).view(1,-1)

# generate result
beam_width, alpha, decoding_len = 5, 0.6, 128
output = model.fast_contrastive_search(input_ids=input_ids, beam_width=beam_width, alpha=alpha, 
                                       decoding_len=decoding_len, end_of_sequence_token_id=eos_token_id,
                                       early_stop=True) 
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output))

Model Output:

Output:
----------------------------------------------------------------------------------------------------
臥龍桜(がりゅうざくら)は、岐阜県高山市一之宮町にある一本桜。龍が地中に染みつく様子を図案化したもので、樹齢400年を越す日本
さくら名所100選に選定されている。一之宮町指定天然記念物。岐阜県飛騨地方(東濃地方)の山間地に生育し、約1万年前に絶滅したと
考えられている。「花の本」とも称され、開花期は5月上旬から下旬までで、桜の枝張りは濃緑色である。花は直径約10cmの花弁を咲か
せる八重咲きで、花弁の色は紅紫色で、雄しべは4本、雌しべは1本ある。雄しべの先

4.4.3. Korean Language Model:

import torch
# load SimCTG language model
from simctg.simctggpt import SimCTGGPT
model_name = r'skt/ko-gpt-trinity-1.2B-v0.5'
model = SimCTGGPT(model_name)
model.eval()
tokenizer = model.tokenizer
eos_token = tokenizer.eos_token
eos_token_id = tokenizer.convert_tokens_to_ids([eos_token])[0]

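# prepare input
# note: in a raw string (r'...') the backslashes before the quotes are kept as part of the prefix,
# which is why they also appear in the model output below
prefix_text = r'인간처럼 생각하고, 행동하는 \'지능\'을 통해 인류가 이제까지 풀지 못했던'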
print ('Prefix is: {}'.format(prefix_text))
tokens = tokenizer.tokenize(prefix_text)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.LongTensor(input_ids).view(1,-1)

# generate result
beam_width, alpha, decoding_len = 5, 0.6, 64
output = model.fast_contrastive_search(input_ids=input_ids, beam_width=beam_width, alpha=alpha, 
                                       decoding_len=decoding_len, end_of_sequence_token_id=eos_token_id,
                                       early_stop=True) 
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output))

Model Output:

Output:
----------------------------------------------------------------------------------------------------
인간처럼 생각하고, 행동하는 \'지능\'을 통해 인류가 이제까지 풀지 못했던 난제를 해결하려 한다. 이 책의 제목이기도 한 '슈퍼인텔리전스'는 
인공지능(AI)의 등장으로 야기된 사회 변화를 일컫는 말로, 이 책을 관통하는 키워드이기도 하다. 저자는 "기술과 인간 사이의 경계가 무너지고 
있다"고 지적한다. AI가 인간의 사고방식과 행동을 모방할 뿐만

4.5. Detailed Tutorial of Training SimCTG on Wikitext-103:

We also provide a comprehensive tutorial on how to reproduce our experiments on Wikitext-103 using the released package. Check it [here]!

4.6. Apply SimCTG on T5:

We also provide detailed tutorials on how to apply SimCTG and contrastive search to T5 models. For more details, please refer to [here] and [here].

5. Environment Setup: [Back to Top]

python version >= 3.6
pip3 install -r requirements.txt

❗❗❗ [Note] The following instructions were originally used in the experiments of our paper. We now provide an easy-to-use library that lets you implement SimCTG with just a few lines of code (the original code still works, of course). Check it out [here]!

6. Example Usage of Contrastive Search: [Back to Top]

6.1. Use SimCTG Pretrained on Wikipedia Corpus:

Here, we show how to generate text with contrastive search.

import torch
import sys
sys.path.append(r'./pretraining')
from simctg import SimCTGPretraining
# load SimCTG model pretrained on the large-scale Wikipedia corpus
model_path = r'cambridgeltl/simctg_english_wikipedia'
model = SimCTGPretraining(model_path)
model.eval()

# we randomly select a prefix from the dev set of Wikipedia pre-training corpus and prepare the text prefix input
text = r'Insect farming is the practice of raising and breeding insects as livestock, also referred to as minilivestock or micro stock. Insects may be farmed for the commodities'
tokens = model.tokenizer.tokenize(text)
input_ids = model.tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.LongTensor(input_ids).view(1,-1)

# use contrastive search to generate the result
beam_width, alpha, decoding_len = 5, 0.6, 128
eos_token = '<|endoftext|>'
print (model.fast_contrastive_search(input_ids, beam_width, alpha, decoding_len, eos_token))

Model Output:

Insect farming is the practice of raising and breeding insects as livestock, also referred to as minilivestock
or micro stock. Insects may be farmed for the  commodities they produce, such as honey, corn, sorghum, and 
other crops. In some cases, the production of insects is a way to increase income for the owner or his family. 
This type of farming has been described as "an economic system that benefits all people regardless of race, sex, 
or social status" (p.\xa09). A large number of farmers in North America, Europe, and South America have used the 
method of farming for food production in order to feed their families and livestock. The most common method of 
farming is by hand-cropping, which consists of cutting a hole in the ground and using a saw

More details on how to pre-train SimCTG on a large-scale corpus, and on the argument setup of contrastive search, can be found [here].
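As a rough illustration of what `beam_width` and `alpha` control, the sketch below walks through a single contrastive search selection step on toy data. It follows the scoring rule described in the paper: each of the `beam_width` most probable candidates is scored as `(1 - alpha) * probability - alpha * (maximum cosine similarity to the hidden states of the previous tokens)`. The function names `cosine` and `contrastive_step`, the candidate tuple format, and the two-dimensional vectors are hypothetical and chosen only for illustration; this is not the implementation behind `fast_contrastive_search`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_step(candidates, context_hidden, alpha):
    """Pick one token from the top-k candidates (hypothetical toy helper).

    candidates: list of (token, probability, hidden_vector) tuples for the
        beam_width most probable next tokens.
    context_hidden: hidden vectors of the tokens in the current context.
    alpha: weight of the degeneration penalty, between 0 and 1.
    """
    best_token, best_score = None, -math.inf
    for token, prob, hidden in candidates:
        # degeneration penalty: similarity to the closest previous token
        penalty = max(cosine(hidden, h) for h in context_hidden)
        score = (1.0 - alpha) * prob - alpha * penalty
        if score > best_score:
            best_token, best_score = token, score
    return best_token

# toy data (assumed values, not taken from any real model)
context = [[1.0, 0.0], [0.8, 0.6]]
candidates = [
    ("the", 0.6, [1.0, 0.0]),    # most probable, but identical to a context vector
    ("honey", 0.3, [0.0, 1.0]),  # less probable, but dissimilar to the context
]

print(contrastive_step(candidates, context, alpha=0.0))  # alpha = 0 reduces to greedy search
print(contrastive_step(candidates, context, alpha=0.6))  # the penalty shifts the choice
```

With `alpha = 0.0` the penalty is ignored and the most probable candidate is chosen; with `alpha = 0.6` the candidate that repeats the context is penalized and the less probable but more distinct one is selected.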

6.2. Use Off-the-shelf Language Models from Different Languages:

Importantly, we found that contrastive search can be directly applied to off-the-shelf language models even without contrastive training. The only condition is that
