llm-play is a tool for UNIX environments that automates querying multiple LLMs with multiple prompts and generating multiple responses. It helps extract answers from these responses, partition the answers into equivalence classes, and run simple experimental pipelines. It can save results as a filesystem tree, or as CSV or JSON files.
Install the tool from PyPI:
pip install llm-play
Configure API providers and models interactively (with settings editable in ~/.llm_play.yaml):
llm-play --add-provider
llm-play --add-model
An LLM can be queried using an argument, a specified prompt file, standard input (stdin), or text entered through an editor:
llm-play "What is the capital of China?"
llm-play --prompt prompt.md
llm-play < prompt.md
llm-play -e # enter prompt in system's $EDITOR
In all these cases, the response is printed on stdout, and can be redirected to a file:
llm-play "What is the capital of China?" > output.md
Default settings such as the model and its temperature can be configured interactively with -c/--configure (with settings editable in ~/.llm_play.yaml):
llm-play -c
Command-line options take precedence over the default settings. --version prints the version; --help prints the help message.
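For example, the configured defaults can be overridden for a single query (a sketch; the model name and temperature are illustrative):
llm-play "What is the capital of China?" --model qwen2.5-7b-instruct -t 0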
When the number of models, prompts, or responses exceeds one, the tool operates in batch mode. For example, to sample 10 responses from two models (qwen2.5-7b-instruct and qwen2.5-coder-7b-instruct) with a temperature of 0.5, use the command:
llm-play --prompt prompts/question1.md \
--model qwen2.5-7b-instruct qwen2.5-coder-7b-instruct \
-t 0.5 \
-n 10
In batch mode, a short summary of responses will be printed on stdout:
Model               │ Temp. │ Label     │ Hash       │ ID   │ Class │ Content
────────────────────┼───────┼───────────┼────────────┼──────┼───────┼────────
qwen2.5-7b-instruct │ 0.5   │ question1 │ 4ae91f5... │ 0    │ 0     │ "It ...
qwen2.5-7b-instruct │ 0.5   │ question1 │ 4ae91f5... │ 1    │ 1     │ "It ...
qwen2.5-7b-instruct │ 0.5   │ question1 │ 4ae91f5... │ 2    │ 2     │ "It ...
...
In this table, question1 is the prompt label, and 4ae91f5bd6090fb6 is its SHAKE128 length=8 hash. Prompts with repeating hashes are skipped. The Class column displays the IDs of equivalence classes of responses (see Partitioning).
To store results, the output needs to be specified with --output. For example, --output samples will save the results in the following filesystem tree:
samples
├── qwen2.5-7b-instruct_0.5
│   ├── question1_4ae91f5bd6090fb6.md
│   └── question1_4ae91f5bd6090fb6
│       ├── 0_0.md
│       ...
│       └── 9_9.md
└── qwen2.5-coder-7b-instruct_0.5
    ├── question1_4ae91f5bd6090fb6.md
    └── question1_4ae91f5bd6090fb6
        ├── 0_0.md
        ...
        └── 9_9.md
In this tree, question1_4ae91f5bd6090fb6.md contains the prompt; 0_0.md, ..., 9_9.md are the samples. In 5_3.md, 5 is the sample identifier, and 3 is the identifier of its equivalence class. The sample file extension can be specified using the --extension option, e.g. --extension py.
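For instance, sampling Python programs and saving them with a .py extension could be combined as follows (a sketch; the prompt file name and sample count are illustrative):
llm-play --prompt catalan.md -n 5 --code --extension py --output programs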
The data can also be stored in CSV and JSON formats (see Data Formats).
Multiple prompt files can be specified as inputs, e.g. using all *.md files in the current directory:
llm-play --prompt *.md --output samples
When the argument of --prompt is a directory, all *.md files are loaded from this directory non-recursively. If the query originates from a file, the prompt will adopt the file's name (excluding the extension) as its label. When a query is supplied through stdin or as a command-line argument, the label is empty.
Multiple outputs can be specified at the same time, e.g.
--output samples samples.json
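For instance, a single run can both populate a filesystem tree for inspection and produce a JSON file for sharing (a sketch; the prompt file and sample count are illustrative):
llm-play --prompt question1.md -n 10 --output samples samples.json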
Data transformation can be used, for example, to extract relevant information from the generated samples or from data extracted in earlier stages. The following command extracts text within the tag <answer> ... </answer> from all samples in samples, and saves the results into the directory extracted:
llm-play --map samples \
--function __FIRST_TAGGED_ANSWER__ \
--output extracted
The above function searches for text wrapped with <answer> and </answer> and prints only the content inside the tags.
Transformation is performed by either builtin functions or shell commands. The builtin function __ID__ simply returns the entire string without modification. The builtin function __FIRST_TAGGED_ANSWER__ returns the first occurrence of a string wrapped in the tag <answer></answer>. The builtin function __FIRST_MARKDOWN_CODE_BLOCK__ extracts the content of the first Markdown code block.
Functions defined through shell commands should use the shell template language. For example, this counts the number of characters in each response:
--function 'wc -m < %%ESCAPED_DATA_FILE%%'
A transformation of a datum fails iff the function terminates with a non-zero exit code; in this case, the datum is ignored. Thus, shell commands can also be used for data filtering. For example, this filters out responses longer than 50 characters:
--function '(( $(wc -m < %%ESCAPED_DATA_FILE%%) <= 50 )) && cat %%ESCAPED_DATA_FILE%%'
Answers can also be extracted by LLMs. For example, this function checks whether a previously received response is affirmative:
--function "llm-play '<answer>'%%CONDENSED_ESCAPED_DATA%%'</answer>. Is this answer affirmative? Respond Yes or No.' --model qwen2.5-72b-instruct --answer"
Data can be extracted on-the-fly while querying LLMs if --function is explicitly provided:
llm-play "Name a city in China. Your answer should be formatted like **CITY NAME**" \
--function "grep -o '\*\*[^*]*\*\*' %%ESCAPED_DATA_FILE%% | head -n 1 | sed 's/\*\*//g'"
There are convenience options to simplify extracting answers or code. The option --answer automatically augments the prompt and applies the necessary transformation to extract the relevant parts of the response:
llm-play "${QUESTION}" --answer
is equivalent to
llm-play "${QUESTION} Wrap the final answer with <answer></answer>."" --function __FIRST_TAGGED_ANSWER__
The option --code extracts the first code block from a Markdown-formatted response.
llm-play "Write a Python function that computes the n-th Catalan number" --code
is equivalent to
llm-play "Write a Python function that computes the n-th Catalan number" --function __FIRST_MARKDOWN_CODE_BLOCK__
In on-the-fly mode, the transformation options selected with -c are ignored.
Responses can be grouped into equivalence classes based on a specified binary relation. The equivalence relation used for partitioning can be customized via the option --relation. An equivalence relation is defined via a builtin function or a shell command. The builtin relation __ID__ checks if two answers are syntactically identical. The builtin relation __TRIMMED_CASE_INSENSITIVE__ ignores trailing whitespace and is case-insensitive. A relation defined via a shell command holds iff the command exits with the zero status code. For example, this groups answers into equivalence classes based on a judgement from the qwen2.5-72b-instruct model:
--relation "llm-play 'Are these two answers equivalent: <answer1>'%%CONDENSED_ESCAPED_DATA1%%'</answer1> and <answer2>'%%CONDENSED_ESCAPED_DATA2%%'</answer2>?' --model qwen2.5-72b-instruct --predicate"
Partitioning can be performed either locally, i.e. for responses associated with the same (model, prompt) pair, using the option --partition-locally, or globally, i.e. across all responses, using the option --partition-globally. For example, this partitions data using a custom relation defined in a Python script:
llm-play --partition-globally data \
--relation 'python custom_equivalence.py %%ESCAPED_DATA_FILE1%% %%ESCAPED_DATA_FILE2%%' \
--output classes
When partitioning is performed, the existing equivalence classes are ignored.
Additionally, the option -c can be used to select a predefined relation when using the --partition-* options.
A global partitioning w.r.t. the relation __ID__ is performed on-the-fly during LLM sampling.
Predicates are special on-the-fly boolean evaluators. For example, this command acts as a predicate over $CITY:
llm-play "Is $CITY the capital of China?" --predicate
It first extracts the answer to this question with
llm-play "Is $CITY the capital of China? Respond Yes or No." --answer
If the answer is equivalent to Yes w.r.t. __TRIMMED_CASE_INSENSITIVE__, it exits with the zero status code. If the answer is equivalent to No, it exits with the code 1. If the answer is neither Yes nor No, it exits with the code 2.
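Because the result is reported through the exit status, a predicate can be used directly in shell conditionals. A minimal sketch, assuming CITY is set in the environment:
if llm-play "Is $CITY the capital of China?" --predicate; then
    echo "affirmative"
else
    echo "negative or unclear"  # exit code 1 (No) or 2 (neither Yes nor No)
fi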
The output of a command with --predicate cannot be exported with --output. Predicates can only be applied to commands with a single model/prompt/response.
Data can be written using the --output option, or read using the --map and --partition-* options, in the following three formats: FS_TREE (filesystem tree), JSON and CSV. The format is determined by the argument of the above options, which is treated as a directory path unless it ends with .json or .csv. Here is a comparison between these formats:
|                  | FS_TREE           | JSON                | CSV           |
|------------------|-------------------|---------------------|---------------|
| Intended use     | Manual inspection | Storage and sharing | Data analysis |
| Store prompts?   | Yes               | Yes                 | No            |
| Store responses? | Yes               | Yes                 | Truncated     |
| Store metadata?  | File extension    | File extension      | No            |
FS_TREE enables running commands for a subset of data, e.g.
llm-play --partition-locally data/qwen2.5-7b-instruct_1.0/a_4ae91f5bd6090fb6 \
--relation __TRIMMED_CASE_INSENSITIVE__ \
--output classes
When data exported into CSV is truncated, the corresponding column name is changed from Sample Content to Sample Content [Truncated]. A CSV with Sample Content [Truncated] cannot be used as an input to --map and --partition-*.
To convert between different formats, a transformation with an identity function can be used:
llm-play --map data --function __ID__ --output data.json
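The same identity transformation should also work in the opposite direction, e.g. unpacking a JSON file back into a filesystem tree (a sketch based on the format rules above; the directory name is illustrative):
llm-play --map data.json --function __ID__ --output data_tree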
The shell template language allows dynamic substitution of specific placeholders with runtime values: each placeholder is instantiated and replaced with its corresponding value before the command is executed by the system shell.
Available placeholders for data:
- %%CONDENSED_ESCAPED_DATA%% - the single-lined, stripped, truncated and shell-escaped text.
- %%ESCAPED_DATA%% - the shell-escaped text.
- %%CONDENSED_DATA%% - the single-lined, stripped, truncated text.
- %%RAW_DATA%% - the original text.
Similarly, RAW_, ESCAPED_, CONDENSED_ and CONDENSED_ESCAPED_ variants are provided for the following variables:
- %%PROMPT%% - the prompt content.
The ESCAPED_ variants are provided for the following variables:
- %%DATA_FILE%% - a path to a temporary file containing the data.
- %%DATA_ID%% - a unique ID associated with the datum, i.e. <model>_<temperature>_<prompt hash>_<sample id>_<class id>.
- %%PROMPT_FILE%% - a path to a temporary file containing the prompt.
- %%PROMPT_LABEL%% - the prompt label.
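As an illustration of the naming convention, a function could prefix each transformed datum with its ID; this sketch assumes the %%ESCAPED_DATA_ID%% variant expands to the shell-escaped ID as described above:
--function 'printf "%s: " %%ESCAPED_DATA_ID%%; cat %%ESCAPED_DATA_FILE%%'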
For equivalence relation commands, which require multiple arguments, the data and prompt placeholders are indexed, e.g. %%RAW_DATA1%% and %%PROMPT_LABEL2%%.
[WIP] The option --debug prints detailed logs on stderr.
[WIP] To continue an incomplete/interrupted experiment, use --continue instead of --output.
llm-play --prompt *.md --continue samples
It will skip all tasks for which there is already an entry in the store. These entries are identified by prompt hashes, not their labels. In contrast with --output, --continue can only be used with a single output.
[WIP] To execute jobs in parallel using 5 workers, use --parallel 5
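For example, parallel sampling might be launched as follows (a sketch; the prompt files and sample count are illustrative):
llm-play --prompt *.md -n 10 --parallel 5 --output samples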