Starred repositories
A Datacenter-Scale Distributed Inference Serving Framework
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
Efficient and easy multi-instance LLM serving
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
A lazy, high-throughput, and blazing-fast structured text generation backend.
Efficient LLM Inference over Long Sequences
The simplest online-softmax notebook for explaining Flash Attention (see the sketch after this list).
KV cache compression for high-throughput LLM inference
A throughput-oriented high-performance serving framework for LLMs
DSPy: The framework for programming—not prompting—language models
Deploy high-performance AI models and inference pipelines on FastAPI with built-in batching, streaming and more.
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
A lightweight UI for interfacing with the Zoo Text-to-CAD API, built with SvelteKit.
Generate music based on natural language prompts using LLMs running locally
SGLang is a fast serving framework for large language models and vision language models.
Unified KV Cache Compression Methods for Auto-Regressive Models
A framework for few-shot evaluation of language models.
Universal LLM Deployment Engine with ML Compilation
Tools for merging pretrained large language models.
A deployment, monitoring, and autoscaling service for serverless LLM serving.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
[NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLM inference, approximates attention with dynamic sparse computation, reducing inference latency by up to 10x for pre-filling on an …
Code for Husky, an open-source language agent that solves complex, multi-step reasoning tasks. Husky v1 addresses numerical, tabular, and knowledge-based reasoning.
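As a companion to the online-softmax notebook starred above: a minimal sketch of the one-pass (online) softmax recurrence that Flash Attention builds on. The function name and the NumPy setup are illustrative assumptions, not code taken from that repository.

```python
import numpy as np

def online_softmax(x):
    """One-pass softmax: keep a running maximum m and a running
    normalizer d, so the input can be consumed as a stream.
    This is the recurrence Flash Attention uses to avoid
    materializing the full attention row before normalizing.
    (Illustrative sketch; names are this example's, not the repo's.)
    """
    m = float("-inf")  # running maximum seen so far
    d = 0.0            # running sum of exp(x_i - m)
    for xi in x:
        m_new = max(m, xi)
        # Rescale the old sum to the new maximum before adding the new term;
        # np.exp(-inf) == 0.0, so the first iteration is handled cleanly.
        d = d * np.exp(m - m_new) + np.exp(xi - m_new)
        m = m_new
    x = np.asarray(x, dtype=float)
    return np.exp(x - m) / d

# Matches the standard two-pass (max-subtracted) softmax.
x = np.random.randn(16)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), ref)
```

The key step is the rescaling factor exp(m - m_new): whenever a new maximum appears, the accumulated normalizer is re-expressed relative to it, so the single pass yields exactly the numerically stable softmax.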