Forked from NVlabs/Sana

Waifu Image Synthesis with Linear Diffusion Transformer


⚡️Waifu: Efficient High-Resolution Waifu Synthesis


Training in progress!

[training logs]

Prompt: 1girl, solo, animal ears, bow, teeth, jacket, tail, open mouth, brown hair, orange background, bowtie, orange nails, simple background, cat ears, orange eyes, blue bow, animal ear fluff, cat tail, looking at viewer, upper body, shirt, school uniform, hood, striped bow, striped, white shirt, black jacket, blue bowtie, fingernails, long sleeves, cat girl, bangs, fangs, collared shirt, striped bowtie, short hair, tongue, hoodie, sharp teeth, facial mark, claw pose

19.12: [sample images]

20.12: [sample images]

Money burned so far: ~$1,000. Please let us know if you can contribute money or GPU time toward training an open-source waifu model. Contact: recoilme

💡 Introduction

tl;dr: we just need a model to generate waifu.

We introduce Waifu, a text-to-image framework that can efficiently generate images at up to 768 × 768 resolution in 80+ languages. Our goal was to create a small model that is easy to fully fine-tune on a consumer GPU, without compromising on quality. It is like SD 1.5, but developed in 2024 using the most advanced components currently available. Waifu can synthesize high-resolution, high-quality waifu images with strong text-image alignment at remarkably fast speed, and it is deployable on a laptop GPU.

Core designs include:

(1) AuraDiffusion/16ch-vae: a fully open-source 16-channel VAE, natively trained in fp16.
(2) Linear DiT: a 1.6B-parameter diffusion transformer (DiT) with linear attention.
(3) MEXMA-SigLIP: a model that combines the MEXMA multilingual text encoder with an image encoder from SigLIP, giving us a high-performance CLIP-style model for 80 languages (see the loading sketch after this list).
(4) Other: we use the Flow-Euler sampler, an Adafactor-fused optimizer, and bf16 precision for training, and we combine efficient caption labeling (MoonDream, CogVLM) with danbooru tags to accelerate convergence.
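
For illustration, here is a minimal sketch of loading the 16-channel VAE with diffusers. It assumes the AuraDiffusion/16ch-vae checkpoint is published in AutoencoderKL format and uses the usual 8× spatial downsampling; check the model card before relying on either assumption.

    import torch
    from diffusers import AutoencoderKL

    # Load the open 16-channel VAE (assumption: diffusers AutoencoderKL format).
    vae = AutoencoderKL.from_pretrained(
        "AuraDiffusion/16ch-vae", torch_dtype=torch.float16
    ).to("cuda")

    # Round-trip a dummy 768x768 RGB image through the 16-channel latent space.
    image = torch.randn(1, 3, 768, 768, dtype=torch.float16, device="cuda")
    with torch.no_grad():
        latents = vae.encode(image).latent_dist.sample()  # expected: (1, 16, 96, 96)
        recon = vae.decode(latents).sample                # expected: (1, 3, 768, 768)
    print(latents.shape, recon.shape)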

As a result, Waifu-2b is very competitive with modern giant diffusion models (e.g. Flux-12B) while being 20× smaller and 100+× faster in measured throughput. Moreover, Waifu-2b can be deployed on a 16GB laptop GPU and takes less than 1 second to generate a 768 × 768 image. Waifu enables waifu creation at low cost.
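
As an illustration, here is a minimal text-to-image sketch. It assumes this fork keeps the SanaPipeline API from upstream Sana's app/sana_pipeline.py; the config and checkpoint names are taken from the Example below and may differ in your setup.

    import torch
    from torchvision.utils import save_image
    from app.sana_pipeline import SanaPipeline  # assumption: upstream Sana's pipeline

    # Config and checkpoint as used in the Example section below.
    pipe = SanaPipeline("configs/sana_config/576ms/waifu-2b-576.yaml")
    pipe.from_pretrained("waifu-2b-v01.pth")

    image = pipe(
        prompt="1girl, solo, cat ears, orange eyes, school uniform",
        height=768,
        width=768,
        guidance_scale=5.0,
        num_inference_steps=18,
        generator=torch.Generator(device="cuda").manual_seed(42),
    )
    save_image(image, "waifu.png", normalize=True, value_range=(-1, 1))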

Example: setup and training

    # environment setup
    apt update
    git clone https://github.com/recoilme/waifu
    cd waifu/
    pip install -e .
    pip install flash-attn --no-build-isolation
    cd ..

    # prepare dataset buckets (--h prints the available options;
    # a toy bucketing sketch follows this block)
    python waifu/train_scripts/make_buckets_new.py --h
    python waifu/train_scripts/make_buckets_new.py --config_path waifu/configs/sana_config/576ms/waifu-2b-576.yaml --load_from waifu-2b-v01.pth
    cd waifu

    # check GPUs and configure accelerate
    nvidia-smi
    accelerate config

    # launch training in the background and follow the log
    nohup accelerate launch train_scripts/train_waifu.py --config configs/sana_config/576ms/waifu-2b-576.yaml --name 33 --load_from /workspace/waifu-2b-v01.pth &
    tail -f nohup.out
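
make_buckets_new.py is not documented here; as a rough, hypothetical illustration, dataset bucketing usually means assigning each image to the predefined resolution whose aspect ratio is closest, so batches can be formed from a single bucket. A toy sketch (the bucket sizes are made up, not taken from the script):

    from collections import defaultdict

    # Hypothetical bucket resolutions around a ~576px budget; the real
    # script's buckets come from its config, not from this list.
    BUCKETS = [(576, 576), (512, 640), (640, 512), (448, 704), (704, 448)]

    def nearest_bucket(w: int, h: int) -> tuple[int, int]:
        """Pick the bucket whose aspect ratio is closest to the image's."""
        ar = w / h
        return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ar))

    def group_by_bucket(sizes):
        """Map each image index to its nearest bucket resolution."""
        groups = defaultdict(list)
        for i, (w, h) in enumerate(sizes):
            groups[nearest_bucket(w, h)].append(i)
        return dict(groups)

    print(group_by_bucket([(1024, 1024), (832, 1216), (1216, 832)]))
    # {(576, 576): [0], (448, 704): [1], (704, 448): [2]}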

// AiArtLab team
