Stable Diffusion


Stable Diffusion is an open-source model that holds a central place among image generation models.
It first came out in 2022 and was built by a company called Stability AI.

Setting aside the closed models from large companies, it holds a dominant position among open-source models. Its output quality is among the best in that group, and it is fast.

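It is free to use, but with a caveat: depending on the model version and its license, free use may be limited to non-commercial purposes, and commercial use may require a paid license. Check the license of the specific checkpoint before using it commercially.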




How it works

Stable Diffusion is broadly divided into three parts.
In execution order, they are CLIP, UNet, and VAE (Variational Autoencoder).

First, CLIP (Contrastive Language-Image Pre-training) takes the user's text and converts it into vector embeddings that the rest of the model can work with. The embedding dimension varies by model, but is typically 768 or 1024.
The vectors produced by CLIP are passed to the UNet, where they serve as the conditioning input for image generation.
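To make the shape of that output concrete, here is a toy sketch. It is not CLIP: the whitespace tokenizer and the hash-seeded `toy_token_vector` are hypothetical stand-ins for CLIP's real tokenizer and learned weights. It assumes the 77-token context length and 768-wide embeddings used by the CLIP text encoder in Stable Diffusion v1.x, so each prompt becomes a 77 x 768 table of numbers.

```python
import hashlib
import random

MAX_TOKENS = 77  # context length of the CLIP text encoder in SD v1.x
EMBED_DIM = 768  # embedding width mentioned above (some versions use 1024)


def toy_token_vector(token: str, dim: int = EMBED_DIM) -> list[float]:
    """Deterministic pseudo-embedding; a stand-in for CLIP's learned weights."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def toy_encode(prompt: str) -> list[list[float]]:
    """Split on whitespace, pad or truncate to MAX_TOKENS, embed each token."""
    tokens = prompt.lower().split()[:MAX_TOKENS]
    tokens += ["<pad>"] * (MAX_TOKENS - len(tokens))
    return [toy_token_vector(t) for t in tokens]


embedding = toy_encode("a serene mountain landscape at sunset")
print(len(embedding), len(embedding[0]))  # 77 768
```

The point is only that the UNet receives a fixed-size block of numbers regardless of how long the prompt is; the real encoder also makes each vector depend on the surrounding words.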

The UNet takes the vectors produced by CLIP and starts generating the image.
As with most neural networks, its behaviour cannot be described as a fixed procedure. Roughly speaking, it initialises "random noise" once and then loops, reducing the noise a little on every iteration.
Each noise-reduction step follows patterns learned during training and is guided by the text vectors. The loop runs for a set number of steps (num_inference_steps in the code later in this post), and the resulting low-noise output is handed to the next stage, the VAE.

Instead of drawing on a blank sheet of paper the way a person would, it is like finishing a picture on a messy scribble board using an eraser and a pencil.
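The loop described above can be sketched in a few lines. This is a toy under loud assumptions: `toy_predict_noise` is a hypothetical stand-in for the UNet that simply measures the distance to a known target, whereas the real UNet has no target and predicts noise from what it learned in training plus the text vectors. The step-size rule is also made up for illustration; in a real pipeline a scheduler decides how much noise to remove at each step.

```python
import random


def toy_predict_noise(latent, target):
    """Stand-in for the UNet: the 'noise' is the distance from a known target."""
    return [x - t for x, t in zip(latent, target)]


def toy_denoise(target, num_steps=20, seed=0):
    rng = random.Random(seed)
    # 1. Initialise the latent once with pure random noise.
    latent = [rng.gauss(0.0, 1.0) for _ in target]
    # 2. Run a fixed number of steps, removing part of the predicted noise each time.
    for step in range(num_steps):
        noise = toy_predict_noise(latent, target)
        remaining = num_steps - step
        latent = [x - n / remaining for x, n in zip(latent, noise)]
    return latent


target = [0.5, -0.25, 1.0, 0.0]  # hypothetical "clean" latent the prompt points to
result = toy_denoise(target)
error = max(abs(r - t) for r, t in zip(result, target))
print(f"max error after denoising: {error:.6f}")
```

Note that the loop count is fixed up front rather than decided by inspecting the result, which matches how a step count is passed to the pipeline.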

The VAE (Variational Autoencoder) takes the UNet's output and converts it into an actual image that can be saved as, for example, a PNG.
To give a concrete example, when the UNet hands a compressed 64×64×4 latent representation to the VAE, the VAE decoder converts it into a real 512×512×3 pixel image.
The data the UNet works on is much smaller than the final image, so some detail may be lost, but this structure makes generation much faster.
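A quick bit of arithmetic shows how much smaller the latent is, using only the two shapes given above. The helper name `num_values` is just for illustration.

```python
def num_values(height: int, width: int, channels: int) -> int:
    """Count the numbers stored in a height x width x channels array."""
    return height * width * channels


pixel_values = num_values(512, 512, 3)  # final RGB image
latent_values = num_values(64, 64, 4)   # what the UNet actually processes
print(pixel_values, latent_values, pixel_values // latent_values)
# 786432 16384 48
```

So each denoising step touches roughly 48 times fewer values than it would if it ran directly on pixels, with each side shrunk by a factor of 8.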




Trying it locally

Because Stable Diffusion is an open model, anyone can download it and run it freely.
It runs without a GPU too, but it is slow: generating one image takes on the order of minutes.
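If a script should work on both kinds of machine, the device choice can be isolated in a small helper. This is a sketch, not part of the script in this post: `pick_device_and_dtype` is a hypothetical name, and it takes the GPU check as a plain boolean so it runs without PyTorch installed. In a real script the flag would typically come from `torch.cuda.is_available()`.

```python
def pick_device_and_dtype(cuda_available: bool) -> tuple[str, str]:
    """Return (device, dtype name) to use when loading the pipeline."""
    if cuda_available:
        return "cuda", "float16"  # half precision is a common choice on GPUs
    return "cpu", "float32"  # on CPU, stay with full precision


print(pick_device_and_dtype(False))  # ('cpu', 'float32')
```

The returned dtype name would then be mapped to `torch.float16` or `torch.float32` when loading the model.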

Let's run it in a simple way using uv and Python.
The required dependencies are as follows, written as a uv pyproject.toml.

[project]
name = "just-test"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
    "diffusers>=0.30.0",
    "torch>=2.0.0",
    "transformers>=4.30.0",
    "accelerate>=0.20.0",
    "safetensors>=0.3.0",
    "pillow>=10.0.0",
]

The code is as follows.

from diffusers import StableDiffusionPipeline
import torch
from datetime import datetime


def main():
    print("=== Stable Diffusion CPU test ===\n")

    print("1. Loading model... (the first run downloads 4-5 GB)")
    print("   Using lightweight model: runwayml/stable-diffusion-v1-5")

    # Load the model for CPU.
    # Note: if this repo ID is no longer served, a mirror may be needed
    # (assumption: "stable-diffusion-v1-5/stable-diffusion-v1-5").
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float32,  # use float32 on CPU
        safety_checker=None,  # disable the safety checker to speed things up
        # A public model needs no token, so the older use_auth_token argument is omitted.
    )
    pipe = pipe.to("cpu")

    # Memory optimization
    pipe.enable_attention_slicing()

    print("✓ Model loaded!\n")

    # Prompt
    prompt = "a serene mountain landscape at sunset, beautiful colors, digital art, highly detailed"

    print("2. Generating image...")
    print(f"   Prompt: {prompt}")
    print("   Resolution: 512x512")

    start_time = datetime.now()

    # Generate the image
    image = pipe(
        prompt,
        height=512,
        width=512,
        num_inference_steps=20,  # reduced for speed
        guidance_scale=7.5,
    ).images[0]

    end_time = datetime.now()
    elapsed = (end_time - start_time).total_seconds()

    # Save
    output_filename = f"output_{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
    image.save(output_filename)

    print("✓ Done!")
    print(f"   Elapsed: {elapsed:.1f} s ({elapsed / 60:.1f} min)")
    print(f"   Saved to: {output_filename}")


if __name__ == "__main__":
    main()

Running it with a prompt like the one above takes a little while, and then it saves the generated image as a PNG file in the current directory.



References
https://aws.amazon.com/ko/what-is/stable-diffusion/
https://medium.com/@onkarmishra/stable-diffusion-explained-1f101284484d
https://en.wikipedia.org/wiki/Stable_Diffusion