
Personal tuning tips


Apr 1, 2025



Random tuning tips

At this point, this is a place to collect random one-line reminders to myself about tuning that other people may also find useful. I'm not an expert; I'm just learning as I go.

Notes on loss rate statistics

  • Tuning for loss is not about individual spikes! Find what gives an overall lower curve compared to a similar run, over 1 epoch or something similar.

  • If you just set a stupidly low learning rate, you may get a stupidly low loss. So low loss stats are not the be-all-end-all goal.
    (That being said, oftentimes changing the LR doesn't seem to affect the loss curve.)

  • The general shape of the curve is going to be the same across all runs that use the same training image dataset. In fact, if you stick with the same model, even changing the dataset completely may do nothing to the loss curve. So if you find a dataset that DOES drastically change it: Pay Attention to what changed!!!
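
A minimal sketch of that kind of comparison: smooth each run's logged losses (TensorBoard-style EMA smoothing) and compare the overall level rather than any single spike. The run values below are made up purely for illustration:

```python
# Minimal sketch: compare two runs' loss logs by their smoothed curves,
# not by individual spikes. The loss lists are hypothetical stand-ins
# for whatever your trainer logs (one value per step).

def smooth(losses, alpha=0.1):
    """Exponential moving average, the same smoothing TensorBoard uses."""
    out, ema = [], losses[0]
    for x in losses:
        ema = alpha * x + (1 - alpha) * ema
        out.append(ema)
    return out

def mean_smoothed_loss(losses, alpha=0.1):
    s = smooth(losses, alpha)
    return sum(s) / len(s)

# Two fake runs: run_b has a big spike but a lower overall curve.
run_a = [0.30, 0.29, 0.28, 0.28, 0.27, 0.27, 0.26, 0.26]
run_b = [0.26, 0.25, 0.24, 0.90, 0.23, 0.22, 0.21, 0.20]

better = "run_b" if mean_smoothed_loss(run_b) < mean_smoothed_loss(run_a) else "run_a"
```

Despite the spike, run_b wins the comparison, which is the point: judge the whole curve.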

Size does matter (Dataset)

I am currently working with two main datasets of interest: a 150-image one and a 600-image one, with the LION optimizer.

They are of similar content, and I am using them with the same training tags.

Both sets did quite well at a learning rate of 1e-06. However, when I tried to push it higher, the 600-image one continued through, while the 150-image one started getting a little... odd.

Learning rates

LoRa vs Finetune

Note that if you want to compare the results of training your dataset as a LoRA vs a full finetune... they tend to want completely different LR values. For example, my current dataset likes LR=3e-06 for finetune, but LR=5e-05 for LoRA.

Constant vs Linear vs Cosine schedulers

Non-constant schedulers adjust the learning rate over a number of steps. How much variance there is, and how fast it changes, depends on your settings.

  • Remember that "cosine with hard restart" has a completely different LR curve than regular cosine. (In the graph, red=warm restart, blue=hard restart.)

  • Both Linear and Cosine (with 1 learning-rate cycle) aim to scale the learning rate to ZERO at the end of your defined epochs. So if you only want a run of 10 epochs, but don't want to waste training time with LR close to 0, you may want to tell your training software you want 30+ epochs, then stop it at 10.
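
A sketch of that trick using the standard single-cycle cosine-annealing formula (no warmup assumed): scheduling for exactly 10 epochs drives LR to zero at epoch 10, while scheduling for 30 and stopping at 10 leaves LR at 75% of base.

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Plain cosine annealing from base_lr down to 0 over total_steps."""
    return base_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))

base_lr = 1e-6

# Scheduled for exactly 10 epochs: LR hits 0 right at the end.
lr_end_10 = cosine_lr(10, 10, base_lr)

# Scheduled for 30 epochs but stopped at epoch 10: LR is still 75% of base,
# so the last epochs you actually run still do meaningful work.
lr_stop_at_10 = cosine_lr(10, 30, base_lr)
```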

Effects of Learning Rate on LION

Doing extensive comparisons with my "Anime Style" dataset, Juggernaut, and the LION optimizer, I noticed the following:

  1. LION converged strongly, unlike many other combinations.

  2. With my dataset, I didn't need or want to use EMA below LR=1e-06.
    At or above 1e-06, EMA is desirable.

  3. Increasing LR improved the anime style effect on the main subject up until around 1e-06, and maybe 8000 steps, but then nothing much changed even up through 1e-05.
    (edit: for some subjects, anyway. Certain other prompts didn't take on the anime style fully until I pushed LR to at least 2e-06.)

  4. Increasing LR from 1e-06 to 1e-05 (with EMA!!) didn't change much about the main subject... but it DID bring in more of the background in anime style.

Optimizer comparisons

  • Different "families" of optimizers need different learning rates

  • TIGER is the best I have tried so far for general-case style tuning (2024/08/14)

Prodigy optimizer vs others

In many (but not all) situations, Prodigy calculates its magic so that it "converges" in the number of total steps you define up front. For example, right now I am doing LoRA tuning with 50 epochs.

Things don't truly come together as I expect until the final few epochs. If I change it to 100 epochs... it still doesn't reach the optimal state until perhaps epoch 90.

In contrast, some other optimizers reach a state that looks fairly good, somewhere in the middle of the run... and then stay fairly close to whatever that is for the rest of the run.

Optimizer tips

  • In theory, you can use dadapt-adam to work out the optimal learning rate for adam. Similarly, you can use dadapt-lion for LION values. Notes:

    • Use Constant scheduler when you are doing this

    • Use Tensorboard to figure out what rate they eventually adapt to

    • You need to run at least a few thousand steps for it to really get adaptive. The longer you let it run, the more dialed-in the LR value will be

    • If you take the final LR value and plug it into the non-adaptive version, you will not get the same results!!
      It seems like the adaptive versions do a little more magic somehow. Or perhaps it is because the LR is not constant over the whole run. The main takeaway:
      the final LR value is just a quick-n-easy way to find an initial value for experimentation with the non-adaptive version of the optimizer.

    • If you change from adaptive to non-adaptive, you may benefit from enabling either "weight decay" in the optimizer settings, or adding on EMA. Otherwise, the tail end of the training run may get messy.

EMA use and tuning

  • The best EMA update step is probably 5%-15% of the number of steps in an epoch, IF YOU ARE DOING SMALL-SCALE TRAINING ONLY.

  • Changing batch size may require changing the EMA update step number

  • If batch size > 1, you probably need EMA scaling. Trainer doesn't have EMA scaling? Then don't increase batch size.
    cf: https://arxiv.org/abs/2307.13813
    (New theory... maybe halve the update step when you double the batch size?)

  • WARNING: EMA may take away fine details? (at least with bf16 and early steps...?) But it seems like it may still be better than large batch sizes, for getting more details in? Test on your own dataset to be sure.

Theory of scaling EMA step value vs batch size:

Optimal value may be roughly "100 images". So, try dividing 100/batchsize, and then play around a bit.

# Guesstimates...
batch 2  = 50
batch 4  = 25
batch 8  = 12-13
batch 16 = 6-7
batch 32 = 3-4
batch 64 = 2-3

Beyond that, it probably requires some EMA tuning beyond just adjusting update step count.
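
The guesstimate table above boils down to a one-liner (a rough starting point only, per the caveats above; the function name is my own):

```python
def ema_update_step(batch_size, images_per_update=100):
    """Rough starting guess for the EMA update step:
    one EMA update per ~100 images seen, so 100 / batch_size."""
    return max(1, round(images_per_update / batch_size))

# ema_update_step(2) -> 50, ema_update_step(4) -> 25, matching the table.
```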

See also https://civitai.com/articles/7521/my-saga-of-fine-tuning-sdxl-models#ema-vs-epochs-odjp68ge4 (section "EMA update step vs epochs") for a specific example of what happens when you tweak this value.

BF16 vs float32

  • A bf16 model is "good enough" for 1k x 1k images most of the time. And it's 4x faster to train.

  • That being said, for BEST results... play around with your training dataset for multiple iterations using bf16 training. But once you have nailed down your dataset... you may want to train a "final" version in float32.
    Even if you save the end result in bf16 anyway... you may lose some fine details unless you train in float32.

  • Sometimes it even seems worth it to import a model in bf16, then train in f32. Current ongoing experiments to train a custom anime style seem to benefit from this.

  • For SDXL training on 24GB VRAM, I can use "default" attention for bf16.
    For float32, however, I have to use "xformers" attention, or it won't fit.

Loss function, BF16, ...

ChatGPT o3 suggests the following for general-case, large-scale finetunes in 2025 on an RTX 4090.
(Note the use of batch_size 64!)

loss_type:        min_snr        # activates Min‑SNR
snr_gamma:        5.0            # HF & paper default; try 2‑3 if you see over‑sharpening
prediction_type:  epsilon        # keep SD 1.5 default
precision:        bf16           # bf16 parameters, fp32 fallback for matmuls
grad_scaler:      enabled        # avoids bf16 underflow
batch_size:       64
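
The `loss_type: min_snr` / `snr_gamma` lines enable Min-SNR loss weighting. As I understand the Min-SNR paper, for epsilon-prediction (which the config keeps) the per-timestep loss weight clamps the signal-to-noise ratio at gamma, then normalizes by SNR. A minimal sketch (function name is my own, not from any trainer):

```python
def min_snr_weight(snr, gamma=5.0):
    """Min-SNR loss weight for epsilon-prediction, as I understand the
    Min-SNR paper: min(SNR, gamma) / SNR. High-SNR (low-noise) timesteps
    get down-weighted; low-SNR timesteps keep full weight."""
    return min(snr, gamma) / snr

# A low-noise timestep with SNR=20 gets weight 0.25; SNR=2 keeps weight 1.0.
```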

Batch size

  • Theory: larger batch sizes are good for TIGER and LION... but only if you can have an effective step count per epoch of 100 or more. (So batch=16 is only good if you have 1600+ images.)

  • If you move from a low batch size (e.g. 1) to higher, you may need to increase the LR.
    Similarly, if you move from batch=8 to batch=1, you may need to lower the LR to see similar results.

  • As mentioned in the EMA section, beware when you increase batch size if you are using EMA.
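
Two common community heuristics for that LR adjustment are the linear-scaling and square-root rules. Neither is exact for any given optimizer, so treat the output as a first guess to experiment around:

```python
import math

def scaled_lr(base_lr, base_batch, new_batch, rule="sqrt"):
    """Rescale LR when batch size changes by factor k = new/old.
    'linear' multiplies LR by k; 'sqrt' by sqrt(k). Both are rough
    community heuristics, not guarantees."""
    k = new_batch / base_batch
    if rule == "linear":
        return base_lr * k
    return base_lr * math.sqrt(k)

# e.g. moving from batch=1 at LR=1e-6 to batch=8:
# linear rule suggests 8e-6, sqrt rule suggests ~2.8e-6.
```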

VRAM management

Fragmentation problems

If you are doing long runs, sometimes even though you make it through a full epoch just fine... it may bomb out on epoch 3, 4, or later due to fragmentation. (It may mention this in the copious output, along with the suggestion below.)
If so, setting this environment variable may indeed make the problem go away:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

(sorry to windows users... you'll have to look up how to do that on your systems)
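
One cross-platform alternative (a sketch, not something I have tested on every setup): set the variable from inside Python, at the very top of your training script. The allocator only reads it when torch first initializes CUDA, so it must run before `import torch`; done that way, it works the same on Windows, where there is no `export`.

```python
import os

# Must run before "import torch" (the CUDA allocator reads this at init).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```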

Using a tagger on realworld photos

(A tagger is a caption model that outputs only "tags", not natural language descriptions)

WD tagger

Annoyingly, "WD" is the only AI model I know of that reliably produces one-word tags.

"Annoyingly" because it is optimized for anime ("WaifuDiffusion"), and so some of the things it outputs are unnecessary, or even wrong.

Here's the setting I use right now in taggui to skip those:

wd_tagger_tags_to_exclude="1girl,k-pop,asian,cosplay,photorealistic,photo background,realistic,solo,multicolored hair,real world location,horror (theme)"

Do remember to then add in "woman", since 1girl is no longer present.
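
The same exclusion-plus-swap can be sketched as a hypothetical post-processing step in plain Python (the `clean_tags` name is my own, not a taggui function):

```python
# Hypothetical post-processing of WD tagger output: swap "1girl" for
# "woman", then drop the anime-centric tags from the exclusion list above.

EXCLUDE = {
    "1girl", "k-pop", "asian", "cosplay", "photorealistic",
    "photo background", "realistic", "solo", "multicolored hair",
    "real world location", "horror (theme)",
}

def clean_tags(tags):
    swapped = ["woman" if t == "1girl" else t for t in tags]
    return [t for t in swapped if t not in EXCLUDE]

# clean_tags(["1girl", "solo", "long hair"]) -> ["woman", "long hair"]
```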

Multiple versions of WD

Keep in mind that there are now many sub-versions of WD14. It is currently up to v3, and even that comes in two vastly different variants.

You should probably try out both on your specific dataset to see which one gives you results you like more.

Yolo tagger

I just (re)discovered that YOLO is REALLY good at real-world tagging... except that it refuses to identify gender. Sigh. So in theory, you could tag everything with both WD and YOLO, then merge just the gender tags from WD.
Except that WD isn't THAT good at identifying gender.

"Bad skin"

Doing a query via a light model like moondream, for "does the person have bad skin?" can identify lots of problematic images.

Whats with the outliers

Not a tip, but a question to myself:
What's with the random outlier epoch results?
These are samples from epochs 8, 9, 10, and 11.
AI training is weirrrddd...

Comparison of batch size, accum, LR, and Loss

For these comparisons, I am using FP32 and DADAPT-LION.

Same settings and dataset across all of them, except for batch size and accum.

Analysis

Note that D-LION somehow automatically, intelligently adjusts LR to what is "best". So it's nice to see that it is adjusting basically as expected: LR goes higher with the virtual batch size.
Virtual batch size = (actual batch size x accum)

I was surprised, however, to see that smooth loss did NOT track virtual batch size. Rather, it seems to increase with the accum factor.

Similarly, it is interesting to note that the effective warmup period chosen by D-LION appears to vary with the accum factor, not strictly with the virtual batch size, or even the physical batch size.
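
The virtual-batch relationship can be sketched as a plain gradient-accumulation loop (framework-free; `train_steps` is my own illustrative name):

```python
# Sketch of what "accum" does: gradients from `accum` micro-batches are
# averaged before a single optimizer step, so each step effectively sees
# a virtual batch of (batch_size x accum) samples.

def train_steps(micro_batch_grads, accum):
    """Count optimizer steps for a run where gradients are applied
    once every `accum` micro-batches."""
    steps = 0
    grad_buffer = 0.0
    for i, grad in enumerate(micro_batch_grads, start=1):
        grad_buffer += grad / accum      # accumulate the averaged gradient
        if i % accum == 0:
            steps += 1                   # one optimizer step per window
            grad_buffer = 0.0            # reset for the next window
    return steps

# 32 micro-batches of batch_size 2 with accum=4 -> virtual batch 8,
# and only 8 optimizer steps for the whole run.
```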
