
Stop training steps above 250 for any resolution above 1024x1024


Apr 5, 2025



Many image-based neural networks rely on the L2 loss, also known as Mean Squared Error (MSE), largely because squaring the error heavily penalizes large differences between the network's prediction and the target image while staying lenient with smaller errors. The practical benefit is that MSE encourages the model to find an overall "best fit" output. That usually helps the model avoid overfitting to random details in the data, because the model is pushed to balance out errors across the entire image. As a result, you often end up with smoother outputs that generalize decently in many scenarios.
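To make the quadratic penalty concrete, here is a small sketch (using PyTorch, with made-up numbers) comparing how MSE scores one large error against several small ones that add up to the same total:

```python
import torch

# Two predictions with the same total absolute error (0.4),
# compared against an all-zero target.
target = torch.zeros(4)
pred_small = torch.full((4,), 0.1)               # four small errors of 0.1
pred_large = torch.tensor([0.4, 0.0, 0.0, 0.0])  # one large error of 0.4

mse = torch.nn.MSELoss()
loss_small = mse(pred_small, target)  # (4 * 0.1**2) / 4 = 0.01
loss_large = mse(pred_large, target)  # 0.4**2 / 4 = 0.04
# MSE rates the single large error four times worse, even though
# the total absolute error is identical.
```

This is the "lenient with small errors, harsh on big ones" behavior in miniature.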

However, if you train a model at a certain resolution, such as 512x512, and then upscale its outputs to 1024x1024, you may run into an unexpected beneficial side effect. Because the loss was so lenient at the lower resolution, the model never explicitly learned to suppress finer details at the higher scale. For instance, if the model has already learned to render close-up details of a hand at the lower resolution, explicitly training it at higher scales can inadvertently degrade quality by introducing artifacts that propagate upward. This phenomenon explains why a model like Pony performs better at latent upscaling than Illustrious-XL.

Why L1 Can Help

L1 loss, sometimes called Mean Absolute Error (MAE), measures the absolute difference between the predicted and target pixel values. Because its penalty grows linearly rather than quadratically, L1 does not let the model shrug off small deviations. It forces the network to pay attention to every pixel difference, leading to outputs that can appear sharper, especially in higher-resolution tasks.
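One way to see this difference is through the gradients. In this toy PyTorch sketch, L1's gradient keeps a constant magnitude no matter how small the residual gets, while MSE's gradient shrinks along with the error:

```python
import torch

grads = {}
for err in (0.5, 0.05):  # one large residual, one small residual
    pred = torch.tensor([err], requires_grad=True)
    target = torch.zeros(1)

    torch.nn.L1Loss()(pred, target).backward()
    g_l1 = pred.grad.item()   # sign(err): stays at 1.0 regardless of size
    pred.grad = None

    torch.nn.MSELoss()(pred, target).backward()
    g_l2 = pred.grad.item()   # 2 * err: vanishes as the error shrinks
    grads[err] = (g_l1, g_l2)
```

So under L1, a pixel that is only slightly off still receives the same corrective push as one that is far off, which is exactly why small deviations keep getting fixed.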

Yet, there is a trade-off. When you adopt L1, you might lose some overall "richness" in the images because the model is not averaging out smaller errors anymore. Instead, it is forced to correct each pixel more aggressively, which can make the images look flatter or less dynamic if the underlying data is noisy or has subtle variations.

Finetuning as a Middle Ground

One compromise is to train the model primarily with MSE at a baseline resolution, then finetune (further train) it at a higher resolution using L1. The initial MSE-based training phase helps the model learn a broad structure that generalizes well. Then, during finetuning, the L1 loss drives the model to pay more attention to detail, enhancing high-resolution fidelity without having to learn everything from scratch.
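A minimal sketch of this two-stage recipe in PyTorch; the tiny convolution, random data, identity target, resolutions, and step counts are all placeholders for your real model and dataset:

```python
import torch
from torch import nn

# Placeholder model: substitute your real network.
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_phase(criterion, resolution, steps):
    """Run a few steps against random stand-in data at one resolution."""
    for _ in range(steps):
        x = torch.rand(2, 3, resolution, resolution)
        loss = criterion(model(x), x)  # identity target as a stand-in
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

# Phase 1: learn broad structure with MSE at the base resolution.
train_phase(nn.MSELoss(), resolution=64, steps=20)
# Phase 2: finetune with L1 at the higher resolution for sharper detail.
final_loss = train_phase(nn.L1Loss(), resolution=128, steps=20)
```

The important structural point is simply that the optimizer and weights carry over between phases; only the criterion and the input resolution change.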

This approach can help avoid the pitfalls of purely using MSE or purely using L1 from the start. The model gets both the smoothness of MSE training and the sharper detail from the L1 finetuning stage.

Summary and Practical Tips

  1. MSE is great for encouraging your model to be broadly accurate. It emphasizes large errors, pushing the model toward an overall stable output.

  2. L1 loss makes each pixel's difference matter. You will often get sharper, more detailed images at the risk of them sometimes looking overly "flat" or "dry."

  3. If you want to train at one resolution and then jump to a higher resolution, consider finetuning with L1. This can help the model handle those high-resolution details without having to start training from scratch.

  4. Always check the results visually and quantitatively. Metrics are not everything. The final images need to meet aesthetic or domain-specific quality standards.

In practice, it often comes down to experimentation. Different datasets respond differently to each loss function. You might discover that a hybrid approach, such as gradually mixing in L1 or switching from MSE to L1 after some iterations, works best.
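One such hybrid, sketched here as a hypothetical helper (the linear schedule is just one reasonable choice): anneal from pure MSE to pure L1 over the course of training:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred, target, step, total_steps):
    """Blend MSE and L1, shifting weight toward L1 as training progresses."""
    alpha = min(step / total_steps, 1.0)   # 0.0 = pure MSE, 1.0 = pure L1
    return (1 - alpha) * F.mse_loss(pred, target) + alpha * F.l1_loss(pred, target)
```

At step 0 this is plain MSE; by the final step it has become plain L1, with a smooth handoff in between instead of an abrupt switch.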

Remember that in all of this, the nature of your dataset, your model architecture, and your ultimate goals play huge roles in deciding which loss function (or combination) yields the best results.
