
The Art of Consistency — Episode 4: Evolution and Insight


How did I get here? Now I'm going to try to piece together the past week of experiments in The Art of Consistency.

My Normal Process

  • create a prompt generator

  • generate prompts

  • generate images from the prompts

  • use the prompts saved as text files paired up with the images they generated as LoRA training data

  • train a LoRA using the image/text pairs

  • use the same prompt generator to generate new prompts

  • use the new prompts and the newly trained LoRA to create a new set of images
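The core of the pipeline above is the image/text pairing convention: each caption lives in a `.txt` file with the same stem as its image. A minimal sketch of that step (the helper, filenames, and folder layout are my own illustrative assumptions, not the author's exact code):

```python
from pathlib import Path

def save_training_pair(image_bytes: bytes, prompt: str, stem: str,
                       out_dir: str = "dataset") -> None:
    """Save an image and its prompt under matching filenames, the
    pairing convention most LoRA trainers expect."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / f"{stem}.png").write_bytes(image_bytes)
    (out / f"{stem}.txt").write_text(prompt, encoding="utf-8")

# Pair one generated image with the prompt that produced it.
save_training_pair(b"\x89PNG...", "elderly man, spry on his feet, youthful soul",
                   "img_0001")
```

Any LoRA trainer that reads caption files alongside images can consume a folder built this way.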

LoRA Training Tags

When something is not quite right, or I start to realize there is some sort of disconnect, I typically do not know exactly what the problem is up front. This leads to a series of experiments as my brain searches for a way to connect the problem with my conscious mind.

That may sound completely ridiculous but it is as close to reality as I can get right now in explaining how my brain works.

Here is the problem with my "normal process": after generating the first set of images, to which the generated prompts will be paired for LoRA training, there can be disconnects between what the prompt asked for and what Stable Diffusion actually produced during generation.

example prompt: elderly man, who is spry on his feet and a youthful soul

image example 1: young girl, jumping in air, mystical appearance
image example 2: old woman, laughing while on trampoline
image example 3: old man, playing fetch with his dog

Image example 3 is the closest to what the prompt actually asked for, yet it still does not exactly match the prompt that generated it.

Why Does This Occur?

Knowledge, bias, and prior training of the base model, checkpoint, LoRA, sampler, and even the way the prompt was constructed.

Each of the above components was trained on what came before it, the previous version so to speak. A LoRA may have 1,000 images or fewer as its training data, where every image has words attached that teach the LoRA how to read a prompt and adjust the checkpoint. A checkpoint can have hundreds of thousands of images and words used to fine-tune the base model. Base models are built from millions of images and words.

If 60% of those images are female and only 40% are male, there is a large chance the data will be skewed towards female or feminine features. If the words used to tag the images are not correct or detailed with respect to the image the words represent, the model can learn to identify concepts incorrectly and thus will reproduce those concepts incorrectly.

I could keep explaining, but it's more of the same concepts at smaller or larger scales, so I will move on.

The Solution

I originally created a customGPT capable of taking an image and generating a set of tags from specific instructions. Since I am working with people, I had a set of categories I wanted filled out.

gender, age, ethnicity, skin tone, height, expression, hair color… yada yada.

The customGPT works very well: I upload an image, and within a minute or two I receive a well-structured prompt that actually represents the image. That is very different from using the prompt that generated the image in the first place, which was interpreted through several layers of bias during the generation process.
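For illustration, a structured reply for one image might look something like this (the exact field names and values are my assumption, based on the categories listed above):

```json
{
  "gender": "female",
  "age": "elderly",
  "ethnicity": "East Asian",
  "skin_tone": "light",
  "height": "short",
  "expression": "warm smile",
  "hair_color": "silver"
}
```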

This method is not without its own inherent problems. ChatGPT does not like to talk about images of

“children, nudity, adult themes, ethnicity,…”

I don't know the extent of this list, but I can gather it from ChatGPT when I have him check my work before I submit the article, lol.

“OpenAI’s models — including ChatGPT — are designed with safety filters that limit discussions around certain sensitive topics, especially when images are involved. These include, but aren’t limited to, children, nudity, explicit content, ethnicity, and other identity-related descriptors. The intent behind these filters is to prevent misuse and protect against harmful outputs, but they can sometimes create friction for artists and researchers trying to build inclusive, diverse datasets. Striking the right balance between ethical guardrails and creative freedom is still an evolving challenge in AI design.” -ChatGPT 4o

The problem arises when trying to build a diverse, descriptive set of data. In my case the goal is a people generator, where people from all over the world can generate people who actually look like them. That requires skin tone, ethnicity, ethnic or traditional clothing styles, and people of all ages.

In OpenAI's attempt to regulate images of children or discrimination based on skin color or ethnicity, they have made it harder to be more inclusive, because the system at times refuses to provide tags on these topics.

When working with large data sets, or even relatively small ones of 50 to 100 images, sending each image to the customGPT and waiting for a response is very time consuming. On top of that, I have to copy and paste the data into a text file and save it under the same name as the image file. In other words, it is incredibly laborious and not at all feasible with meaningfully larger data sets.

To solve for these issues, the process needed to evolve. What follows is a discussion of that evolution.

The Evolution

I have not made much use of OpenAI’s API, not from a lack of interest but from a lack of an idea sufficiently good to warrant the cost. I am not a wealthy person by any means, and I spend a good deal of the money I have on AI-related interests with a hope that one day my work will be noticed.

I took my customGPT instructions and wrote a Python script that uses those instructions as the system instructions when communicating with OpenAI's API. For every image in a folder, the script sends the image to the API and receives back structured JSON data, which is saved under the same name as the image. It then flattens the JSON into a single line and saves that as a text file, again with the same name as the image.

I now have three folders: the image folder, the JSON folder, and the text folder. Once all of the images in the image folder are processed, each JSON is compiled into a single master JSON containing all of the tags given by OpenAI's chat completion endpoint. Finally, all of the images and text files are zipped into a single file ready to be used for LoRA training.
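A condensed sketch of what such a script can look like, assuming GPT-4o vision through the official `openai` package. The model name, folder names, abridged system prompt, and the flattening rule are my assumptions, not the author's exact code:

```python
import base64
import json
from pathlib import Path

# The customGPT instructions, reused as the API system prompt (abridged here).
SYSTEM = ("Describe the person in this image as JSON with keys: "
          "gender, age, ethnicity, skin_tone, height, expression, hair_color.")

def flatten_tags(tags: dict) -> str:
    """Flatten structured JSON tags into the single comma-separated
    line that goes in the caption .txt file for LoRA training."""
    return ", ".join(str(v) for v in tags.values() if v)

def tag_image(path: Path) -> dict:
    """Send one image to the chat-completions endpoint and parse the JSON reply."""
    from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY
    client = OpenAI()
    b64 = base64.b64encode(path.read_bytes()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
    )
    return json.loads(resp.choices[0].message.content)

def process_folder(images="images", jsons="json", texts="text"):
    """Tag every image, save per-image JSON plus flattened .txt captions,
    then compile everything into a single master JSON."""
    Path(jsons).mkdir(exist_ok=True)
    Path(texts).mkdir(exist_ok=True)
    master = {}
    for img in sorted(Path(images).glob("*.png")):
        tags = tag_image(img)
        (Path(jsons) / f"{img.stem}.json").write_text(json.dumps(tags, indent=2))
        (Path(texts) / f"{img.stem}.txt").write_text(flatten_tags(tags))
        master[img.stem] = tags
    Path("master.json").write_text(json.dumps(master, indent=2))
```

Because each caption file shares its image's stem, the three folders stay in lockstep and zipping images with their text files gives a trainer-ready data set.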

It isn’t cheap, but the quality of the tags for LoRA training is superb. No longer do I have green hair tagged as yellow. No longer do I have elderly people tagged as adults.

The generalization that would occur from:

  • improperly trained models, checkpoints, and loras

  • how my prompts were interpreted

  • the resultant generated images

was no longer present in my training data set. Instead, I had a set of tags for each image that represented only what was actually in the image, instead of what I originally wanted in the image.

This was a huge evolution in my thought process.

What I Did Next

I took 500 images that I had previously generated, consisting only of men, because more often than not data sets are skewed towards women, a bias created by the following imbalance:

The tech industry is largely dominated by men, with women making up only about 25–28% of the workforce, and this disparity is even more pronounced in leadership roles. This gender gap is evident from early education choices to career progression and can impact pay, opportunities, and overall workplace culture.

-Gemini

and

While precise figures vary depending on the study and methodology, estimates suggest that around 85–90% of men identify as heterosexual, while 2–6% identify as gay or bisexual.

-Gemini

which I would assume means that, more often than not, when a model, checkpoint, or LoRA is created, approximately 75% of the time it is by a man, and approximately 85–90% of those men identify as heterosexual and would therefore be more likely to generate images of women than men. This is definitely an assumption, but I believe it is supported by my observations over the last two and a half years of working with AI image generation.

I generated 500 images, focusing only on men, in an effort to diversify the available data from which to generate images of men. From those 500 images I used my Python script and OpenAI’s API to generate tags relevant to each image, and proceeded to train a LoRA on the new dataset.

The LoRA I described above is going to be better at generating accurate images of men than any of my LoRAs to date. I don’t know this for a fact because I have not tested it, but I am fairly certain this will be the case. I will discuss the outcome in a following episode of The Art of Consistency.

The reason I have not yet tested the LoRA is because I got sidetracked by further insight into this process, which I will describe next.

The Insight

While the LoRA was training, I began to think about this article that I am constructing currently. I thought about my goals of consistent character generation and how the LoRA trained on 500 images of men would help me in that endeavor. I came to the realization that I could advance my idea further using data that I had already generated.

In the first Episode, I showed 80 images of a young woman with short red hair and freckles. The images were presented as groups of three images generated from a single prompt. How can I take my new insights and improve upon her consistency? If I was to use the API to create tags and train a LoRA only with her images, could I create more consistency in her generation?

The problem with the process I used with the 500 images of men is that I am tagging both physical characteristics and changeable characteristics. This is great for generating new characters, because you can combine different physical characteristics in different ways to obtain new characters.

The goal here is different: we want the physical character to remain the same while the clothes, hair style, environment, lighting, camera angle, etc., change, just like in real life. How can I solve the issue of this difference in perspective?

The Answer

I needed to adjust the system instructions that are sent to the API. The goal is to leave out all physical characteristics and tag only the changeable items. I also needed to add an activation phrase to every set of tags for each image. This activation phrase serves as a location for all non-tagged data to be trained into. In other words, the activation phrase would in effect contain all of the physical characteristics of that singular character, averaged and synthesized into a single phrase.

The requirement for this to work is a set of images that all roughly approximate the same individual. I already had 80 images of a character from the first episode, so I reran the Python script to generate tags (minus the physical tags) on those 80 images, then trained a new LoRA with an activation phrase added to each image's caption.

After the LoRA finished training, I generated 50 prompts from the master json containing all tags used in training and added the activation phrase to each prompt as the first concept phrase or tag. I generated a set of three images for each of the 50 prompts. What follows are the 150 images generated using this technique in The Art of Consistency.
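The prompt-assembly step above can be sketched like this. How tag sets are sampled from the master JSON is my assumption; the one fixed rule from the text is that the activation phrase always comes first:

```python
import random

def build_prompts(master: dict, phrase: str = "PP.ANNIE",
                  n: int = 50, seed: int = 0) -> list[str]:
    """For each prompt, pick one image's tag set at random from the
    master JSON and prepend the activation phrase as the first concept."""
    rng = random.Random(seed)
    tag_sets = list(master.values())
    prompts = []
    for _ in range(n):
        tags = rng.choice(tag_sets)
        prompts.append(", ".join([phrase] + [str(v) for v in tags.values() if v]))
    return prompts

# Hypothetical master JSON entry with only changeable characteristics tagged.
master = {"img_0001": {"clothing": "denim jacket", "lighting": "golden hour"}}
print(build_prompts(master, n=1)[0])  # PP.ANNIE, denim jacket, golden hour
```

Because the physical traits were never tagged, the activation phrase is the only token carrying the character's identity into each new prompt.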

Generation Parameters

All images were generated in Stable Diffusion using an SDXL checkpoint. I am using the Automatic1111 WebUI. The checkpoint and VAE can be downloaded from the provided links. Everything else is manageable from within the Auto1111 UI.

SD Checkpoint: XL\albedobaseXL_v31Large.safetensors [c379d154eb]

SD VAE: sdxl_vae.safetensors

Sampler: Euler a

Schedule: Automatic

Sampling Steps: 35

Width: 832

Height: 1216

Batch Count: 3

CFG: 7

Seed: -1

LoRA: artfullyANNIE.PEOPLEPROJECT_V1_SDXL

LoRA Strength: 0.5

LoRA Activation Phrase: PP.ANNIE

I have not yet made the ANNIE.PEOPLEPROJECT LoRA available to the public, as I am still exploring its possibilities. If and when I do, I will update this and every other article that uses it with a download link. If you want access to the LoRA, drop a message in the comment section.



Link to Article on Medium

The only difference is the ability to see the 150 generated images.

https://artfullyprompt.medium.com/the-art-of-consistency-episode-4-e11ae2a0d812

👥 The link above is for Medium members. If you're not a Medium member and want to read the article for free, just drop your name below and follow me—I'll DM you a friend link so you can check it out for free!



Conclusion

My assumptions were correct. The activation phrase did serve as a container for the character’s physical characteristics — no question about it. Looking through the 150 generated images, it’s obvious that the same character is present in every one. But unlike Episode 1, where she sometimes felt stiff or inconsistent, she now feels believable. There’s more variety in her expressions, poses, clothing, and even mood. She looks like the same person captured across different moments, instead of just a loosely related visual theme.

This approach not only gave me stronger results, but more confidence in the method moving forward. In Episode 5, I’ll return to the male-only LoRA trained on 500 images and test how well it performs when building out new characters. If what I’ve seen so far is any indication, this is only the beginning of something much bigger.
