Text-to-image for my inbox

Ambient personal illustration

Prompt: "a torrent of email"

An AI advance I did not expect to happen as quickly as it did: open-domain text-to-image synthesis sort of works now. You type in a phrase describing an image you want, and get back a generated image that, a lot of the time, resembles what you asked for.

Like many people, I'm using Stable Diffusion, an open-source model about 4 gigabytes in size that was released publicly a few weeks ago.[1] It can generate images on consumer-level hardware: a matter of seconds if you have a recent enough GPU, or a matter of minutes even on CPU.
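
For reference, generating a single image locally takes only a few lines with the Hugging Face diffusers library. A minimal sketch (the model id and settings here are illustrative, not necessarily exactly what I run):

```python
# Minimal local text-to-image generation with Stable Diffusion via diffusers.
# Model id and dtype are illustrative; drop float16 and .to("cuda") to run (slowly) on CPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe("a torrent of email").images[0]
image.save("torrent_of_email.png")
```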

People have many opinions about the implications for art. That's an interesting discussion, but not one where I have much to add. I am personally most interested in whether this class of models can be used as a component of a generative system for ambient personal visualization. Specifically, I have been using it to visualize my email inbox.

Inspiration: Office Plant #1

Around 1997, Michael Mateas (who I did my PhD with) and Marc Böhlen built Office Plant #1, a motorized sculpture whose concept I like. It monitored a user's mailbox and reconfigured itself based on the contents of the email stream. To quote from the paper's dryly humorous abstract:

Walk into a typical high-tech office environment and among the snaking network wires, glowing monitors, and clicking keyboards, and one is likely to see a plant. Unfortunately, this plant is often dying; it is not well-adapted to the office environment. Here the authors discuss their work Office Plant #1, an exploration of a technological object adapted to the office ecology that fills the same social and emotional niche as a plant. Office Plant #1 monitors both the ambient light level and its owner's email activity. Its robotic body, reminiscent of a desert plant in form, responds with slow, rhythmic movements and quiet ambient sound. Office Plant #1 is a new instantiation of our notion of intimate technology, that is, technologies which address human needs and desires as opposed to technologies which meet exclusively functional task specifications.

You can also see a video here. The plant acts as a kind of information visualization, but filtered indirectly through an algorithmic system. Another system with related motivations is Tableau Machine, which used a summary of activity taking place in Georgia Tech's sensor-laden Aware Home to drive a screen displaying an abstract generative artwork.

Is text-to-image synthesis useful for that kind of indirect, personalized visualization? I've been doing some preliminary investigations.

Inbox-driven image synthesis, attempt 1

Here are 20 randomly chosen recent subject lines from my email inbox, just fed verbatim as Stable Diffusion prompts:

This is actually sort of interesting, especially watching it update periodically. But overall the aesthetic doesn't really work for me. The biggest culprit is that some of the output looks modeled on webpages or PowerPoint slides (which probably appear in the training data).

For example, the top-left prompt was "Meeting Reminder", and the first prompt in the second row was "Join Us at AI Defense Forum in Pentagon City, VA". In general, the attempts to include text feel out of keeping with the goal of abstracted visualization, although the extra E in IEEEE is funny.
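
For concreteness, this attempt amounts to roughly the following plumbing, assuming an IMAP mailbox (server details, credentials, and the exact sampling are placeholders, not necessarily my actual setup):

```python
# Sample recent subject lines from an IMAP inbox; each one is then fed
# verbatim to the model as a prompt. Host, login, and sampling are placeholders.
import email.header
import imaplib
import random

def recent_subjects(host, user, password, n=20):
    imap = imaplib.IMAP4_SSL(host)
    imap.login(user, password)
    imap.select("INBOX", readonly=True)
    _, data = imap.search(None, "ALL")
    ids = data[0].split()[-200:]  # only consider the most recent messages
    subjects = []
    for msg_id in random.sample(ids, min(n, len(ids))):
        _, msg_data = imap.fetch(msg_id, "(BODY.PEEK[HEADER.FIELDS (SUBJECT)])")
        raw = msg_data[0][1].decode(errors="replace")
        encoded = raw.partition(":")[2].strip()
        # decode MIME encoded-words like =?utf-8?q?...?=
        decoded = "".join(
            part.decode(charset or "utf-8", errors="replace")
            if isinstance(part, bytes) else part
            for part, charset in email.header.decode_header(encoded))
        subjects.append(decoded)
    imap.logout()
    return subjects
```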

Inbox-driven image synthesis, attempt 2

Fortunately it is fairly easy to get different aesthetics out of this model by adding some adjectives or styles.

Here are the same 20 randomly chosen subject lines, but with "oil on canvas" appended to the prompt:

Now this is starting to get interesting, even with such minimal prompt engineering!
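
In code, the change from the first attempt is a one-liner; exactly how the style phrase gets joined on (comma, period, plain space) is a detail I'm glossing over:

```python
# Same subject lines as before, with a style phrase appended to each prompt.
subjects = ["Meeting Reminder", "Join Us at AI Defense Forum in Pentagon City, VA"]  # etc.
prompts = [f"{subject}, oil on canvas" for subject in subjects]
```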

Making this a more ambient visualization by sticking it on my iPad:
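
The glue for that is simple. One way to do it (a sketch only; grid size, file names, and the refresh mechanism are placeholders) is to stitch the latest outputs into a grid with Pillow, serve the result over HTTP, and leave a browser on the iPad pointed at it:

```python
# Stitch generated images into one collage and write it out; a browser on the
# iPad can then display it, e.g. via `python -m http.server` in this directory.
# Grid size, tile size, and file names are placeholders.
from PIL import Image

def make_collage(paths, cols=5, tile=512, out="collage.png"):
    rows = (len(paths) + cols - 1) // cols
    collage = Image.new("RGB", (cols * tile, rows * tile))
    for i, path in enumerate(paths):
        img = Image.open(path).resize((tile, tile))
        collage.paste(img, ((i % cols) * tile, (i // cols) * tile))
    collage.save(out)
```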

Levels of abstraction

I think there's a lot of potential in using these image synthesis models as (semi-)interactive generative systems, in places where you might otherwise have used a hand-written algorithmic generative system rather than a machine-learned model.

There are some challenges in getting an interesting level of "systematicity" and abstraction. For me, algorithmic and generative art is interesting partly because there is a relationship between input and output that is readable (perhaps with effort), but not too direct and simplistic. A machine-learned model like this risks producing output that is either too literal on the one hand, or too black-box on the other. But it is interesting that it "automatically" pulls in some representational, semantic content (unlike purely abstract algorithmic art), which has its own benefits.

Going more abstract with still-minimal prompt engineering: instead of "oil on canvas", we could append "abstract painting" and get this:

There are a few glitches, but this is a decent starting point if you want some abstract paintings as a basis for a generative art system. The bigger challenge is to have them change over time in interesting and readable ways in response to input, which is something you get by construction with algorithmic art written by a programmer.

Finally, the only aggregation process I've been using here is juxtaposition into a collage. Office Plant #1 and Tableau Machine (mentioned above as inspiration) do data preprocessing and aggregation before using the result to drive a generative art system; I have not experimented with doing that with my email yet.
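
Purely as a toy sketch of what that could look like (again, I have not tried this; the statistics, thresholds, and mappings here are entirely made up), the idea would be to reduce the inbox to a few summary features and let those pick the prompt, rather than feeding subject lines through verbatim:

```python
# Hypothetical aggregation step: reduce the inbox to a few summary statistics
# and map them to a single prompt. Thresholds and wordings are placeholders.
def inbox_to_prompt(num_messages_today, num_unread, busiest_sender):
    if num_messages_today > 100:
        scene = "a violent storm at sea"
    elif num_unread > 20:
        scene = "a cluttered desk piled high with paper"
    else:
        scene = "a calm garden at dusk"
    # busiest_sender could steer palette or style; left unused in this sketch
    return f"{scene}, abstract painting"
```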

Endnotes

[1] How text-to-image systems got here is beyond the scope of this post, but a few links: