Abstract: Recent advancements in text-to-image diffusion models have shown impressive results but often fail to fully capture user’s intent. Existing methods combining textual inputs with bounding ...