Today Stability AI, together with its multimodal AI research lab DeepFloyd, announced the research release of DeepFloyd IF, a powerful text-to-image cascaded pixel diffusion model.

DeepFloyd IF is a state-of-the-art text-to-image model released on a non-commercial, research-permissible license that provides an opportunity for research labs to examine and experiment with advanced text-to-image generation approaches. In line with other Stability AI models, Stability AI intends to release a DeepFloyd IF model fully open source at a future date.

Application of text description into images: The generation pipeline uses the large language model T5-XXL-1.1 as a text encoder. A large number of text-image cross-attention layers also provides better alignment between prompt and image. Incorporating the intelligence of the T5 model, DeepFloyd IF generates coherent and clear text alongside objects of different properties appearing in various spatial relations. Until now, these use cases have been challenging for most text-to-image models.

The model's photorealism is reflected by its zero-shot FID score of 6.66 on the COCO dataset (FID is a main metric used to evaluate the performance of text-to-image models; the lower the score, the better).

DeepFloyd IF can generate images with a non-standard aspect ratio, vertical or horizontal, as well as the standard square aspect.

Image modification is conducted by (1) resizing the original image to 64 pixels, (2) adding noise through forward diffusion, and (3) using backward diffusion with a new prompt to denoise the image (in inpainting mode, the process happens in the local zone of the image). The style can be changed further through super-resolution modules via a prompt text description. This approach gives the opportunity to modify style, patterns and details in the output while maintaining the basic form of the source image, all without the need for fine-tuning.

DeepFloyd IF is a modular, cascaded, pixel diffusion model. We break down the definitions of each of these descriptors here:

Modular: DeepFloyd IF consists of several neural modules (neural networks that can solve independent tasks, like generating images from text prompts and upscaling) whose interactions in one architecture create synergy.

Cascaded: DeepFloyd IF models high-resolution data in a cascading manner, using a series of individually trained models at different resolutions. The process starts with a base model that generates unique low-resolution samples (a ‘player’), which are then upsampled by successive super-resolution models (‘amplifiers’) to produce high-resolution images.

Diffusion: DeepFloyd IF’s base and super-resolution models are diffusion models, where a Markov chain of steps is used to inject random noise into data before the process is reversed to generate new data samples from the noise.

Pixel: DeepFloyd IF works in pixel space. The diffusion is implemented on a pixel level, unlike latent diffusion models (like Stable Diffusion), where latent representations are used.

A text prompt is passed through the frozen T5-XXL language model to convert it into a qualitative text representation.

Stage 1: A base diffusion model transforms the qualitative text into a 64x64 image. This process is as magical as witnessing a vinyl record’s grooves turn into music. The DeepFloyd team has trained three versions of the base model, each with different parameters: IF-I 400M, IF-I 900M and IF-I 4.3B.

Stage 2: To ‘amplify’ the image, two text-conditional super-resolution models (Efficient U-Net) are applied to the output of the base model. The first of these upscales the 64x64 image to a 256x256 image. Again, several versions of this model are available: IF-II 400M and IF-II 1.2B.

Stage 3: The second super-resolution diffusion model is applied to produce a vivid 1024x1024 image. The final third stage model IF-III has 700M parameters.

Note: We have not released this third-stage model yet; however, the modular character of the IF model allows us to use other upscale models, like the Stable Diffusion x4 Upscaler, in the third stage.

DeepFloyd IF was trained on a custom high-quality LAION-A dataset that contains 1B (image, text) pairs. LAION-A is an aesthetic subset of the English part of the LAION-5B dataset and was obtained after deduplication based on similarity hashing, extra cleaning, and other modifications to the original dataset. DeepFloyd’s custom filters were used to remove watermarked, NSFW and other inappropriate content.

As a new model, we are initially releasing DeepFloyd IF under a research license.
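
The first two steps of the zero-shot image modification procedure described above (resizing the source image to 64 pixels, then adding noise through forward diffusion) can be illustrated with a small sketch. This is not DeepFloyd IF's code: the function names are hypothetical, the image is a single-channel list of rows rather than an RGB tensor, and the linear beta schedule with 1000 steps is an assumption borrowed from standard DDPM-style diffusion, not a detail confirmed by the announcement. Step (3), denoising with a new prompt, requires the trained model and is left out.

```python
import math
import random


def box_downsample(image, size=64):
    """Resize a square single-channel image (list of rows) to size x size
    by averaging non-overlapping blocks. Hypothetical helper for illustration."""
    side = len(image)
    if side % size != 0 or any(len(row) != side for row in image):
        raise ValueError("expected a square image whose side is a multiple of size")
    k = side // size
    out = []
    for r in range(size):
        row = []
        for c in range(size):
            block = [image[r * k + i][c * k + j] for i in range(k) for j in range(k)]
            row.append(sum(block) / (k * k))
        out.append(row)
    return out


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Product of (1 - beta_s) over the first t steps of a linear beta
    schedule (an assumed schedule, not necessarily the one IF uses)."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def forward_diffuse(image, t, rng, num_steps=1000):
    """Sample x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, eps ~ N(0, 1),
    applied independently to every pixel."""
    abar = alpha_bar(t, num_steps)
    signal, noise = math.sqrt(abar), math.sqrt(1.0 - abar)
    return [[signal * px + noise * rng.gauss(0.0, 1.0) for px in row] for row in image]


rng = random.Random(0)
# Synthetic 256x256 "source image" with values in [-1, 1].
source = [[math.sin(r / 20.0) * math.cos(c / 20.0) for c in range(256)] for r in range(256)]
low_res = box_downsample(source, size=64)      # step (1): resize to 64x64
noisy = forward_diffuse(low_res, t=400, rng=rng)  # step (2): add noise
```

The choice of `t` controls how much of the source survives: a small `t` keeps most of the original structure, while a `t` near `num_steps` leaves almost pure noise, so the denoising step would have more freedom to follow the new prompt.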