Our paper "Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization" is available at https://arxiv.org/abs/2412.18525.
📚Dataset (Explanatory-based Vision Tasks) | 📚Dataset (Terminological-based Vision Tasks) | 🤗 Model (UVT-7B-448)
Our code builds on Chameleon, Lumina-mGPT, and LLaMA2-Accessory.
Because the Chameleon implementation in transformers does not currently include the VQ-VAE decoder, please manually download the original VQ-VAE weights provided by Meta and place them in the following directory:
- ckpts/
  - chameleon/
    - tokenizer/
      - text_tokenizer.json
      - vqgan.yaml
      - vqgan.ckpt
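Before running inference, it can help to confirm that the downloaded files ended up where the layout above expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_tokenizer_files` and its `root` parameter are hypothetical, and it assumes `ckpts/` sits directly under the directory you pass in.

```python
from pathlib import Path

# File names taken from the directory listing above.
REQUIRED_FILES = ("text_tokenizer.json", "vqgan.yaml", "vqgan.ckpt")


def missing_tokenizer_files(root="."):
    """Return the expected tokenizer/VQ-VAE files that are absent under root.

    `root` is an assumption: the directory that contains `ckpts/`.
    """
    tokenizer_dir = Path(root) / "ckpts" / "chameleon" / "tokenizer"
    return [
        str(tokenizer_dir / name)
        for name in REQUIRED_FILES
        if not (tokenizer_dir / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_tokenizer_files()
    if missing:
        print("Missing files:")
        for path in missing:
            print("  " + path)
    else:
        print("All tokenizer files are in place.")
```

An empty result means all three files were found; otherwise the returned paths show what still needs to be downloaded.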
- conda create -n py310 python=3.10 -y
- conda activate py310
- conda install cudatoolkit=11.8 -y
- pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
- wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
- pip install flash_attn-2.6.3+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl --no-build-isolation
- pip install -r requirements.txt
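After installation, a quick check that the pinned versions were actually picked up can save debugging time later. The sketch below uses only the standard library (`importlib.metadata`), so it does not need to import torch itself. The helper names `installed_version` and `check_environment` are hypothetical, and the distribution name `flash_attn` is assumed from the wheel file name above.

```python
from importlib import metadata

# Versions pinned in the installation steps above.
EXPECTED = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "torchaudio": "2.1.0",
    "flash_attn": "2.6.3",  # distribution name assumed from the wheel file name
}


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED):
    """Map each package name to (installed_version, matches_expected)."""
    report = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        # Wheels may carry a local suffix such as "+cu118"; compare the base version only.
        ok = found is not None and found.split("+")[0] == wanted
        report[name] = (found, ok)
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_environment().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={found} expected={EXPECTED[name]} [{status}]")
```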
The simplest code for inference (refer to demo_inference.py):
from inference_solver import FlexARInferenceSolver
from PIL import Image
import os
import numpy as np
import torch
import random

# Restrict inference to the first GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"


def set_seed(seed):
    # Seed every random number generator in use so that runs are reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


inference_solver = FlexARInferenceSolver(
    model_path="UVT_7B_448",  # path to your model
    precision="fp16",  # or "bf16"
    target_size=448,  # fixed at 448
)

max_out = 1  # number of outputs to generate, one seed per output
for i in range(max_out):
    set_seed(i)
    # Each entry pairs an instruction (ending with the image placeholder) with an empty answer.
    qas = [["Acknowledge the spatial structure and identify variations in light intensity, translating these into a gradient scale representing distances. Accentuate regions where light diminishes gradually, enhancing the perception of depth by dimming peripheral areas. Adjust the distribution of luminance to highlight the central vanishing point, converting detailed textures into smooth transitions of grayscale." + " <|image|>", None]]
    images = [Image.open("./demo_input/rain_1.jpg")]
    generated = inference_solver.generate(
        images=images,
        qas=qas,
        max_gen_len=4096,
        temperature=1.0,
        logits_processor=inference_solver.create_logits_processor(cfg=1., image_top_k=2048),
    )
    # generated[1] holds the generated images; save the first one.
    new_image = generated[1][0]
    new_image.save(f'./test_output_{i}.png', format='PNG')

The pre-tokenization stage tokenizes each data point, consisting of interleaved image and text, into a single sequence of integer tokens. After tokenization, the sequence is saved to disk for training-time use. Alongside the saved tokens, a JSON record file is generated to index all the saved token files. For faster tokenization, you may use multiple GPUs and dispatch different subsets of the data to them.
We provide pre-tokenization samples in './pre_tokenize'; for the commands to run, refer to the '.py' files in that folder. We also provide sample output files in './json/edit_resolution_448/Allweather'.
You can process your own dataset by following these stages, or download the dataset we provide on Hugging Face and write a simple pre-tokenization script.
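To make the flow concrete (tokenize, save each token sequence, write a JSON record file, split the work across workers), here is a minimal sketch under stated assumptions. It is not the repository's script: `pre_tokenize`, `tokenize_fn`, the `.pkl` file naming, and the record fields `id`, `file`, and `len` are all hypothetical. Check the '.py' files in './pre_tokenize' for the actual format.

```python
import json
import pickle
from pathlib import Path


def pre_tokenize(items, tokenize_fn, out_dir, rank=0, world_size=1):
    """Tokenize this worker's share of `items` and index the saved files.

    `tokenize_fn` is a placeholder for the real image/text tokenizer; it must
    return a list of integer token ids for one data point.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = []
    # Worker `rank` handles items rank, rank + world_size, rank + 2 * world_size, ...
    for idx in range(rank, len(items), world_size):
        tokens = tokenize_fn(items[idx])
        token_file = out_dir / f"{idx}.pkl"
        with open(token_file, "wb") as f:
            pickle.dump(tokens, f)
        records.append({"id": idx, "file": str(token_file), "len": len(tokens)})
    # One record file per worker; these can be concatenated afterwards.
    record_path = out_dir / f"record_rank{rank}.json"
    with open(record_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    return record_path
```

With several GPUs, one would start one process per device and pass a different `rank` to each, so that every data point is tokenized exactly once.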
For training, please refer to exps/7B.sh.
The evaluation stage requires an additional environment for ControlNet++.
We provide JSON files for the evaluation data in 'Evaluation_Data_JSON'. Follow these JSON files to generate images, then evaluate the results for each task. You can also use the instructions in the paper (cf. Appendix A) to generate your own evaluation data.
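As a starting point for an evaluation loop, the JSON files can be collected into a single dictionary before generating images. This sketch makes no assumption about the schema inside each file and simply loads whatever JSON it finds. The function name `load_eval_sets` is hypothetical, and the default path assumes the script runs from the repository root.

```python
import json
from pathlib import Path


def load_eval_sets(json_dir="Evaluation_Data_JSON"):
    """Load every *.json file under json_dir, keyed by its relative path."""
    root = Path(json_dir)
    eval_sets = {}
    for path in sorted(root.rglob("*.json")):
        with open(path, "r", encoding="utf-8") as f:
            eval_sets[path.relative_to(root).as_posix()] = json.load(f)
    return eval_sets
```

Each loaded entry can then be inspected to see which fields hold the instruction and the input image path before passing them to the inference code.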
Explanatory Instruction:
"Fill in all the empty outlines with rich colors that reflect vibrant tones, while redefining the shapes with smooth textures. Add layers of depth to the flat contours by enhancing brightness gradients in the sky, shadowing in the mountains, and intricate shades among the flowers. Reintroduce the sensation of open space and dimension by contrasting sharp objects with muted backgrounds and crisp details in the foreground."
Resolution:
448×448.
Explanatory Instruction:
"Slowly remove the rain falling from the sky in the image, still maintain the state of night, and the girl on the bridge is also still holding the umbrella, but readjust the light in the distance."
Limitations:
The model struggles to preserve smaller objects and environmental details.
Resolution:
448×448.
Explanatory Instruction:
"Increase the overall brightness to reveal details in dark areas while preserving highlights. Adjust the contrast to enhance the brightness differences between regions, making the structures and textures more distinct. Optimize color saturation to make previously dull colors more vibrant, such as the blue on the floor becoming more prominent. Apply denoising to reduce noise commonly found in low-light images, improving the overall quality. Ensure the final image appears natural while retaining the authentic style of the scene."
Limitations:
Controlling the intensity of lighting enhancement through language instructions is challenging, often resulting in significant deviations in the output.
Resolution:
448×448.
Explanatory Instruction:
"Remove the falling snow from the sky in the image, keep the other objects and snow in the image, still keep it dark, but pay attention to the adjustment of light behind the tree."
Limitations:
The second generated image struggles to retain nighttime details, while the third and fourth images exhibit poor performance in removing snow from the sky. Additionally, attempting to remove snow from the ground simultaneously can result in significant distortions.
Resolution:
448×448.
Explanatory Instruction:
"The image shows noticeable multiple visual overlaps of trees and buildings. I would like to remove visual overlaps and restore a clear, sharp image without blurring. Do not alter the main content and pay attention to adjusting the light."
Limitations:
The success rate of guiding the model's task-level zero-shot capability through language instructions is relatively low.
Resolution:
448×448.
Explanatory Instruction:
"Retain the distant clouds in the image while removing as much fog as possible. Attempt to restore the faintly visible sun in the distance, but ensure there is no strong sunlight. Focus on recovering the mountains and the nearby trees as much as possible."
Limitations:
Certain objects in the output are distorted.
Resolution:
448×448.