Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Our paper "Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization" is available at https://arxiv.org/abs/2412.18525.

📚Dataset (Explanatory-based Vision Tasks) | 📚Dataset (Terminological-based Vision Tasks) | 🤗 Model (UVT-7B-448)

Code Base

Our code builds on Chameleon, Lumina-mGPT and LLaMA2-Accessory.

⚙️ Installation

Chameleon VQ-VAE

Because the Chameleon implementation in transformers currently does not include the VQ-VAE decoder, please manually download the original VQ-VAE weights provided by Meta and place them in the following directory:

- ckpts/
    - chameleon/
        - tokenizer/
            - text_tokenizer.json
            - vqgan.yaml
            - vqgan.ckpt
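
Before running inference, it can help to confirm that the three files are where the code expects them. The following is a minimal sketch, not part of the repository: the function name `missing_tokenizer_files` is hypothetical, and it assumes the `ckpts/` directory sits under the root you pass in.

```python
from pathlib import Path

# File names are taken from the directory listing above.
EXPECTED_FILES = ("text_tokenizer.json", "vqgan.yaml", "vqgan.ckpt")


def missing_tokenizer_files(root="."):
    """Return the expected Chameleon tokenizer files that are absent under root."""
    # Assumption: ckpts/ lives directly under `root`.
    tokenizer_dir = Path(root) / "ckpts" / "chameleon" / "tokenizer"
    return [name for name in EXPECTED_FILES if not (tokenizer_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_tokenizer_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All Chameleon tokenizer files are in place.")
```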

Python Environment

conda create -n py310 python=3.10 -y
conda activate py310
conda install cudatoolkit=11.8 -y
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl --no-build-isolation
pip install -r requirements.txt
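
After installation, a quick way to see whether the pinned versions actually landed in the active environment is to query the package metadata. This is a sketch only: `check_versions` is a hypothetical helper, and the pins are copied from the commands above. It uses only the standard library, so it also runs when a package is missing.

```python
from importlib import metadata

# Versions pinned by the install commands above.
PINNED = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "torchaudio": "2.1.0",
    "flash_attn": "2.6.3",
}


def check_versions(pinned=PINNED):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu118" are ignored when comparing.
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: installed={installed} [{status}]")
```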

Simple Inference

The simplest code for inference (refer to demo_inference.py):

from inference_solver import FlexARInferenceSolver
from PIL import Image
import os
import torch
import random
import numpy as np  # required by set_seed below

# Restrict inference to the first GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

inference_solver = FlexARInferenceSolver(
    model_path = "UVT_7B_448", #path to your model
    precision="fp16", #bf16
    target_size=448, #fixed 448
)

max_out = 1
for i in range(max_out):
    set_seed(i)
    
    qas = [["Acknowledge the spatial structure and identify variations in light intensity, translating these into a gradient scale representing distances. Accentuate regions where light diminishes gradually, enhancing the perception of depth by dimming peripheral areas. Adjust the distribution of luminance to highlight the central vanishing point, converting detailed textures into smooth transitions of grayscale." + " <|image|>", None]]
    images = [Image.open("./demo_input/rain_1.jpg")]

    generated = inference_solver.generate(
        images=images,
        qas=qas,
        max_gen_len=4096,
        temperature=1.0,
        logits_processor=inference_solver.create_logits_processor(cfg=1., image_top_k=2048),
    )
    new_image = generated[1][0]
    new_image.save(f'./test_output_{i}.png', format='PNG')
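
The `qas` argument in the example above is a list holding one `[prompt, None]` pair, where the prompt is the instruction followed by the `<|image|>` placeholder. A small helper makes that structure explicit when trying several instructions. This is a sketch: `build_qas` is a hypothetical name, and it assumes one placeholder per input image, as in the example.

```python
IMAGE_PLACEHOLDER = "<|image|>"  # placeholder string used in the example above


def build_qas(instruction):
    """Build the single-turn `qas` list for one instruction and one input image.

    Assumption: the prompt carries exactly one placeholder for the one image
    passed through `images`, mirroring the example above.
    """
    if IMAGE_PLACEHOLDER in instruction:
        raise ValueError("instruction already contains the image placeholder")
    prompt = instruction.strip() + " " + IMAGE_PLACEHOLDER
    # The second element is None because the answer is what the model generates.
    return [[prompt, None]]
```

For example, `build_qas("Remove the raindrops and streaks.")` can be passed as `qas` in place of the hand-written list.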

Training

1. Pre-Tokenization

This stage tokenizes each data point, consisting of interleaved image and text, into a single sequence of integer tokens. After tokenization, the sequence is saved to disk for training-time usage. Together with the saved tokens, a JSON-formatted record file is also generated for indexing all the saved token files. For faster tokenization, you may use multiple GPUs and dispatch different subsets of data to them.
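
The storage pattern described above (one saved token file per data point, a JSON record indexing them, and data split across workers) can be sketched as follows. This is an illustration, not the repository's script: the function names, the pickle layout, the record fields (`id`, `file`, `len`) and the `{rank}-record.json` file name are all assumptions, and the tokenization itself is left out.

```python
import json
import pickle
from pathlib import Path


def shard(items, rank, world_size):
    """Give worker `rank` every `world_size`-th item so the workers split the data."""
    return items[rank::world_size]


def save_tokenized(samples, out_dir, rank=0):
    """Write one token file per sample plus a JSON record that indexes them.

    `samples` is an iterable of (sample_id, token_list) pairs. Producing the
    tokens from text and images is outside the scope of this sketch.
    """
    out_dir = Path(out_dir)
    files_dir = out_dir / "files"
    files_dir.mkdir(parents=True, exist_ok=True)

    records = []
    for sample_id, tokens in samples:
        path = files_dir / f"{sample_id}.pkl"
        with open(path, "wb") as f:
            pickle.dump({"token": list(tokens)}, f)  # hypothetical file layout
        records.append({"id": sample_id, "file": str(path), "len": len(tokens)})

    # One record file per worker; these can be merged after all workers finish.
    record_path = out_dir / f"{rank}-record.json"
    with open(record_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    return record_path
```

With two GPUs, worker 0 would process `shard(all_items, 0, 2)` and worker 1 would process `shard(all_items, 1, 2)`, each writing its own record file.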

We provide pre-tokenization samples in './pre_tokenize'. For the commands to run, refer to the 'py' files in that folder. We also provide sample output files; please refer to './json/edit_resolution_448/Allweather'.

You can process your own dataset by following these stages, or you can download the datasets we provide on Hugging Face and write a simple pre-tokenization script.

2. Training

Please refer to exps/7B.sh

Evaluation

Evaluation Environment

Additional environment for ControlNet++ is required for the evaluation stage.

Evaluation Data and Stage

We provide JSON files for the evaluation data in 'Evaluation_Data_JSON'. Follow these JSON files to generate the images, then evaluate them for each task. You can also use the instructions in the paper (cf. Appendix A) to generate your own evaluation data.
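
A first step when working with the evaluation data is simply to load the JSON files and see how many entries each holds. The sketch below makes no assumption about the schema inside the files. `load_eval_json` is a hypothetical helper, and it assumes the JSON files sit directly in the folder rather than in subfolders.

```python
import json
from pathlib import Path


def load_eval_json(folder="Evaluation_Data_JSON"):
    """Load every *.json file found directly in `folder`, keyed by file name."""
    loaded = {}
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            loaded[path.name] = json.load(f)
    return loaded


if __name__ == "__main__":
    for name, data in load_eval_json().items():
        size = len(data) if isinstance(data, (list, dict)) else 1
        print(f"{name}: {size} top-level entries")
```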

Samples for Zero-shot Capabilities on Vision Tasks (Relatively Simple Samples)

Instruction-level Zero-shot Samples (Depth Estimation)

Input Image	Unseen Explanatory Instruction	Output Image	Ground Truth
_{Input Image}	`Acknowledge the spatial structure and identify variations in light intensity, translating these into a gradient scale representing distances. Accentuate regions where light diminishes gradually, enhancing the perception of depth by dimming peripheral areas. Adjust the distribution of luminance to highlight the central vanishing point, converting detailed textures into smooth transitions of grayscale.`	_{Output Image}	_{Ground Truth}
_{Input Image}	`Start by analyzing the spatial layout to identify key structural elements. Gradually obscure less relevant details in the periphery to focus primarily on central depth. Increase contrast between light and dark areas to enhance perception of distance. Transition the textures into smooth gradients to reflect variations in depth, with a focus on enhanced luminosity for regions that are further away.`	_{Output Image}	_{Ground Truth}
_{Input Image}	`Convert each region’s color intensity to a grayscale value corresponding to its relative distance from the viewer, with nearer objects appearing lighter and those farther away darker. Gradually smooth transitions between these regions to reflect continuous depth variation. Remove textural details that do not affect perceived depth to create uniformity based on object proximity. Adjust overall brightness to highlight the spatial configuration without explicit texture representation.`	_{Output Image}	_{Ground Truth}

Instruction-level Zero-shot Samples (Surface Normal Estimation)

Input Image	Unseen Explanatory Instruction	Output Image	Ground Truth
_{Input Image}	`Translate the visible structures into a range of bright colors reflecting orientation angles, enhancing variations across surfaces.`	_{Output Image}	_{Ground Truth}
_{Input Image}	`Convert visual elements into a spectrum of colors that represent the directionality of surfaces, capturing the angles and orientations vividly.`	_{Output Image}	_{Ground Truth}
_{Input Image}	`Translate the scene into a colorful array to indicate surface orientations and angles.`	_{Output Image}	_{Ground Truth}

Instruction-level Zero-shot Samples (HED Boundary Detection)

Input Image	Unseen Explanatory Instruction	Output Image	Ground Truth
_{Input Image}	`Capture the outline and prominent edges of the cylindrical object and its surroundings, simplify everything by removing textures and detailed surfaces, and emphasize only the contours and distinct features while rendering a higher contrast between light and dark regions with sharp shifts in tones.`	_{Output Image}	_{Ground Truth}
_{Input Image}	`The vibrant scene with multiple colors and details could be simplified into a monochrome representation. First, focus on defining the high-contrast areas between light and dark in a much starker, black-and-white way. Then, it's important to emphasize contours and significant edges, such as the lines around the face, the dress’ folds, and the furniture's details, while downplaying softer gradients. Removing extraneous colors and textures leaves behind only the essential structural features that provide a more abstract, but recognizable silhouette and objects.`	_{Output Image}	_{Ground Truth}
_{Input Image}	`Begin by eliminating most of the intricate details and colors, transforming the vibrant elements into simplified outlines. Keep only the borders and defined structures, ensuring that the environment and figure take on an abstract form. Remove all texture, reducing the entire composition to minimal contrasting edges that define the shapes more than the details.`	_{Output Image}	_{Ground Truth}

Instruction-level Zero-shot Samples (Dehazing)

Input Image	Unseen Explanatory Instruction	Output Image	Ground Truth
_{Input Image}	`Gradually reduce atmospheric interference, allowing clearer visibility of buildings and sharpening the outlines. Enhance clarity and brightness to bring out the details within the cityscape, providing a crisper view.`	_{Output Image}	_{Ground Truth}
_{Input Image}	`Increasing the clarity by reducing haze, enhancing contrast, and deepening colors to give a sharper and more vibrant appearance to the scene.`	_{Output Image}	_{Ground Truth}
_{Input Image}	`To achieve clarity and vibrancy, adjust the brightness and reduce the foggy effect. Enhance the sharpness of the trees and structures, allowing their details to stand out against the clear blue sky.`	_{Output Image}	_{Ground Truth}

Instruction-level Zero-shot Samples (Deraining)

Input Image	Unseen Explanatory Instruction	Output Image	Ground Truth
_{Input Image}	`Imagine a scenario where rainfall suddenly stops and the water settles, clearing up the scene to enhance visibility and eliminate rain streaks.`	_{Output Image}	_{Ground Truth}
_{Input Image}	`Remove the raindrops and streaks, focusing on enhancing clarity and brightness to achieve a crisp and rain-free appearance in the environment.`	_{Output Image}	_{Ground Truth}
_{Input Image}	`Imagine the rainfall gradually lessening until the sky clears completely, leaving only the vibrant greenery and the birds in focus.`	_{Output Image}	_{Ground Truth}

Instruction-level Zero-shot Samples (Segmentation)

Input Image	Unseen Explanatory Instruction	Output Image	Ground Truth
_{Input Image}	`Apply a pink color overlay to bicycles, completely matching their shapes.`	_{Output Image}	_{Ground Truth}
_{Input Image}	`Apply a solid grey color tint to fully cover one banana instance.Paint over each stove with a powderblue color.`	_{Output Image}	_{Ground Truth}
_{Input Image}	`Spectral_r is the reversed version of Spectral, transitioning through red, yellow, green, and blue. Based on the previously defined colors, help me complete the segmentation task below. Color all instances of bucket, toilet using Spectral_r colors, following their contours precisely.`	_{Output Image}	_{Ground Truth}

Samples for Zero-shot Capabilities on Vision Tasks (Relatively Hard Samples)

Task-level & Instruction-level Zero-shot Samples (Canny-to-Image)

Explanatory Instruction: "Fill in all the empty outlines with rich colors that reflect vibrant tones, while redefining the shapes with smooth textures. Add layers of depth to the flat contours by enhancing brightness gradients in the sky, shadowing in the mountains, and intricate shades among the flowers. Reintroduce the sensation of open space and dimension by contrasting sharp objects with muted backgrounds and crisp details in the foreground." Resolution: 448×448.

Instruction-level Zero-shot Samples (Deraining)

Explanatory Instruction: "Slowly remove the rain falling from the sky in the image, still maintain the state of night, and the girl on the bridge is also still holding the umbrella, but readjust the light in the distance." Limitations: The model struggles to preserve smaller objects and environmental details. Resolution: 448×448.

Task-level & Instruction-level Zero-shot Samples (Low-light Enhancement)

Explanatory Instruction: "Increase the overall brightness to reveal details in dark areas while preserving highlights. Adjust the contrast to enhance the brightness differences between regions, making the structures and textures more distinct. Optimize color saturation to make previously dull colors more vibrant, such as the blue on the floor becoming more prominent. Apply denoising to reduce noise commonly found in low-light images, improving the overall quality. Ensure the final image appears natural while retaining the authentic style of the scene." Limitations: Controlling the intensity of lighting enhancement through language instructions is challenging, often resulting in significant deviations in the output. Resolution: 448×448.

Instruction-level Zero-shot Samples (Desnowing)

Explanatory Instruction: "Remove the falling snow from the sky in the image, keep the other objects and snow in the image, still keep it dark, but pay attention to the adjustment of light behind the tree." Limitations: The second generated image struggles to retain nighttime details, while the third and fourth images exhibit poor performance in removing snow from the sky. Additionally, attempting to remove snow from the ground simultaneously can result in significant distortions. Resolution: 448×448.

Task-level & Instruction-level Zero-shot Samples (Deblurring)

Explanatory Instruction: "The image shows noticeable multiple visual overlaps of trees and buildings. I would like to remove visual overlaps and restore a clear, sharp image without blurring. Do not alter the main content and pay attention to adjusting the light." Limitations: The success rate of guiding the model's task-level zero-shot capability through language instructions is relatively low. Resolution: 448×448.

Instruction-level Zero-shot Samples (Dehazing)

Explanatory Instruction: "Retain the distant clouds in the image while removing as much fog as possible. Attempt to restore the faintly visible sun in the distance, but ensure there is no strong sunlight. Focus on recovering the mountains and the nearby trees as much as possible." Limitations: It will cause distortions in certain objects. Resolution: 448×448.
