Janus pro mix dataset #13
/gemini
It looks like you might have intended to run a command. The If you need a list of all available commands, please use
/gemini review
/gemini summary
Code Review
This pull request introduces a MixDataset to combine text, VQA, and text-to-image datasets for mixed-task training. The implementation has a few areas for improvement in terms of correctness, efficiency, and adherence to best practices. My review includes suggestions to fix the dataset order, use more efficient numpy operations, and clean up minor code style issues.
| from janus.train.vqa_dataset import VqaDataset | ||
| class MixDataset: |
| def cumsum(sequence): | ||
| r, s = [], 0 | ||
| for e in sequence: | ||
| l = len(e) | ||
| r.append(l + s) | ||
| s += l | ||
| return r |
This manual implementation of cumsum can be simplified and made more efficient by using numpy.cumsum. Since numpy is already a dependency, this would be more idiomatic and concise. bisect works fine with NumPy arrays.
- def cumsum(sequence):
-     r, s = [], 0
-     for e in sequence:
-         l = len(e)
-         r.append(l + s)
-         s += l
-     return r
+ def cumsum(sequence):
+     return np.cumsum([len(e) for e in sequence])
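To illustrate how the cumulative sizes are typically consumed, here is a minimal sketch that maps a global index to a dataset and a local sample index with `bisect`. The `locate` helper and the toy `datasets` list are hypothetical and not part of the PR; the sketch only assumes that each dataset supports `len()`. Note that `np.cumsum` returns a NumPy array of NumPy integers rather than a list of Python ints, which is why the offset is cast with `int()` below.

```python
import bisect

import numpy as np


def cumsum(sequence):
    # Running total of dataset lengths, e.g. lengths [3, 2, 4] -> [3, 5, 9].
    return np.cumsum([len(e) for e in sequence])


def locate(cumulative_sizes, idx):
    """Map a global index to (dataset_idx, sample_idx). Hypothetical helper."""
    # bisect_right finds the first dataset whose cumulative size exceeds idx.
    dataset_idx = bisect.bisect_right(cumulative_sizes, idx)
    if dataset_idx == 0:
        sample_idx = idx
    else:
        # Subtract the number of samples in all preceding datasets.
        sample_idx = idx - int(cumulative_sizes[dataset_idx - 1])
    return dataset_idx, sample_idx


# Toy stand-ins for the text, VQA, and text-to-image datasets (assumed data).
datasets = [["t0", "t1", "t2"], ["v0", "v1"], ["g0", "g1", "g2", "g3"]]
sizes = cumsum(datasets)
```

With this layout, index 3 is the first sample of the second dataset, so `locate(sizes, 3)` yields dataset 1 and local index 0.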
| self.default_image_shape = default_image_shape | ||
| self.max_token_length = max_token_length | ||
| self.datasets = datasets | ||
| self.num_dataset = len(datasets) |
| # add image and image_seq_mask item to pure text for batching | ||
| if dataset_idx == 0: | ||
| image = np.zeros(self.default_image_shape, np.float32) | ||
| image_seq_mask = np.zeros((self.max_token_length), dtype=np.bool) |
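The np.bool alias was deprecated in NumPy 1.20 and removed in NumPy 1.24, where accessing it raises an AttributeError. NumPy 2.0 reintroduced np.bool as the name of the NumPy boolean type, so this line fails only on versions 1.24 through 1.26. The built-in Python bool is accepted as a dtype on every version, so use it instead for compatibility.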
- image_seq_mask = np.zeros((self.max_token_length), dtype=np.bool)
+ image_seq_mask = np.zeros((self.max_token_length), dtype=bool)
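As a small sketch of the padding step this comment touches, the snippet below builds the placeholder image and the all-False mask that a pure-text sample would carry so it can be batched with image samples. The `make_text_placeholders` helper and the shape and length values are assumptions for illustration, not values taken from the PR.

```python
import numpy as np


def make_text_placeholders(default_image_shape, max_token_length):
    """Return a dummy image and an empty image mask. Hypothetical helper."""
    # All-zero image so text-only samples have the same fields as image samples.
    image = np.zeros(default_image_shape, np.float32)
    # The built-in bool is a valid dtype on every NumPy version.
    image_seq_mask = np.zeros((max_token_length,), dtype=bool)
    return image, image_seq_mask


# Assumed values, chosen only to make the example concrete.
image, image_seq_mask = make_text_placeholders((1, 3, 384, 384), 1024)
```

Because the mask is all False, no token position in a pure-text sample is marked as an image position.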
| sample_weights = [] | ||
| assert len(sample_ratios) == len(datasets) | ||
| for i in range(len(sample_ratios)): | ||
| weight = sample_ratios[i] * len(mix_dataset) / len(datasets[i]) | ||
| sample_weights += [weight] * len(datasets[i]) |
The loop for creating sample_weights can be written more concisely and efficiently using NumPy operations. Using np.repeat is more idiomatic for this kind of task and can be faster for large datasets.
- sample_weights = []
- assert len(sample_ratios) == len(datasets)
- for i in range(len(sample_ratios)):
-     weight = sample_ratios[i] * len(mix_dataset) / len(datasets[i])
-     sample_weights += [weight] * len(datasets[i])
+ assert len(sample_ratios) == len(datasets)
+ weights_per_dataset = np.array([
+     r * len(mix_dataset) / len(d) for r, d in zip(sample_ratios, datasets)
+ ])
+ counts = [len(d) for d in datasets]
+ sample_weights = np.repeat(weights_per_dataset, counts).tolist()
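The following self-contained sketch shows what the `np.repeat` version computes on toy data. It assumes that `len(mix_dataset)` equals the sum of the individual dataset lengths, which the excerpt does not confirm, and `build_sample_weights` is a hypothetical wrapper name. Each sample's weight is its dataset's ratio divided by the dataset's share of the total, so the summed weight of each dataset ends up proportional to its ratio.

```python
import numpy as np


def build_sample_weights(sample_ratios, datasets):
    """Per-sample weights for ratio-based sampling. Hypothetical wrapper."""
    assert len(sample_ratios) == len(datasets)
    counts = [len(d) for d in datasets]
    # Assumption: the mixed dataset's length is the sum of its parts.
    total = sum(counts)
    # One weight per dataset; an empty dataset would divide by zero here.
    weights_per_dataset = np.array(
        [r * total / n for r, n in zip(sample_ratios, counts)]
    )
    # Repeat each dataset's weight once per sample in that dataset.
    return np.repeat(weights_per_dataset, counts).tolist()


# Toy datasets of sizes 2, 4, and 4 with target ratios 0.5, 0.25, 0.25.
toy_datasets = [[0] * 2, [0] * 4, [0] * 4]
weights = build_sample_weights([0.5, 0.25, 0.25], toy_datasets)
```

Here the two samples of the first dataset each get weight 2.5, so together they account for half of the total weight, matching the 0.5 ratio.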
Summary of Changes
This pull request introduces a foundational component for multi-modal training within the Janus Pro project. It provides a robust mechanism to combine and sample from disparate datasets (pure text, VQA, and text-to-image) into a single, coherent data stream for MindSpore-based models. This enhancement is crucial for training models that require exposure to a wide variety of data types to improve their generalization and performance across different tasks.
What does this PR do?
Add janus pro mix dataset