Per-feature normalization by julianmack · Pull Request #23 · MyrtleSoftware/myrtlespeech

julianmack · 2020-01-15T16:32:54Z

V. small PR adding per-feature normalization

julianmack · 2020-01-16T16:27:44Z

  pre_process_step {
    stage: TRAIN_AND_EVAL;
    standardize {
+      norm_type: ALL_FEATURES;


i.e. so that behaviour doesn't change, although setting this to PER_FEATURE will likely be better.

samgd

Will per-tensor ("all_features") normalisation ever be preferable over per-feature (computed over a single batch)? If not it could be removed entirely to simplify the code base.

It also might be interesting to compare computing the mean and std over the entire dataset rather than each batch?

julianmack · 2020-01-24T13:05:06Z

> It also might be interesting to compare computing the mean and std over the entire dataset rather than each batch?

The current implementation actually computes the mean and std over each sample, not over each batch. Each feature is normalised over time in the sequence, which gives a loose approximation to per-speaker normalisation (where each sample is assumed to be a different speaker).
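To make the distinction between the two norm types concrete, here is a minimal sketch applied to a single sample. Plain Python lists stand in for tensors so the sketch has no dependencies. The function names, the `eps` term and the use of the population standard deviation are assumptions for illustration, not the code in this PR.

```python
from statistics import fmean, pstdev
from typing import List

Features = List[List[float]]  # one sample, shape [n_features][n_frames]


def standardize_all_features(x: Features, eps: float = 1e-8) -> Features:
    """ALL_FEATURES: a single mean/std over every value in the sample."""
    flat = [v for row in x for v in row]
    mean, std = fmean(flat), pstdev(flat)
    return [[(v - mean) / (std + eps) for v in row] for row in x]


def standardize_per_feature(x: Features, eps: float = 1e-8) -> Features:
    """PER_FEATURE: a separate mean/std for each feature, taken over time."""
    out = []
    for row in x:
        mean, std = fmean(row), pstdev(row)
        out.append([(v - mean) / (std + eps) for v in row])
    return out


# Two features on very different scales, four frames each.
sample = [[1.0, 2.0, 3.0, 4.0], [100.0, 200.0, 300.0, 400.0]]
per_feature = standardize_per_feature(sample)
all_features = standardize_all_features(sample)
```

With PER_FEATURE both rows end up on the same scale, whereas with ALL_FEATURES the large-valued feature dominates the statistics and the small-valued one is pushed to one side of zero.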

So there are three options:

1. Per sample (current)
2. Per batch
3. Per whole dataset

I've had a preliminary think about how we might implement 2. and 3. I think it's worth discussing which we would like, because I think both will require relatively large changes to the codebase.

2

Should be easy-ish to implement - although it will require re-thinking the preprocessing pipeline as all transforms are applied per-sample at the moment in the dataset class:

class LibriSpeech(Dataset):
    ...
    def __getitem__(self, index: int) -> Tuple[torch.Tensor, str]:
        ...
        if self._transform is not None:
            audio = self._transform(audio)        # <- all preprocessing applied here

I could move some (all?) of the preprocessing steps to the seq_to_seq_collate_fn to achieve normalization per batch?
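A rough sketch of what per-batch normalisation in a collate function might look like. The name `collate_with_batch_norm` is hypothetical and plain lists stand in for tensors; the real seq_to_seq_collate_fn also has to handle targets and sequence lengths, which are left out here. The statistics are computed over the unpadded frames so that padding does not bias them.

```python
from statistics import fmean, pstdev
from typing import List

Features = List[List[float]]  # one sample, shape [n_features][n_frames]


def collate_with_batch_norm(
    batch: List[Features], eps: float = 1e-8, pad_value: float = 0.0
) -> List[Features]:
    n_features = len(batch[0])
    max_len = max(len(sample[0]) for sample in batch)

    # Per-feature statistics pooled over every frame of every sample.
    means, stds = [], []
    for f in range(n_features):
        pooled = [v for sample in batch for v in sample[f]]
        means.append(fmean(pooled))
        stds.append(pstdev(pooled))

    out = []
    for sample in batch:
        rows = []
        for f, row in enumerate(sample):
            normed = [(v - means[f]) / (stds[f] + eps) for v in row]
            # Pad only after normalising, up to the longest sample.
            rows.append(normed + [pad_value] * (max_len - len(row)))
        out.append(rows)
    return out


batch = [
    [[1.0, 2.0, 3.0]],  # sample 0: 1 feature, 3 frames
    [[5.0]],            # sample 1: 1 feature, 1 frame
]
padded = collate_with_batch_norm(batch)
```

One consequence of this route is that a sample's normalised values depend on which other samples share its batch, so shuffling changes the inputs the model sees.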

3

Should be O.K. to implement, but I think it will require running the train_loader over all samples at the beginning of training. It's not feasible to precompute and hardcode these, as we would need them for each combination of {dataset, subset, FeatExtractionType, number_features, win_len, hop_len} etc. This would add quite a lot of complexity: even if we just wanted to run evaluation on a model, it would still be necessary to build and run the train loaders to get the mean/std.
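The pass over the training data could accumulate per-feature sums and squared sums, so the whole dataset never needs to be held in memory. This is only a sketch: `dataset_mean_std` is a hypothetical helper, and any iterable of samples stands in for the train loader.

```python
import math
from typing import Iterable, List, Tuple

Features = List[List[float]]  # one sample, shape [n_features][n_frames]


def dataset_mean_std(
    samples: Iterable[Features],
) -> Tuple[List[float], List[float]]:
    total: List[float] = []
    total_sq: List[float] = []
    count = 0  # number of frames seen so far
    for sample in samples:
        if not total:
            total = [0.0] * len(sample)
            total_sq = [0.0] * len(sample)
        for f, row in enumerate(sample):
            total[f] += sum(row)
            total_sq[f] += sum(v * v for v in row)
        count += len(sample[0])
    if count == 0:
        raise ValueError("no samples")
    means = [s / count for s in total]
    # Var = E[x^2] - E[x]^2; clamp tiny negatives caused by rounding.
    stds = [
        math.sqrt(max(sq / count - m * m, 0.0))
        for sq, m in zip(total_sq, means)
    ]
    return means, stds


means, stds = dataset_mean_std([
    [[1.0, 2.0], [10.0, 10.0]],  # 2 features, 2 frames
    [[3.0], [10.0]],             # 2 features, 1 frame
])
```

The sum-of-squares form is simple but can lose precision when the mean is large relative to the spread; Welford's online algorithm is a more numerically stable alternative.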

A way of avoiding this eval-time faff could be to get the model to remember the averages while self.training == True, effectively adding a normalisation layer at the start of every model.
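The idea of the model remembering the statistics could be sketched as a small stateful layer. The class name `RunningStandardize` and its interface are assumptions, and plain lists stand in for tensors; in PyTorch this would more naturally be an `nn.Module` that keeps the statistics in registered buffers so that they are saved with the model's state dict.

```python
import math
from typing import List, Tuple

Features = List[List[float]]  # one sample, shape [n_features][n_frames]


class RunningStandardize:
    def __init__(self, n_features: int, eps: float = 1e-8) -> None:
        self.training = True
        self.eps = eps
        self.count = 0  # number of frames accumulated
        self.total = [0.0] * n_features
        self.total_sq = [0.0] * n_features

    def _stats(self, f: int) -> Tuple[float, float]:
        mean = self.total[f] / self.count
        var = max(self.total_sq[f] / self.count - mean * mean, 0.0)
        return mean, math.sqrt(var)

    def __call__(self, x: Features) -> Features:
        if self.training:
            # Only update the stored statistics while training.
            for f, row in enumerate(x):
                self.total[f] += sum(row)
                self.total_sq[f] += sum(v * v for v in row)
            self.count += len(x[0])
        if self.count == 0:
            raise RuntimeError("no statistics collected yet")
        out = []
        for f, row in enumerate(x):
            mean, std = self._stats(f)
            out.append([(v - mean) / (std + self.eps) for v in row])
        return out


layer = RunningStandardize(n_features=1)
layer([[0.0, 2.0]])      # training: stored statistics are updated
layer.training = False
frozen = layer([[3.0]])  # eval: stored statistics are used but not updated
```

Note that in this sketch early training samples are normalised with statistics from only a few frames, so the inputs drift until the averages settle.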

conclusion

If we're not sure which we expect to be best, it might be worth running some training experiments before proceeding with a proper implementation.

and also

> Will per-tensor ("all_features") normalisation ever be preferable over per-feature (computed over a single batch)? If not it could be removed entirely to simplify the code base.

Yes, I agree: it will never be better. I'll remove it once we decide on a route above.

julianmack added 2 commits January 13, 2020 22:07

Added per-feature normalization

2567fb0

Added proto building for per_feature norm

e68328a

julianmack requested a review from samgd January 15, 2020 16:33

julianmack commented Jan 16, 2020


Updated docstrings and added .detach()

834df57

samgd suggested changes Jan 22, 2020


Updated docstrings

3f6dfd6

julianmack requested a review from samgd January 24, 2020 14:51

julianmack wants to merge 4 commits into master from per_feat_norm