From data till model - Questions about Brush pipeline #22

folivetti (Collaborator) · 2023-03-20T15:58:45Z

Hi everyone,

I was reading the source code that describes the dataset type and the supported data types. I have some questions concerning the main objective of Brush (and maybe some suggestions, depending on the answers).
So, as I understand it, one of the use cases of Brush is, given medical data (as described in OMOP), to return an interpretable model.
Is Brush supposed to perform data extraction and processing (e.g., doing what https://github.com/OHDSI/FeatureExtraction does) and apply feature transformations (e.g., encoding categorical features) while building a model?
So, let's suppose we have a data set with 3 features, x1, x2, and x3, where x1 and x2 are numerical and x3 is categorical. Let's say we have different methods to encode categorical features: f1 and f2. In this case, the GP can return a model like x1 * f2(x3) + cos(x2 + f1(x3)). In other words, would the GP be responsible for applying the appropriate transformation, and could it even use different transformations for the same feature in the same expression?

If that's the idea, maybe the data types will need to be modified to take that into account. For example, the dataset should be a heterogeneous list with the type information of each feature, and the main algorithm must keep track of the types to ensure the expression tree is correct (this idea can be borrowed from STGP, strongly typed genetic programming).
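To make the STGP idea concrete, here is a minimal sketch of type tracking over an expression tree. Everything in it is an assumption for illustration: `DType`, `Node`, `leaf`, `op_node`, `infer_type`, and the operator names are hypothetical and are not Brush's API. The schema plays the role of the heterogeneous list of feature types, and `f1`/`f2` are treated as encoders from categorical to float.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical value types for illustration; not Brush's actual type system.
enum class DType { Float, Categorical };

// Expression-tree node: a feature leaf (op == "x") or an operator over children.
struct Node {
    std::string op;
    int feature;                  // column index for leaves, -1 for operators
    std::vector<Node> children;
};

Node leaf(int feature) {
    assert(feature >= 0);
    return Node{"x", feature, {}};
}

Node op_node(std::string op, std::vector<Node> children) {
    return Node{std::move(op), -1, std::move(children)};
}

// Returns the output type of the tree, or nullopt if the tree is ill-typed.
// schema[i] is the declared type of feature i.
std::optional<DType> infer_type(const Node& n, const std::vector<DType>& schema) {
    if (n.op == "x") {
        if (n.feature < 0 || n.feature >= static_cast<int>(schema.size()))
            return std::nullopt;
        return schema[n.feature];
    }
    std::vector<DType> args;
    for (const Node& child : n.children) {
        std::optional<DType> t = infer_type(child, schema);
        if (!t) return std::nullopt;   // propagate type errors upward
        args.push_back(*t);
    }
    // Assumed encoders f1, f2: Categorical -> Float.
    if (n.op == "f1" || n.op == "f2") {
        if (args.size() == 1 && args[0] == DType::Categorical) return DType::Float;
        return std::nullopt;
    }
    // Binary arithmetic: Float x Float -> Float.
    if (n.op == "add" || n.op == "mul") {
        if (args.size() == 2 && args[0] == DType::Float && args[1] == DType::Float)
            return DType::Float;
        return std::nullopt;
    }
    // Unary math: Float -> Float.
    if (n.op == "cos") {
        if (args.size() == 1 && args[0] == DType::Float) return DType::Float;
        return std::nullopt;
    }
    return std::nullopt;               // unknown operator
}

// Builds x1 * f2(x3) + cos(x2 + f1(x3)) using zero-based feature indices.
Node example_tree() {
    Node lhs = op_node("mul", {leaf(0), op_node("f2", {leaf(2)})});
    Node rhs = op_node("cos", {op_node("add", {leaf(1), op_node("f1", {leaf(2)})})});
    return op_node("add", {lhs, rhs});
}
```

With the schema {Float, Float, Categorical}, `infer_type(example_tree(), schema)` yields `Float`, while a tree such as `mul(x1, x3)` is rejected because a raw categorical feature reaches an arithmetic operator. A search algorithm could use the same check to restrict which mutations and crossovers are allowed.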

Let me know if I am diverging too much from the original idea 😆

lacava (Maintainer) · 2023-03-24T22:12:50Z

Hi @folivetti

Brush already supports multiple data types, which are stored as a std::variant called State:

brush/src/types.h

Line 143 in 80370be

State;

It includes categorical (integers), float, and boolean arrays, matrices, and timeseries.
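For readers unfamiliar with the pattern, the sketch below shows how a `std::variant` of column types can be inspected with `std::visit`. It is a simplified stand-in, not Brush's `State`: the aliases `BoolCol`, `IntCol`, `FloatCol`, `Column` and the helper functions are hypothetical, and plain `std::vector` is assumed in place of whatever array types Brush actually stores.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the alternatives of a State-like variant.
using BoolCol  = std::vector<bool>;
using IntCol   = std::vector<int>;    // integers double as categorical codes
using FloatCol = std::vector<float>;
using Column   = std::variant<BoolCol, IntCol, FloatCol>;

// Dispatch on the alternative currently held by the variant.
std::string type_name(const Column& c) {
    return std::visit([](const auto& v) -> std::string {
        using T = std::decay_t<decltype(v)>;
        if constexpr (std::is_same_v<T, BoolCol>) return "bool";
        else if constexpr (std::is_same_v<T, IntCol>) return "categorical";
        else return "float";
    }, c);
}

// Works for every alternative because each one exposes size().
std::size_t n_samples(const Column& c) {
    return std::visit([](const auto& v) -> std::size_t { return v.size(); }, c);
}

// Typed access: the caller must already know the column holds floats.
const FloatCol& as_float(const Column& c) {
    assert(std::holds_alternative<FloatCol>(c));
    return std::get<FloatCol>(c);
}
```

A dataset can then be a `std::vector<Column>` (or a map from feature name to `Column`), which gives a heterogeneous collection whose per-feature type is recoverable at run time.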

The operators are also typed. The types of the operators are specified using a struct called Signatures: https://github.com/cavalab/brush/blob/master/src/program/signatures.h

Take for example the SplitOn operator, which splits the samples on its first argument and returns the values of the second or third argument, respectively. It is currently defined to work on arrays of floating or integer types:

brush/src/program/signatures.h

Line 360 in 80370be

struct Signatures<NodeType::SplitOn>{

Regarding the feature extraction operations you shared, I would like Brush to support as many as possible, with the constraint that, if we want to optimize parameters, the operators need to be differentiable.

One thing I would like to do is support a platform-independent data schema like Apache Arrow. I've been thinking about updating the data structure to work with Arrow so we have good interoperability.

folivetti (Collaborator, Author) · 2023-03-28T11:53:59Z

Nice! (I looked all over types.h before starting this discussion and didn't manage to spot the variant, dunno why :-) )

I'll have a look and try to think about the different possible operations and how to optimize their arguments, if needed. One thing that comes to mind with SplitOn is that maybe it should work only with binary and categorical (integer) features. The reason is that you don't actually optimize the argument for these types: there's no need to optimize the binary case, and the categorical value can be left to the evolutionary process, as it comes from a limited set of values. For numerical features we can have a SplitOnLT that splits when the value of a certain feature is less than a threshold (and this threshold should be optimized).
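A minimal sketch of the two proposed operators, under the assumption that each branch has already been evaluated to one value per sample. The function names `split_on_lt` and `split_on_eq` and their signatures are hypothetical and only illustrate the semantics; they are not Brush's SplitOn.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Numerical split: sample i takes the left value when feature[i] < threshold.
// The threshold is the parameter that would be optimized.
std::vector<double> split_on_lt(const std::vector<double>& feature, double threshold,
                                const std::vector<double>& left,
                                const std::vector<double>& right) {
    assert(feature.size() == left.size() && feature.size() == right.size());
    std::vector<double> out(feature.size());
    for (std::size_t i = 0; i < feature.size(); ++i)
        out[i] = feature[i] < threshold ? left[i] : right[i];
    return out;
}

// Categorical split: sample i takes the left value when feature[i] == category.
// The category comes from a finite set, so it can be left to the search.
std::vector<double> split_on_eq(const std::vector<int>& feature, int category,
                                const std::vector<double>& left,
                                const std::vector<double>& right) {
    assert(feature.size() == left.size() && feature.size() == right.size());
    std::vector<double> out(feature.size());
    for (std::size_t i = 0; i < feature.size(); ++i)
        out[i] = feature[i] == category ? left[i] : right[i];
    return out;
}
```

Keeping the two as separate operators makes the difference explicit in the signature: one carries a continuous parameter, the other a discrete choice.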

I'm now trying to identify the many reasons we would split the data using a threshold value. This could help us find some relevant literature. One reason is that the phenomenon we are trying to model is naturally piecewise: there's an abrupt change at some point, there are different treatments involved, etc. These two papers seem to be a nice starting point:

https://esajournals.onlinelibrary.wiley.com/doi/pdf/10.1890/02-0472
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0276755

I'll also have a look at Donald Watts's papers, since he's one of the main authors in the nonlinear regression field.

Another reason is that we want to generate two (or more) simpler models. In this situation you want to fit the two branches of the tree with different parts of the dataset, but to do that adequately you must know the best split (chicken-egg situation). One idea is to fit both branches with the whole data using a robust loss function (e.g., soft L1 loss). After fitting, select the threshold that minimizes the combined prediction error of both models (this can be done in O(n) once the samples are sorted by the split feature), and refit both branches on their own split using least squares.
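The threshold-selection step can be sketched as a single pass over running sums. The sketch assumes the samples are already sorted by the split feature and that the per-sample errors of the two whole-data fits are available. `Partition` and `best_partition` are hypothetical names, and the robust fit and the final least-squares refit are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Partition {
    std::size_t n_left;  // number of sorted samples routed to the left model
    double error;        // combined error of both models under that routing
};

// x_sorted: split feature in ascending order.
// err_left[i], err_right[i]: error of each model on sample i (e.g. squared residual).
Partition best_partition(const std::vector<double>& x_sorted,
                         const std::vector<double>& err_left,
                         const std::vector<double>& err_right) {
    const std::size_t n = x_sorted.size();
    assert(err_left.size() == n && err_right.size() == n);

    // Start with every sample assigned to the right model.
    double current = 0.0;
    for (double e : err_right) current += e;
    Partition best{0, current};

    for (std::size_t k = 1; k <= n; ++k) {
        // Move sample k-1 from the right model to the left model.
        current += err_left[k - 1] - err_right[k - 1];
        // A threshold cannot separate two samples with equal feature values.
        if (k < n && x_sorted[k] == x_sorted[k - 1]) continue;
        if (current < best.error) best = Partition{k, current};
    }
    return best;
}
```

For `n_left` strictly between 0 and n, any value between `x_sorted[n_left - 1]` and `x_sorted[n_left]` works as the threshold, for example their midpoint. If the data are not pre-sorted, the sort adds an O(n log n) step before the linear scan.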

2 replies

lacava Mar 28, 2023
Maintainer

One of the design goals I had with Brush is to not leave any program parameters up to evolution. So currently SplitOn behaves like classical CART decision trees when choosing a splitting threshold. In the binary case, there is no optimization, so this leaves an option for global search to find an appropriate expression. However, for categorical and floating point types, the best threshold is found heuristically by minimizing the Gini impurity of the split (for classification problems) or its variance (for regression). Currently all this is handled by Brush; see here and here
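For reference, the textbook CART criterion for regression can be sketched as below. This is not Brush's implementation: `best_variance_split` is a hypothetical name, and the code simply picks the threshold that minimizes the summed squared deviation from the mean on each side, using running sums of y and y². The Gini case for classification follows the same scan with class counts instead of sums.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Returns a threshold t such that samples with x < t go left.
// If all x values are equal, no split exists and the smallest x is returned.
double best_variance_split(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const std::size_t n = x.size();

    // Visit samples in ascending order of x without reordering the inputs.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return x[a] < x[b]; });

    double total = 0.0, total_sq = 0.0;
    for (double v : y) { total += v; total_sq += v * v; }

    double left = 0.0, left_sq = 0.0;
    double best_sse = std::numeric_limits<double>::infinity();
    double best_t = x[order[0]];

    for (std::size_t k = 1; k < n; ++k) {
        const double yk = y[order[k - 1]];
        left += yk;
        left_sq += yk * yk;
        const double xa = x[order[k - 1]], xb = x[order[k]];
        if (xa == xb) continue;  // equal values cannot be separated
        const double right = total - left, right_sq = total_sq - left_sq;
        // Sum of squared deviations per side: sum(y^2) - (sum y)^2 / count.
        const double sse = (left_sq - left * left / static_cast<double>(k)) +
                           (right_sq - right * right / static_cast<double>(n - k));
        if (sse < best_sse) { best_sse = sse; best_t = 0.5 * (xa + xb); }
    }
    return best_t;
}
```

On x = {1, 2, 3, 10, 11, 12} with y = {1, 1, 1, 5, 5, 5}, the scan selects the midpoint between 3 and 10, since both sides are then constant.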

The "chicken-egg" situation you describe is one that I think we're well-suited to address. Namely, one of the shortcomings of classic decision trees is that they are greedily constructed. In our case, we are using a global search algorithm, so we should be able to optimize the orderings of splits/models via the GA. A second shortcoming is that they only learn axis-aligned splits on features, whereas Brush programs can be more expressive.

Thanks for the references. I think there are a lot of applications of this type of model. My original motivation happened during my PhD, working on a fluid-structure interaction problem called vortex-induced vibration (https://en.wikipedia.org/wiki/Vortex-induced_vibration). Basically, when you are in a different laminar flow region, the behavior of the system changes drastically. So piece-wise models can be quite important. It's also true in computable phenotyping (https://sites.duke.edu/rethinkingclinicaltrials/ehr-phenotyping/).

As to whether to learn the split threshold or the weights first, your idea is interesting. But it seems costly. Imagine we have a tree with multiple splits; we could get into a situation where we have to manage a lot of dependencies. It seems simpler to me to learn the split first, then the parameters, but I'm open to ideas. It's true, for example, that we don't necessarily need the split to produce "flat" regions, since we have more flexibility downstream (recall decision tree leaves are simple constants). For example, for regression, we could instead reward the split that linearizes the output on each side.

folivetti Mar 29, 2023
Collaborator Author

About the integer type = categorical: I only see one problem with it. If we want to model ordinal features as well, we should either support the full set of comparison operators for integers (so an integer is either categorical or ordinal) or constrain them to equality only. Maybe a better approach is to create two integer types, one for categorical and another for ordinal.
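One way to picture the two-type option is with thin wrapper types whose operator sets differ. This is only a sketch: `Categorical`, `Ordinal`, `make_ordinal`, and `has_less` are hypothetical names, and nothing here reflects how Brush defines its types.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical wrappers around integer codes.
struct Categorical { int code; };  // unordered labels
struct Ordinal { int rank; };      // ordered levels

// Categorical values only support equality tests.
constexpr bool operator==(Categorical a, Categorical b) { return a.code == b.code; }
constexpr bool operator!=(Categorical a, Categorical b) { return !(a == b); }

// Ordinal values support the full set of comparisons.
constexpr bool operator==(Ordinal a, Ordinal b) { return a.rank == b.rank; }
constexpr bool operator!=(Ordinal a, Ordinal b) { return !(a == b); }
constexpr bool operator<(Ordinal a, Ordinal b) { return a.rank < b.rank; }
constexpr bool operator>(Ordinal a, Ordinal b) { return b < a; }
constexpr bool operator<=(Ordinal a, Ordinal b) { return !(b < a); }
constexpr bool operator>=(Ordinal a, Ordinal b) { return !(a < b); }

// Assumed convention for this sketch: ranks are non-negative.
inline Ordinal make_ordinal(int rank) {
    assert(rank >= 0);
    return Ordinal{rank};
}

// Detects at compile time whether T supports operator<.
template <class T, class = void>
struct has_less : std::false_type {};
template <class T>
struct has_less<T, std::void_t<decltype(std::declval<T>() < std::declval<T>())>>
    : std::true_type {};

static_assert(has_less<Ordinal>::value, "ordinal values can be ordered");
static_assert(!has_less<Categorical>::value, "categorical values cannot be ordered");
```

With this separation, an operator like a less-than split would simply not type-check on a categorical feature, so the constraint is enforced by the type system rather than by convention.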

About optimizing the split, I'll think a little bit more about my idea. My hunch is that we can turn it into an "evaluate everything and find the optimal partition" problem.
