Hi @folivetti,

Brush already supports multiple data types, which are stored as a std::variant called Line 143 in 80370be It includes categorical (integers), float, and boolean arrays, matrices, and timeseries.

The operators are also typed. The types of the operators are specified using a struct called Take for example the brush/src/program/signatures.h Line 360 in 80370be

Regarding the feature extraction operations you shared, I would like Brush to support as many as possible, with one constraint: if we want to optimize parameters, the operators need to be differentiable.

One thing I would like to do is support a platform-independent data schema like Arrow. I've been thinking about updating the data structure to work with Arrow so that we have good interoperability.
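To make the idea of typed columns held in a single `std::variant` concrete, here is a minimal sketch. It assumes plain `std::vector` storage and uses the hypothetical names `ArrayB`, `ArrayI`, `ArrayF`, `Column`, `kind_of`, and `length_of`. None of these are taken from the Brush source, which may use different containers and names, and matrices and timeseries are left out for brevity.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical stand-ins for typed feature columns (not Brush's actual types).
using ArrayB = std::vector<bool>;   // boolean feature
using ArrayI = std::vector<int>;    // categorical feature stored as integers
using ArrayF = std::vector<float>;  // continuous feature

// One column of a dataset can hold any of the array types above.
using Column = std::variant<ArrayB, ArrayI, ArrayF>;

// Report which kind of data a column holds by visiting the variant.
std::string kind_of(const Column& c) {
    return std::visit([](const auto& a) -> std::string {
        using T = std::decay_t<decltype(a)>;
        if constexpr (std::is_same_v<T, ArrayB>) return "bool";
        else if constexpr (std::is_same_v<T, ArrayI>) return "categorical";
        else return "float";
    }, c);
}

// Number of samples in a column, regardless of its element type.
std::size_t length_of(const Column& c) {
    return std::visit([](const auto& a) { return a.size(); }, c);
}
```

With this layout, a dataset could be a list of `Column` values, and an operator could inspect `kind_of` (or the variant index) to decide whether it is allowed to consume a given feature.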
-
Nice! (I looked all over types.h before starting this discussion and didn't manage to spot the I'll have a look and try to think about the different possible operations and how to optimize their arguments, if needed.

One thing that comes to mind with the I'm now trying to identify the many reasons we would split the data using a threshold value. This could help us find some relevant literature.

One reason is that the phenomenon we are trying to model is naturally piecewise: there's an abrupt change at some point, there are different treatments involved, etc. These two papers seem to be a nice starting point: https://esajournals.onlinelibrary.wiley.com/doi/pdf/10.1890/02-0472 I'll also have a look at Donald Watts's papers, since he's one of the main authors in the nonlinear regression field.

Another reason is that we want to generate two (or more) simpler models. In this situation, you want to fit the two branches of the tree with different parts of the dataset, but to do that adequately you must know the best split (a chicken-and-egg situation). One idea is to fit both branches with the whole data using a robust loss function (e.g., soft l1 loss). After fitting, select the threshold that minimizes the combined prediction error of both models (this can be done in O(n)), and then refit each branch on its own side of the split using least squares.
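The threshold-selection step described above can be sketched as a single pass over the samples. This is only an illustration under stated assumptions: the samples are already sorted in ascending order of the split feature (sorting first would cost O(n log n)), both branch models have already been fitted on the whole data, and `err_left[i]` / `err_right[i]` are their squared errors on sample `i`. The names `Split` and `best_split` are hypothetical and not part of Brush.

```cpp
#include <cstddef>
#include <vector>

// Result of the scan: the first k sorted samples go to the left branch,
// the remaining ones go to the right branch.
struct Split {
    std::size_t k;
    double cost;  // combined squared error of both branches for this k
};

// Assumes err_left and err_right have the same length and that sample i
// in both vectors refers to the i-th smallest value of the split feature.
Split best_split(const std::vector<double>& err_left,
                 const std::vector<double>& err_right) {
    const std::size_t n = err_left.size();

    // Start with k = 0: every sample is predicted by the right-branch model.
    double cost = 0.0;
    for (double e : err_right) cost += e;
    Split best{0, cost};

    // Move one sample at a time from the right branch to the left branch,
    // updating the combined error in O(1) per step.
    for (std::size_t k = 1; k <= n; ++k) {
        cost += err_left[k - 1] - err_right[k - 1];
        if (cost < best.cost) best = Split{k, cost};
    }
    return best;
}
```

Given the returned `k`, a threshold value could be taken anywhere between the feature values of sorted samples `k - 1` and `k`. Samples with identical feature values cannot actually be separated by a threshold, so a fuller version would only consider positions where the feature value changes.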
-
Hi everyone,
I was reading the source code that describes the dataset type and the supported data types. I have some questions concerning the main objective of Brush (and maybe some suggestions, depending on the answers).
So, as I understand it, one of the use cases of Brush is to take medical data (as described in OMOP) and return an interpretable model.
Is Brush supposed to perform data extraction and processing (e.g., doing what https://github.com/OHDSI/FeatureExtraction does) and apply feature transformations (e.g., encoding categorical features) while building a model?
So, let's suppose we have a data set with three features, x1, x2, and x3, where x1 and x2 are numerical and x3 is categorical. Let's say we have different methods to encode categorical features: f1 and f2. In this case, the GP can return a model like:
x1 * f2(x3) + cos(x2 + f1(x3)). In other words, would the GP be responsible for applying the appropriate transformation, and could it even use different transformations for the same feature in the same expression?
If that's the idea, maybe the data types will need to be modified to take that into account. For example, the dataset should be a heterogeneous list carrying the type information of each feature, and the main algorithm must keep track of the types to ensure the expression tree is correct (this idea can be borrowed from STGP).
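As a rough sketch of what "keeping track of the types" could look like, the check below walks an expression tree and verifies that every operator receives arguments of the types its signature declares, in the spirit of STGP. Everything here is hypothetical and for illustration only: the `Type` enum, the `Signature` and `Node` structs, and the `node` and `infer` helpers are not Brush's API.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class Type { Float, Categorical };

// Declared argument types and return type of an operator.
struct Signature {
    std::vector<Type> args;
    Type ret;
};

// Expression tree node: a leaf names a feature, an inner node names an operator.
struct Node {
    std::string name;
    std::vector<std::shared_ptr<Node>> kids;
};
using NodePtr = std::shared_ptr<Node>;

NodePtr node(std::string name, std::vector<NodePtr> kids = {}) {
    return std::make_shared<Node>(Node{std::move(name), std::move(kids)});
}

// Returns true and writes the tree's output type to `out` if the tree is
// well typed with respect to the feature schema and operator signatures.
bool infer(const Node& n,
           const std::map<std::string, Type>& features,
           const std::map<std::string, Signature>& ops,
           Type& out) {
    if (n.kids.empty()) {  // leaf: look up the feature's type in the schema
        auto it = features.find(n.name);
        if (it == features.end()) return false;
        out = it->second;
        return true;
    }
    auto it = ops.find(n.name);
    if (it == ops.end() || it->second.args.size() != n.kids.size()) return false;
    for (std::size_t i = 0; i < n.kids.size(); ++i) {
        Type t;
        if (!infer(*n.kids[i], features, ops, t) || t != it->second.args[i])
            return false;  // unknown subtree or argument type mismatch
    }
    out = it->second.ret;
    return true;
}
```

Under this sketch, the expression x1 * f2(x3) + cos(x2 + f1(x3)) type-checks when f1 and f2 are declared as Categorical -> Float, while cos(x3) is rejected because cos expects a Float argument. A GP implementation could use the same signatures to generate only well-typed trees rather than checking them after the fact.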
Let me know if I am diverging too much from the original idea 😆