mdlp

Discretization algorithm based on the paper by Fayyad & Irani Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning

The implementation tries to mitigate the problem of different label values with the same value of the variable:

Sorts the values of the variable using the label values as a tie-breaker
Once found a valid candidate for the split, it checks if the previous value is the same as actual one, and tries to get previous one, or next if the former is not possible.

Other features:

Intervals with the same value of the variable are not taken into account for cutpoints.
Intervals have to have more than two examples to be evaluated (mdlp).
The algorithm returns the cut points for the variable.
The transform method uses the cut points returning its index in the following way:
```
  cut[i - 1] <= x < cut[i]
```
using the std::upper_bound method
K-Bins discretization is also implemented, and "quantile" and "uniform" strategies are available.
Proportional k-Interval Discretization (PKID) is also implemented.

Discretization Algorithms

This library provides three discretization algorithms:

Fayyad & Irani's MDL discretization (CPPFImdlp): This is the main algorithm of the library. It is a supervised discretization algorithm that uses the Minimum Description Length Principle (MDLP) to find the best cut points for a continuous attribute.
Binning Discretization (BinDisc): This is a simple unsupervised discretization algorithm that divides the range of a continuous attribute into a fixed number of bins. Two strategies are available:
- uniform: creates bins of equal width.
- quantile: creates bins with an equal number of samples.
Proportional k-Interval Discretization (PKIDisc): This is an unsupervised discretization algorithm that uses the square root of the number of samples as the number of bins for a quantile binning, that is equal-frequency binning. It is based on the paper by Yang & Webb, "Proportional k-Interval Discretization for Naive-Bayes Classifiers".

Unified Interface

All discretization algorithms share a common interface with the fit(X, y) method signature:

// Supervised: uses y for label information
CPPFImdlp disc1;
disc1.fit(X, y);
auto result1 = disc1.transform(X);

// Unsupervised: accepts y but doesn't use it (for interface consistency)
BinDisc disc2(3, strategy_t::QUANTILE);
disc2.fit(X, y);  // y is ignored
auto result2 = disc2.transform(X);

PKIDisc disc3;
disc3.fit(X, y);  // y is ignored
auto result3 = disc3.transform(X);

This design enables a single code path in experimentation platforms:

void run_experiment(Discretizer& disc, samples_t& X, labels_t& y) {
    disc.fit(X, y);  // Works for ALL discretizer types
    auto result = disc.transform(X);
}

Note: For unsupervised methods (BinDisc, PKIDisc), the y parameter is accepted for interface consistency but is not used in the discretization algorithm.

Sample

To run the sample, just execute the following commands:

make build
build_release/sample/sample -f iris -m 2
build_release/sample/sample -h

Test

To run the tests and see coverage (llvm with lcov and genhtml have to be installed), execute the following commands:

make test

Building from Source

Using CMake

mkdir build && cd build
cmake ..
make

Using Conan

conan install . --output-folder=build --build=missing
cd build
cmake .. -DCMAKE_TOOLCHAIN_FILE=conan_toolchain.cmake -DCMAKE_BUILD_TYPE=Release
cmake --build .

Dependencies

PyTorch (libtorch): Required for tensor operations
C++17: Standard required for modern C++ features
CMake 3.20+: Build system

License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 145 Commits
.conan/profiles		.conan/profiles
.devcontainer		.devcontainer
.github/workflows		.github/workflows
.vscode		.vscode
config		config
sample		sample
scripts		scripts
src		src
test_consumer		test_consumer
test_package		test_package
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CMakeLists.txt		CMakeLists.txt
CONAN_README.md		CONAN_README.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
RELEASE_PLAN_V3.md		RELEASE_PLAN_V3.md
TECHNICAL_ANALYSIS_REPORT.md		TECHNICAL_ANALYSIS_REPORT.md
conandata.yml		conandata.yml
conanfile.py		conanfile.py
fimdlpConfig.cmake.in		fimdlpConfig.cmake.in
getversion.py		getversion.py
logo.png		logo.png
sonar-project.properties		sonar-project.properties
update_coverage.py		update_coverage.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mdlp

Discretization Algorithms

Unified Interface

Sample

Test

Building from Source

Using CMake

Using Conan

Dependencies

License

About

Uh oh!

Releases 12

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

mdlp

Discretization Algorithms

Unified Interface

Sample

Test

Building from Source

Using CMake

Using Conan

Dependencies

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 12

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages