This repository contains supplemental materials, including an additional document
and code for applying our K-means clustering algorithm, AF-ICP,
to large-scale, high-dimensional sparse data sets such as
the 8.2M-sized PubMed data set, and for comparing it with the other algorithms
(ICP, TA-ICP, and CS-ICP) in ./Comparision.
The code is implemented in C and requires the following:
- OS: CentOS 7.6 and later
- g++ (GCC): >= 8.2.0
- perl: >= 5.16
- perf: 3.10
- bzip2 (optional)
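
Before building, it can help to confirm that the listed tools are on your PATH. The following is a minimal sketch, not part of the repository: the function name check_tools is hypothetical, and it only checks that each command exists, not its version.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the repository):
# report which of the given commands can be found on PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Return non-zero if at least one tool was not found.
    return "$missing"
}
```

For example, check_tools g++ perl perf reports each tool and returns a non-zero status if any is missing. Versions can then be compared against the list above with g++ --version, perl -v, and perf --version.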
To run AF-ICP, follow these steps:
- Prepare the 8.2M-sized PubMed data set following the procedure in the dataset directory.
This procedure creates ./dataset/pubmed.8_2M.db, which is used by the code in this repository.
If you cannot download the original data (docword.pubmed.txt) from the UCI Machine Learning Repository, you can download pubmed.8_2M.db.bz2 instead. Then execute bzip2 -d pubmed.8_2M.db.bz2 to extract pubmed.8_2M.db, and move it to the ./dataset directory.
- Execute make -f Makefile_itr5_aficp in ./src.
This creates the ./bin/itr5_aficp object on your system.
- Execute the perl script ./itr5_exeAFICP_8.2Mpubmed_perf.pl in ./exe.
The 8.2M-sized PubMed data set is loaded from ./dataset/pubmed.8_2M.db (3.8 GB) in around two minutes. Then, given K=10,000, AF-ICP is executed with 50-thread parallel processing (the default).
You can change the default values in the perl script. For instance, the number of threads is defined by $NumThreads in the script.
A log file is generated in ./Log.
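
The steps above can be sketched as a small shell script. This is an illustration under stated assumptions, not a script shipped with the repository: the helper names run_in and aficp_quickstart and the DRY_RUN variable are hypothetical, and the optional pubmed.8_2M.db.bz2 archive is assumed to have been saved in the repository root.

```shell
#!/bin/sh
# Sketch only: run_in, aficp_quickstart, and DRY_RUN are hypothetical
# names and are not part of the repository.

# Print a command and run it inside a directory, unless DRY_RUN=1.
run_in() {
    dir=$1
    shift
    echo "+ (cd $dir && $*)"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        (cd "$dir" && "$@")
    fi
}

# Run the steps from the repository root given as $1 (default: current directory).
aficp_quickstart() {
    root=${1:-.}

    # Optional fallback: unpack a downloaded pubmed.8_2M.db.bz2.
    # Assumption: the archive was saved in the repository root.
    if [ -f "$root/pubmed.8_2M.db.bz2" ]; then
        run_in "$root" bzip2 -d pubmed.8_2M.db.bz2 &&
        run_in "$root" mv pubmed.8_2M.db ./dataset/ || return 1
    fi

    # Build ./bin/itr5_aficp, then launch the run through the perl script.
    run_in "$root/src" make -f Makefile_itr5_aficp &&
    run_in "$root/exe" ./itr5_exeAFICP_8.2Mpubmed_perf.pl
}
```

After loading these functions into a shell, calling aficp_quickstart with DRY_RUN=1 set in the environment prints the commands without executing them, which is a cheap way to review the sequence before starting a long run. If the original data was prepared through the procedure in the dataset directory, the archive step is skipped automatically.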
For the comparison with ICP, TA-ICP, and CS-ICP, go to Comparison.
Please check LICENSE for details.