This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The Structure-based Phish Homology Detection Model (SPHDM) is a method proposed for detecting phishing webpages.
Thank you for your interest in our work!
The dataset used by SPHDM for training and testing is deposited here in SPHDM/dataset.
The webpages used in the experiments come from the Internet. The benign webpage collection is from Alexa, a website maintained by Amazon that publishes worldwide website rankings. We collected the webpages in the top list provided by Alexa and treat them as benign. After filtering out invalid, error, and duplicate pages, 10,922 benign webpages remain.
The phishing webpage collection comes from PhishTank.org. PhishTank is an internationally well-known website that collects suspected phishing pages submitted by anyone, verifies whether each one is a fraudulent attempt, and then publishes a timely and authoritative list of phishing webpages for research. Because phishing webpages survive only briefly, we collected the webpages listed on PhishTank every day from September 2019 to November 2019, for a total of 10,944 phishing webpages, and processed those that did not meet the grammar rules.
phishing webpage: PhishTank
benign webpage: Alexa
After the webpages are crawled and preprocessed, two files are created: page_PCA_N.7z contains all benign webpages, and page_PCA_P.7z contains all phishing webpages.
The processed data is stored in address.
Each webpage record in these files has the following format:
| # | Attribute | Description | Type | Nullable |
|---|---|---|---|---|
| 1 | classstyle | The .class selector is the style of all elements of the specified class | Array | No |
| 2 | hashcode | DOM sequence hash encoding based on SHA-1 | String | No |
| 3 | idstyle | The id selector can specify a specific style for HTML elements marked with a specific id. | Dictionary | Yes |
| 4 | name | This is the identifier of a record | String | No |
| 5 | newtagseq | The depth-first traversal strategy collects the tag sequence, and attaches the number of layers where the tag is located | Array | No |
| 6 | tagseq | Depth-first traversal strategy to collect tag sequence | Array | No |
Details on extracting classstyle and newtagseq can be found in Sections 3.3.1 and 3.3.2 of the paper.
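To make the tagseq, newtagseq, and hashcode attributes more concrete, here is a minimal sketch of how such fields could be derived from raw HTML using only the Python standard library. It is not the SPHDM extraction code: the helper names (`TagSequenceParser`, `build_record`), the `tag_depth` encoding used for newtagseq, the choice to start depth at 0, and the decision to hash the space-joined tag sequence are all assumptions made for illustration.

```python
import hashlib
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagSequenceParser(HTMLParser):
    """Collects tags in document order, i.e. a depth-first pre-order walk."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # assumption: the root element sits at depth 0
        self.tagseq = []        # plain tag names
        self.newtagseq = []     # tag names annotated with their depth

    def handle_starttag(self, tag, attrs):
        self.tagseq.append(tag)
        # Hypothetical encoding: "<tag>_<depth>"; the real format may differ.
        self.newtagseq.append(f"{tag}_{self.depth}")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def build_record(name, html):
    """Build a dictionary with a subset of the attributes in the table."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    # Assumption: the hash covers the space-joined tag sequence.
    digest = hashlib.sha1(" ".join(parser.tagseq).encode("utf-8")).hexdigest()
    return {
        "name": name,
        "tagseq": parser.tagseq,
        "newtagseq": parser.newtagseq,
        "hashcode": digest,
    }


if __name__ == "__main__":
    page = "<html><body><div><p>hi</p><br></div></body></html>"
    print(build_record("example", page))
```

Under these assumptions, two pages with the same tag structure produce the same hashcode even when their text content differs, which is the kind of property a structure-based comparison relies on.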
python FingerPrintCluster.py nfold <10> <normal sites info> <phishing sites info> <result> <0.2>
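The `nfold <10>` arguments suggest 10-fold cross-validation. The internals of FingerPrintCluster.py are not shown here, so the following is only a generic sketch of how an n-fold split is usually produced; the function name `n_fold_splits`, the shuffling, and the fixed seed are assumptions and may not match what the script does.

```python
import random


def n_fold_splits(items, n_folds=10, seed=0):
    """Yield one (train, test) pair of lists per fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    # Deal items round-robin so fold sizes differ by at most one.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in n_fold_splits(range(100), n_folds=10):
        print(len(train), len(test))
```

Each record appears in exactly one test fold, so every webpage is used for testing once and for training in the remaining folds.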
This dataset is distributed for research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike license (CC BY-NC-SA 4.0). By clicking on the download buttons, you agree to use this data only for non-commercial, research, or academic applications. You may cite the above paper if you use this dataset.
You can download this notebook as well as the organized dataset for training and testing. The toy example for visualization is in the SPHDM repository. If you find this work interesting and helpful, please cite the paper below. Thank you very much. For any questions, please email actour@163.com.
@inproceedings{feng2021SPHDM,
  title={Detecting Phishing Webpages via Homology Analysis of Webpage Structure},
  author={Jian Feng and Yuqiang Qiao and Ou Ye and Ying Zhang}
}
