SPHDM

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The Structure-based Phish Homology Detection Model (SPHDM) is a method proposed for detecting phishing webpages.

Thank you for your interest in our work!

The dataset used to train and test SPHDM is available in SPHDM/dataset.

Dataset Sources

The webpages used in the experiments come from the Internet. The benign webpage collection is from Alexa, a website maintained by Amazon that publishes world rankings of websites. We collected webpages from the top list provided by Alexa and treated them as benign. After filtering out invalid, error, and duplicate pages, 10,922 benign webpages remained.

The phishing webpage collection comes from PhishTank.org. PhishTank is an internationally well-known website that collects suspected phishing pages submitted by anyone, verifies whether each one is a fraudulent attempt, and then publishes a timely and authoritative list of phishing webpages for research. Because phishing webpages survive only a short time, we crawled the webpages listed on PhishTank every day from September 2019 to November 2019, collecting a total of 10,944 phishing webpages, and processed the webpages that did not meet the grammar rules.

phishing webpage: PhishTank
benign webpage: Alexa

Pretreatment

After the webpages are crawled and preprocessed, two files are created: page_PCA_N.7z contains all benign webpages, and page_PCA_P.7z contains all phishing webpages.
The processed data is stored in address.
Each webpage record in these files has the following format:

#	Attribute	Description	Type	Nullable
1	classstyle	Styles from .class selectors (a .class selector styles all elements of the specified class)	Array	No
2	hashcode	SHA-1 hash encoding of the DOM sequence	String	No
3	idstyle	Styles from id selectors (an id selector styles the HTML element marked with a specific id)	Dictionary	Yes
4	name	Identifier of the record	String	No
5	newtagseq	Tag sequence collected by depth-first traversal, with each tag annotated with the number of the layer in which it is located	Array	No
6	tagseq	Tag sequence collected by depth-first traversal	Array	No

The extraction details of classstyle and newtagseq can be found in Sections 3.3.1 and 3.3.2 of the paper.
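To make the tagseq, newtagseq, and hashcode attributes more concrete, here is a minimal sketch of how such fields could be derived from raw HTML using only the Python standard library. It is not the extraction code used to build the dataset. The names `TagSequenceCollector` and `extract_structure` are hypothetical, and several details are assumptions: layers are numbered from 1, each newtagseq entry is stored as a `(tag, layer)` tuple, and the hash is taken over the space-joined tag sequence.

```python
import hashlib
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not deepen the tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagSequenceCollector(HTMLParser):
    """Collects tags in document order, which matches a depth-first
    (pre-order) traversal of the DOM for well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tagseq = []      # tags only
        self.newtagseq = []   # (tag, layer) pairs; the pair encoding is an assumption

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.tagseq.append(tag)
        self.newtagseq.append((tag, self.depth))
        if tag in VOID_TAGS:
            self.depth -= 1   # void element closes immediately

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def extract_structure(html):
    """Return a dict with tagseq, newtagseq, and a SHA-1 hashcode."""
    collector = TagSequenceCollector()
    collector.feed(html)
    collector.close()
    # Assumption: the hash covers the space-joined tag sequence.
    joined = " ".join(collector.tagseq).encode("utf-8")
    return {
        "tagseq": collector.tagseq,
        "newtagseq": collector.newtagseq,
        "hashcode": hashlib.sha1(joined).hexdigest(),
    }


if __name__ == "__main__":
    record = extract_structure("<html><body><div><p>hi</p><br></div></body></html>")
    print(record["tagseq"])
    print(record["newtagseq"])
    print(record["hashcode"])
```

Under these assumptions, two pages with the same tag sequence produce the same hashcode regardless of their text content, which is the kind of structural fingerprint the table describes.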

Run The Following Commands

python FingerPrintCluster.py nfold <10> <normal sites info> <phishing sites info> <result> <0.2>
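The `nfold` and `<10>` arguments suggest 10-fold cross-validation, although the script's exact behavior and the meaning of the remaining arguments (such as `<0.2>`) are not documented here. As a rough illustration only, the sketch below shows how a generic n-fold split over record indices could be produced in plain Python. The function name `nfold_splits` and its parameters are hypothetical and are not taken from FingerPrintCluster.py.

```python
import random


def nfold_splits(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for n-fold cross-validation.

    Every index appears in exactly one test fold across the n_folds rounds.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed keeps splits reproducible
    # Deal the shuffled indices round-robin into n_folds groups.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


if __name__ == "__main__":
    for round_no, (train, test) in enumerate(nfold_splits(25, n_folds=5), start=1):
        print(round_no, len(train), len(test))
```

In a setup like this, the benign and phishing records would typically be split separately so that each fold keeps a similar class balance; whether the script does so is not stated in this README.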

Usage Policy and Legal Disclaimer

This dataset is distributed for research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike license (CC BY-NC-SA 4.0). By clicking the download buttons, you agree to use this data only for non-commercial, research, or academic applications. You may cite the paper listed in the Contact section if you use this dataset.

Contact

You can download this notebook as well as the well-organized dataset for training and testing. The toy example for visualization is in the SPHDM repository. If you find this work interesting and helpful, please cite the paper below. Thank you very much. For any questions, please email actour@163.com.

@inproceedings{feng2021SPHDM,
  title={Detecting Phishing Webpages via Homology Analysis of Webpage Structure},
  author={Jian Feng and Yuqiang Qiao and Ou Ye and Ying Zhang}
}
