qiaodaben/SPHDM

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

SPHDM

The Structure-based Phish Homology Detection Model (SPHDM) is a method for detecting phishing webpages.

Thank you for your interest in our work!

The dataset used to train and test SPHDM is stored in SPHDM/dataset (https://github.com/qiaodaben/SPHDM/dataset).

Dataset Sources

The webpages used in the experiments were collected from the Internet. The benign webpages come from Alexa, a website maintained by Amazon that publishes global website rankings. We collected webpages from the top list provided by Alexa and treated them as benign. After filtering out invalid, erroneous, and duplicate pages, 10,922 benign webpages remained.

The phishing webpages come from PhishTank.org. PhishTank is an internationally well-known website that collects suspected phishing pages submitted by anyone, verifies whether each one is a fraudulent attempt, and publishes a timely, authoritative list of phishing webpages for research. Because phishing webpages are short-lived, we collected the pages listed on PhishTank every day from September 2019 to November 2019, for a total of 10,944 phishing webpages, and processed the webpages that did not conform to the grammar rules.

phishing webpage: PhishTank
benign webpage: Alexa

Pretreatment

After the webpages are crawled and preprocessed, two files are created: page_PCA_N.7z contains all benign webpages, and page_PCA_P.7z contains all phishing webpages.
The processed data is stored in address.
Each webpage record in these files has the following format:

| # | Attribute | Description | Type | Nullable |
| --- | --- | --- | --- | --- |
| 1 | classstyle | Styles that .class selectors apply to all elements of the specified class | Array | No |
| 2 | hashcode | SHA-1 hash of the DOM sequence | String | No |
| 3 | idstyle | Styles that id selectors apply to the HTML elements marked with a specific id | Dictionary | Yes |
| 4 | name | Identifier of the record | String | No |
| 5 | newtagseq | Tag sequence collected by depth-first traversal, with each tag annotated with the layer (depth) at which it is located | Array | No |
| 6 | tagseq | Tag sequence collected by depth-first traversal | Array | No |
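The attribute table above can be turned into a small sanity check for a loaded record. This is a minimal sketch, not part of the SPHDM code: it assumes a record has already been parsed into a Python dict, that "Array" corresponds to `list` and "Dictionary" to `dict`, and the names `SCHEMA`, `validate_record`, and the sample record are hypothetical.

```python
# Schema transcribed from the attribute table: name -> (expected type, nullable).
# Assumption: "Array" is a Python list and "Dictionary" is a Python dict;
# the on-disk encoding of the records is not specified in this README.
SCHEMA = {
    "classstyle": (list, False),
    "hashcode": (str, False),
    "idstyle": (dict, True),
    "name": (str, False),
    "newtagseq": (list, False),
    "tagseq": (list, False),
}


def validate_record(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, (expected_type, nullable) in SCHEMA.items():
        value = record.get(field)
        if value is None:
            # A missing key and an explicit null are treated the same way.
            if not nullable:
                problems.append(f"{field}: missing or null, but not nullable")
            continue
        if not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


# Hypothetical record invented for illustration; the values are placeholders
# and do not come from the dataset.
example = {
    "classstyle": [],
    "hashcode": "0" * 40,
    "idstyle": None,  # the only nullable attribute
    "name": "example-page",
    "newtagseq": [],
    "tagseq": [],
}
print(validate_record(example))
```

Returning a list of problems rather than raising on the first one makes it easier to report every malformed field of a record in a single pass.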

The extraction details of classstyle and newtagseq can be found in Sections 3.3.1 and 3.3.2.
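To make the tagseq, newtagseq, and hashcode attributes more concrete, the sketch below collects tags in depth-first (document) order with Python's standard `html.parser`, records each tag's depth, and hashes the tag sequence with SHA-1. It is an illustration under assumptions, not the authors' extraction code: the `(tag, depth)` pair encoding, the space-joined string fed to SHA-1, and the names `TagSequenceParser` and `extract_structure` are all hypothetical. The depth counter also assumes well-formed HTML in which every non-void element is explicitly closed.

```python
import hashlib
from html.parser import HTMLParser

# HTML void elements never have a closing tag, so they must not change depth.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class TagSequenceParser(HTMLParser):
    """Collect start tags in document order, which is a pre-order DFS of the DOM."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tagseq = []      # tags only
        self.newtagseq = []   # (tag, depth) pairs; this encoding is an assumption

    def handle_starttag(self, tag, attrs):
        self.tagseq.append(tag)
        self.newtagseq.append((tag, self.depth))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def extract_structure(html):
    """Return (tagseq, newtagseq, sha1_hex) for an HTML string."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    # Assumption: the hash input is the space-joined tag sequence.
    digest = hashlib.sha1(" ".join(parser.tagseq).encode("utf-8")).hexdigest()
    return parser.tagseq, parser.newtagseq, digest


page = "<html><head><title>t</title></head><body><div><p>hi</p><br></div></body></html>"
tagseq, newtagseq, hashcode = extract_structure(page)
print(tagseq)
print(newtagseq)
print(hashcode)
```

Two pages with the same tag sequence produce the same digest under this scheme, regardless of their text content or attributes.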

Run The Following Commands

python FingerPrintCluster.py nfold <10>  <normal sites info> <phishing sites info>  <result> <0.2>
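The `nfold` and `<10>` arguments suggest 10-fold cross-validation, although this README does not document how FingerPrintCluster.py partitions the data or what the remaining arguments control. Under that assumption, the following standalone sketch shows how a sample set can be split into k folds; `k_fold_indices` and its `seed` parameter are hypothetical and are not taken from the repository.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Striding assigns every index to exactly one fold; fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 10,922 is the number of benign webpages reported in this README.
splits = list(k_fold_indices(10922, 10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so every page is used for testing once and for training in the other k - 1 rounds.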

Usage Policy and Legal Disclaimer

This dataset is distributed for research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike license (CC BY-NC-SA 4.0). By clicking the download buttons, you agree to use this data only for non-commercial research or academic applications. You may cite the paper listed in the Contact section if you use this dataset.

Contact

You can download this notebook as well as the well-organized dataset for training and testing. The toy example for visualization is in the SPHDM repository. If you find this work interesting and helpful to your own work, please cite the paper below. Thank you very much. If you have any questions, email actour@163.com.

@inproceedings{feng2021SPHDM,
  title={Detecting Phishing Webpages via Homology Analysis of Webpage Structure},
  author={Jian Feng and Yuqiang Qiao and Ou Ye and Ying Zhang}
}
