GitHub - Langer81/Summer-REU-Research: A classifier to identify fake news

# Summer-REU-Research

To the next PSU student who inherits this code:

The goal of this project is to create a multinomial classifier that takes an article URL and labels it as one of the following: fake, real, satirical, polarized, opinion, misreporting, or persuasive information. (Each type is explained in the explication paper.)

Important files:

feature_extraction.py - ArticleVector() will be your best friend this summer. ArticleVector is how you vectorize an article into its respective features. You can find the features in fill_vector().

classifier.py - This is the initialization of the support vector machine, plus the logistics of data preparation.

validator.py - This will help you validate the accuracy of your model.

The .txt files are basically all for data collection. Their names should explain their respective uses.

Features from the Explication paper that have been implemented:

Reputable URL ending (taken from "reputable_news_sources.txt") | boolean
whether or not a URL is from a reputable news source | boolean
number of times "Today" is written / total number of words | double
number of grammar mistakes | int
number of quotations / total number of words | double
number of past tense instances / total number of words | double
number of present tense instances / total number of words | double
number of times "should" is written / total number of words | double
whether or not "opinion" is in the URL | boolean
number of words that are in all caps / total number of words | double
whether or not a URL is from a satire news source | boolean
number of AP style errors (see ap_style_checker.py) | int
number of proper nouns that occur / total number of words | double
number of interjections that occur / total number of words | double
number of times "you" occurs / total number of words | double
Whether a URL has a dot gov ending / total number of words | double
whether a URL is from an unreputable site (taken from "unreputable_news_sources.txt") | boolean
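
To make the ratio and URL features above concrete, here is a minimal sketch of how a few of them could be computed. It is not the code in ArticleVector or fill_vector(); the function names `word_ratio_features` and `url_features`, the regex tokenizer, and the rule that an all-caps word needs at least two letters are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse


def word_ratio_features(text):
    """Compute a few 'count / total number of words' features (illustrative only)."""
    # Assumption: a word is a run of letters or apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    total = len(words)
    if total == 0:
        return {"today": 0.0, "should": 0.0, "you": 0.0, "all_caps": 0.0}
    lower = [w.lower() for w in words]
    return {
        "today": lower.count("today") / total,
        "should": lower.count("should") / total,
        "you": lower.count("you") / total,
        # Assumption: only words of two or more letters count as all caps.
        "all_caps": sum(1 for w in words if len(w) > 1 and w.isupper()) / total,
    }


def url_features(url):
    """Compute two of the boolean URL features (illustrative only)."""
    host = urlparse(url).hostname or ""
    return {
        "opinion_in_url": "opinion" in url.lower(),
        "gov_ending": host.endswith(".gov"),
    }


features = word_ratio_features("You should READ this TODAY because today matters")
flags = url_features("https://www.example.gov/opinion/story")
```

Dividing each count by the total number of words keeps long and short articles comparable, which is presumably why most of the features in the list are ratios rather than raw counts.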

Important Features that have not been implemented:

Fact-checking news articles
Impartial reporting
Conflict
Human interest
Prominence
Written by actual news staff
Clear About Us section
Emotionally charged words
Metadata
un/verified sources

These are the more significant missing features. The features implemented so far are the simpler, more trivial ones; the features above will require a lot more work.

Current classifier accuracy using support vector machine:

recall: [0.69642857 0.95535714 0.03125 0.30357143 0.89285714]
precision [0.82539683 0.87346939 0.5 0.24548736 0.50632911]
f1: [0.75544794 0.91257996 0.05882353 0.27145709 0.64620355]
{1: 68, 2: 10, 3: 217, 5: 156, 7: 24}
This model got 57.58928571428571 percent correct || 645 correct out of 1120

The positions in the recall/precision/f1 lists correspond to the labels of the article types, in this order: 1 = real news, 2 = fake news, 3 = opinion news, 5 = polarized news, 7 = satire.

As you can see, with the 5 categories currently implemented, the classifier is about 58% accurate. For comparison, the baseline without a classifier (picking one of the five labels at random) is 20% accurate.
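
The reported numbers can be cross-checked with a little arithmetic. The sketch below assumes the 1120 test articles are split evenly across the five labels (224 each); the README does not state this, but the figures are consistent with it. Under that assumption, the dictionary in the output matches the number of misclassified articles per true label, and each f1 value is the harmonic mean of the matching precision and recall.

```python
labels = [1, 2, 3, 5, 7]
recall = [0.69642857, 0.95535714, 0.03125, 0.30357143, 0.89285714]
precision = [0.82539683, 0.87346939, 0.5, 0.24548736, 0.50632911]
reported_f1 = [0.75544794, 0.91257996, 0.05882353, 0.27145709, 0.64620355]

total = 1120
per_class = total // len(labels)  # Assumption: 224 test articles per label.

# Recall is correct / (articles with that true label), so invert it.
correct = [round(r * per_class) for r in recall]
errors = {label: per_class - c for label, c in zip(labels, correct)}
accuracy = sum(correct) / total

# F1 is the harmonic mean of precision and recall.
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

print(errors)                        # misclassified articles per true label
print(sum(correct), accuracy * 100)  # correct count and percent correct
```

The weakest label by far is 3 (opinion news): only 7 of its assumed 224 test articles are recovered, so new features aimed at opinion writing are likely to move overall accuracy the most.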

To use the classifier, you must first collect data. To do this, use the prepare_data() method from classifier.py. The input is a dictionary with data text files as keys and their corresponding labels as values. See training_file_dict for an example.
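
As a rough illustration of the shape of that input, the sketch below builds a dictionary from file names to labels and checks that the files exist before anything is trained. The name `example_file_dict`, the helper `missing_files`, and the choice of the vector files (rather than the URL files) as keys are assumptions; check training_file_dict in the code for the real contents.

```python
from pathlib import Path

# Hypothetical example: file names come from the repository listing,
# labels from the mapping 1 = real, 2 = fake, 3 = opinion, 5 = polarized, 7 = satire.
example_file_dict = {
    "real_news_vectors-training.txt": 1,
    "fake_news_vectors-training.txt": 2,
    "opinion_vectors-training.txt": 3,
    "polarized_news_vectors-training.txt": 5,
    "satire_vectors-training.txt": 7,
}


def missing_files(file_dict, directory="."):
    """Return the keys of file_dict that do not exist in directory (hypothetical helper)."""
    base = Path(directory)
    return [name for name in file_dict if not (base / name).is_file()]


absent = missing_files(example_file_dict)
if absent:
    print("Missing data files:", absent)
```

Checking for missing files first gives a clearer error than a failure halfway through data preparation.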

The following lines of code are how you run the classifier for validation:

    support_vector_machine = classifier.svm_classifier(train_X_uncombined, train_Y_uncombined)
    svm_predictions = classifier.run_predictions(support_vector_machine, test_X_uncombined, test_Y_uncombined)
    get_statistics(test_Y_uncombined, svm_predictions)
    validate(support_vector_machine, test_X_uncombined, test_Y_uncombined)

Important note: data is separated out into URLs and vectors, and then split into training and testing sets. There is no centralized collection of data. For example, "Fake News" data will have 5 files:

fake_news_urls-testing.txt - text file with fake news urls separated by spaces for testing
fake_news_urls-training.txt - text file with fake news urls separated by spaces for training
fake_news_urls.txt - All fake news URLs compiled into one text file.
fake_news_vectors-testing.txt - The corresponding fake news testing URLs, from fake_news_urls-testing but vectorized into their respective features.
fake_news_vectors-training.txt - The corresponding fake news training URLs, from fake_news_urls-training but vectorized into their respective features.
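
The naming scheme above can be sketched as follows, together with a reader for the space-separated URL files. The helpers `data_files` and `read_urls` are hypothetical and are not part of classifier.py; the format of the vector files is not described here, so the sketch only reads URL files. The demo writes a throwaway file with placeholder URLs to a temporary directory.

```python
import tempfile
from pathlib import Path


def data_files(prefix):
    """List the five file names expected for one category, e.g. prefix='fake_news'."""
    return [
        f"{prefix}_urls-testing.txt",
        f"{prefix}_urls-training.txt",
        f"{prefix}_urls.txt",
        f"{prefix}_vectors-testing.txt",
        f"{prefix}_vectors-training.txt",
    ]


def read_urls(path):
    """Read a URL file; split() handles spaces and stray newlines alike."""
    return Path(path).read_text(encoding="utf-8").split()


# Demo with placeholder URLs in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / data_files("fake_news")[0]
    demo_path.write_text("http://a.example/1 http://b.example/2\n", encoding="utf-8")
    demo_urls = read_urls(demo_path)

print(data_files("fake_news"))
print(demo_urls)
```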

Good Luck.

Folders and files:

__pycache__
language-check
newspaper
opinion-or-fact-sentence-classifier
python-goose
Feature Engineering with newspaper.py
README.md
Research Article.docx
ap_style_checker.py
ap_style_checker_tests.py
classifier.py
data_collection_with_newspaper.py
fake_dataset.xlsx
fake_news_urls-testing.txt
fake_news_urls-training.txt
fake_news_urls.txt
fake_news_vectors-testing.txt
fake_news_vectors-training.txt
feature_extraction.py
opinion_urls-testing.txt
opinion_urls-training.txt
opinion_urls.txt
opinion_vectors-testing.txt
opinion_vectors-training.txt
polarized_news_urls-testing.txt
polarized_news_urls-training.txt
polarized_news_urls.txt
polarized_news_vectors-testing.txt
polarized_news_vectors-training.txt
real_news_urls-testing.txt
real_news_urls-training.txt
real_news_urls.txt
real_news_vectors-testing.txt
real_news_vectors-training.txt
reputable_news_sources.txt
satire_news_sources.txt
satire_urls-testing.txt
satire_urls-training.txt
satire_urls.txt
satire_vectors-testing.txt
satire_vectors-training.txt
test-file.py
tf_idf_vectorizer.py
unreputable_news_sources.txt
validator.py
