DS561_hw2

Steps to run this project:

If you have already generated the HTML files and copied them to your bucket, you do not need to run the generate-content.py file. Then:

Download the tasks.py file to your local computer, then upload it to the bucket directory where the HTML files are stored.
Then run the .py file (please check the permissions of your bucket first, as it has to be publicly accessible).
Once you run the file you get the output below: first the statistics of incoming_links and outgoing_links, followed by the PageRank and graph:

IMPORTANT NOTES:

The project name is -> u62138442-hw2
Bucket name -> u62138442_hw2_b2
You just need to download the tasks.py file and run it in Cloud Shell, with the HTML files in the same bucket.

Github link:

url : https://github.com/aishreddy13/DS561_hw2 (you have to accept the invite before you can access it)
SSH: git@github.com:aishreddy13/DS561_hw2.git
HTTP: https://github.com/aishreddy13/DS561_hw2.git

Below I explain how I wrote the code and the ways I tried to run it on the large set of data files:

The first part of the code processes all the HTML files and calculates the requested statistics (average, median, max, min, and quintiles) of incoming and outgoing links across all the HTML files. The output looks something like this:

The second part of the code calculates the PageRank. This took a very long time to run, so I created a VM named ‘instance-1’.
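As a rough illustration of the iterative PageRank calculation, here is a minimal sketch. It is not the code in tasks.py: the function names `pagerank` and `top_pages`, the damping factor of 0.85, the update rule `(1 - d) + d * sum(PR(src) / out_degree(src))`, the threshold, and the iteration cap are all assumptions made for this example.

```python
def pagerank(outgoing, damping=0.85, threshold=1e-6, max_iterations=100):
    """Iterative PageRank over a {page: [linked pages]} dict.

    Assumed update rule: PR(p) = (1 - d) + d * sum(PR(src) / out_degree(src)).
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(outgoing)
    for targets in outgoing.values():
        pages.update(targets)

    # Invert the graph so each page knows which pages link to it.
    incoming = {page: [] for page in pages}
    for source, targets in outgoing.items():
        for target in set(targets):
            incoming[target].append(source)
    out_degree = {page: len(set(outgoing.get(page, []))) for page in pages}

    # Initialize PageRank values for all pages.
    ranks = {page: 1.0 for page in pages}

    for _ in range(max_iterations):
        new_ranks = {}
        for page in pages:
            # Contribution from every page that links to this one.
            contribution = sum(ranks[src] / out_degree[src] for src in incoming[page])
            new_ranks[page] = (1 - damping) + damping * contribution

        # Converged when the total change across all pages is small enough.
        total_change = sum(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if total_change < threshold:
            break
    return ranks


def top_pages(ranks, n=5):
    """Return the n highest-ranked (page, score) pairs."""
    return sorted(ranks.items(), key=lambda item: item[1], reverse=True)[:n]
```

Building the incoming-link lists once, before the loop, keeps each iteration proportional to the number of links rather than rescanning every page for every target, which may matter when the running time is already long.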

I tried every way to run my code, but because of the long running time it did not work. I am going to explain all the ways I tried to make it work.

Steps in code:
• First, initialize the Google Cloud Storage client and specify the bucket name.
• Then, list all blobs (files) in the specified bucket.
• Then, iterate through the blobs and analyze the HTML files:
  • Download the HTML content.
  • Parse the HTML with BeautifulSoup.
  • Find all anchor (`<a>`) tags in the HTML.
  • Count incoming and outgoing links.
  • Store link counts in dictionaries (incoming_counts and outgoing_counts).
• Calculate statistics (average, median, max, min, and quintiles) for incoming and outgoing links and print them.
• Then proceed to calculate PageRank:
  • Initialize PageRank values for all pages.
  • Define a convergence threshold.
  • Perform an iterative PageRank calculation.
  • Calculate the contribution from incoming links and update PageRank values.
  • Check for convergence based on the total change in PageRank values.
  • Output the top 5 pages by PageRank score.
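The link-counting and statistics steps can be sketched roughly as below. This is an illustration under stated assumptions, not the code in tasks.py: it uses the standard-library `html.parser` in place of BeautifulSoup so the example has no dependencies, it takes the HTML as an in-memory `{file name: HTML text}` dict instead of downloading blobs from the bucket, and the names `AnchorCollector`, `count_links`, and `link_stats` are hypothetical.

```python
import statistics
from collections import Counter
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def count_links(pages):
    """Return (incoming_counts, outgoing_counts) for a {file name: HTML text} dict."""
    # Start every known page at zero so pages with no incoming links are counted.
    incoming_counts = Counter({name: 0 for name in pages})
    outgoing_counts = Counter()
    for name, html in pages.items():
        collector = AnchorCollector()
        collector.feed(html)
        outgoing_counts[name] = len(collector.hrefs)
        for target in collector.hrefs:
            incoming_counts[target] += 1
    return incoming_counts, outgoing_counts


def link_stats(counts):
    """Average, median, max, min, and quintile cut points of the link counts."""
    values = list(counts.values())
    return {
        "average": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
        # n=5 yields the four cut points that split the data into quintiles.
        "quintiles": statistics.quantiles(values, n=5),
    }
```

For example, `incoming, outgoing = count_links(pages)` followed by `link_stats(incoming)` and `link_stats(outgoing)` gives the two sets of statistics. With the real bucket, `pages` would be filled from the downloaded blob contents.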

VM SSH-in-browser window where I cloned the Git repository:

Cloud Shell terminal where I ran the code multiple times in different ways:

Billing information: I used a VM for this and it cost me around 7.79, because I ran it with an Ubuntu image disk to cope with the very long running time.

