This implementation parses domain names from records with msgtype=response in raw WARC files on Amazon EMR.
The reduce step summarizes the domain names and sends the result to Amazon SQS. Input files are fetched from a URL.
Written specially for Noah Silverman.
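The map and reduce steps described above can be illustrated with a minimal Python sketch. It is not the code from this repository: the function names `map_domains` and `reduce_domains` and the sample data are hypothetical, the scan works on already decompressed WARC text, and it assumes the `WARC-Type` header appears before `WARC-Target-URI` within each record. Sending the result to SQS is left out.

```python
from collections import Counter
from urllib.parse import urlsplit


def map_domains(warc_text):
    """Yield the host name of every record whose WARC-Type is 'response'."""
    record_type = None
    for line in warc_text.splitlines():
        if line.startswith("WARC/"):
            # A version line starts a new record; forget the previous type.
            record_type = None
        elif line.startswith("WARC-Type:"):
            record_type = line.split(":", 1)[1].strip()
        elif line.startswith("WARC-Target-URI:") and record_type == "response":
            # urlsplit(...).hostname is already lower-cased, or None if absent.
            host = urlsplit(line.split(":", 1)[1].strip()).hostname
            if host:
                yield host


def reduce_domains(hosts):
    """Sum the occurrences of each host name, as a reducer would."""
    return Counter(hosts)


# Hypothetical, heavily shortened WARC headers for illustration only.
SAMPLE = """\
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://example.com/a

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://example.com/a

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://Example.com/b

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://example.org/
"""

counts = reduce_domains(map_domains(SAMPLE))
print(dict(counts))
```

Only the three `response` records are counted; the `request` record is skipped, and host names are compared case-insensitively.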
Example usage:
s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443736672328.14/warc/CC-MAIN-20151001215752-00000-ip-10-137-6-227.ec2.internal.warc.gz
elasticmapreduce/outDir/
aws-logs-xxx-us-west-2
sqsQueueName
us-west-2
accessKey
secretKey
never-summer/emr-simple