Add support for wildcard HostAliases #64

@Turnerj

Follows on from discussions in #63. Currently the HostAliases setting is relatively limited: a link is only crawled if its domain exactly matches one of the configured aliases.

To make crawling a large number of subdomains easier, support for a wildcard (*) would be useful.

For example:

using InfinityCrawler;

var crawler = new Crawler();
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings {
	UserAgent = "MyVeryOwnWebCrawler/1.0",
	RequestProcessorOptions = new RequestProcessorOptions
	{
		MaxNumberOfSimultaneousRequests = 5
	},
	HostAliases = new [] { "*.example.org" }
});

There likely don't need to be any special rules around wildcard handling. A host alias that consists only of a wildcard would mean crawling every domain that is linked to. This is where analyzers of some kind, as well as additional documentation, would likely be useful.

Full wildcard support would also allow crawling of more complex subdomain patterns like web.*.example.org, which may help in some specific use cases.
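
As a rough illustration of the matching described above, here is a minimal sketch that turns a host alias into an anchored regular expression. `HostAliasMatcher` and `IsMatch` are hypothetical names, not part of InfinityCrawler, and the sketch assumes that `*` may match any run of characters, including dots.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not InfinityCrawler's API.
public static class HostAliasMatcher
{
	// Returns true when the host matches the alias, treating '*' as "any characters".
	public static bool IsMatch(string alias, string host)
	{
		// Escape the whole alias so '.' is literal, then turn each escaped '*' into ".*".
		var pattern = "^" + Regex.Escape(alias).Replace("\\*", ".*") + "$";
		return Regex.IsMatch(host, pattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
	}
}
```

Under these assumptions, `*.example.org` matches `web.example.org` but not the bare `example.org`, `web.*.example.org` matches `web.eu.example.org`, and a lone `*` matches any host. Whether the bare domain should also match is a design decision this sketch leaves open.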

Labels: enhancement
