
Generic XML (and RSS) Support #2

@jpillora


Currently, scraper doesn't work very well with plain XML documents, specifically RSS.

There are a few ways to solve this:

  1. The included CSS selector engine (https://github.com/PuerkitoBio/goquery) doesn't seem to parse XML properly. We could modify the HTML to make it conform. OR

  2. Each configured path could have a mode, where:

    1. html implies selector is a CSS selector (and is the default mode)
    2. xml implies selector is an XPath selector (eww) OR
    3. xml implies selector is a new format: "foo bar" simply traverses into <foo>, then into <bar> OR
    4. xml removes all other settings, and simply converts XML into JSON directly (this is probably the easiest)
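
As a rough illustration of the last option (direct XML to JSON), here is a minimal sketch using only the Go standard library. The `node` type, the `xmlToJSON` helper and the resulting JSON shape (`attrs`, `text`, `children`) are all assumptions made for this example and not part of scraper; namespaces are dropped and repeated child elements are collected into arrays.

```go
package main

import (
	"encoding/json"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// node is a hypothetical generic representation of one XML element.
type node struct {
	Attrs    map[string]string  `json:"attrs,omitempty"`
	Text     string             `json:"text,omitempty"`
	Children map[string][]*node `json:"children,omitempty"`
}

// xmlToJSON walks the XML token stream and returns a JSON object
// keyed by the root element name. Namespace prefixes are ignored.
func xmlToJSON(doc string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	root := &node{} // synthetic parent that holds the document element
	stack := []*node{root}
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		top := stack[len(stack)-1]
		switch t := tok.(type) {
		case xml.StartElement:
			n := &node{}
			for _, a := range t.Attr {
				if n.Attrs == nil {
					n.Attrs = map[string]string{}
				}
				n.Attrs[a.Name.Local] = a.Value
			}
			if top.Children == nil {
				top.Children = map[string][]*node{}
			}
			top.Children[t.Name.Local] = append(top.Children[t.Name.Local], n)
			stack = append(stack, n)
		case xml.EndElement:
			// The decoder rejects unbalanced tags, so the stack cannot underflow.
			stack = stack[:len(stack)-1]
		case xml.CharData:
			top.Text += strings.TrimSpace(string(t))
		}
	}
	b, err := json.Marshal(root.Children)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	const feed = `<rss version="2.0"><channel><title>Feed</title><item><title>A</title></item></channel></rss>`
	out, err := xmlToJSON(feed)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

The trade-off with this approach is that users get no selectors at all: they receive the whole document as JSON and have to pick out fields themselves, which is why it is likely the easiest option to implement.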
