Make LD parser more resilient #19

@gajus

$html('script[type="application/ld+json"]').each((index, item) => {
  try {
    let parsedJSON = JSON.parse($(item).text())
    if (!Array.isArray(parsedJSON)) {
      parsedJSON = [parsedJSON]
    }
    parsedJSON.forEach(obj => {
      const type = obj['@type']
      jsonldData[type] = jsonldData[type] || []
      jsonldData[type].push(obj)
    })
  } catch (e) {
    console.log(`Error in jsonld parse - ${e}`)
  }
})

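The current JSON-LD parser assumes a perfect-world scenario: any script block that is not strictly valid JSON fails JSON.parse, is only logged, and its data is dropped.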

Here is how I've implemented an LD+JSON parser in my local project:

(html: string): $ReadOnlyArray<Object> => {
  const dom = new JSDOM(html);

  const nodes = Object.values(dom.window.document.querySelectorAll('script[type="application/ld+json"]'));

  return nodes.map((node) => {
    if (!node || typeof node.innerHTML !== 'string') {
      throw new TypeError('Unexpected content.');
    }

    let body = node.innerHTML;

    debug('body', body);

    // Some websites (e.g. Empire) have JSON that includes raw newlines inside strings, i.e. invalid JSON.
    body = body.replace(/\n/g, '');

    // Some websites (e.g. Variety) have JSON that is wrapped in CDATA comments, e.g.
    // https://gist.github.com/gajus/4a2653b4a5235ccebedc44467a2896f2
    body = body.slice(body.indexOf('{'), body.lastIndexOf('}') + 1);

    return JSON.parse(body);
  });
};

Thus far it works with all the sites I have been testing.
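
A minimal sketch of how the two workarounds above might be folded into one helper that also accepts a top-level array and never throws, so one malformed script does not discard the rest. The names parseLdJsonBody and collectLdJson are hypothetical and not part of the project. The sketch takes script bodies as plain strings, so it assumes the caller has already extracted them with cheerio or JSDOM.

```javascript
// Hypothetical helper: parse the text of one ld+json script element.
// Always returns an array; an unparseable body yields an empty array.
const parseLdJsonBody = (body) => {
  const cleaned = body
    // Raw line breaks are legal only between JSON tokens, so removing them
    // is harmless there and repairs strings that contain them.
    .replace(/[\r\n]+/g, '')
    // Drop CDATA markers first; their square brackets would otherwise be
    // mistaken for the start or end of a JSON array.
    .replace(/<!\[CDATA\[|\]\]>/g, '');

  // Locate the outermost JSON value, which may be an object or an array.
  const starts = [cleaned.indexOf('{'), cleaned.indexOf('[')]
    .filter((index) => index !== -1);
  if (starts.length === 0) {
    return [];
  }
  const start = Math.min(...starts);
  const end = cleaned.lastIndexOf(cleaned[start] === '{' ? '}' : ']');

  try {
    const parsed = JSON.parse(cleaned.slice(start, end + 1));
    return Array.isArray(parsed) ? parsed : [parsed];
  } catch (error) {
    // Assumption: callers prefer to skip a bad block rather than abort.
    return [];
  }
};

// Hypothetical wrapper: flatten the results from every script body.
const collectLdJson = (bodies) => bodies.flatMap(parseLdJsonBody);
```

For example, collectLdJson(['{"a":1}', 'oops', '[{"b":2}]']) returns the two objects and silently skips the middle entry. One trade-off of this approach: a literal `]]>` or line break inside a string value is removed as well, which changes that value slightly.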
