Replace pickle with a more secure serialization format (JSON/Msgpack) #263

@cbonilla20

Have you searched if there is an existing feature request for this?

  • I have searched the existing requests

Feature description

Description

Currently, the library uses the pickle module for state persistence/checkpointing. While pickle is convenient for Python-to-Python object serialization, it poses a significant security risk and introduces maintenance challenges.

The Problem: Insecure Deserialization
The primary issue with pickle is that it is not secure against erroneous or maliciously constructed data. Because unpickling can import arbitrary callables and invoke them during loading, loading a crafted .pkl file can execute arbitrary code. If that file comes from an untrusted or remote source, the result is Remote Code Execution (RCE).
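To make the risk concrete, here is a minimal, harmless sketch of the mechanism. The `Payload` class is hypothetical and only for illustration; a real attacker would return something like `os.system` instead of a harmless `eval` expression.

```python
import pickle


class Payload:
    """Hypothetical class showing how __reduce__ controls unpickling."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" as the recipe for rebuilding
        # this object. The callable runs inside pickle.loads().
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# The loader never sees the Payload class; it just runs the stored call.
result = pickle.loads(blob)
print(result)  # 42
```

Nothing in the loading code opts in to this behaviour, so any code path that unpickles a file it did not write itself is exposed.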

Beyond security, pickle has two further drawbacks:

  • Compatibility: pickle files are often tied to specific Python versions or class structures. If the internal code of Scrapling changes, old pickle files may become unreadable (AttributeError/ModuleNotFoundError).
  • Interoperability: Formats like JSON or Msgpack allow the state to be inspected or used by other tools/languages if necessary.

Proposed Solution

I suggest migrating the serialization layer to a safer alternative:

  • JSON: Best for simple data structures and human readability.
  • Msgpack: Best for high-performance, binary serialization that remains safe (it doesn't execute code).
  • Marshmallow/Pydantic: To handle the mapping of complex objects to/from dictionaries before saving to JSON.
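As a rough sketch of the "map objects to dictionaries, then save as JSON" idea, using only the standard library in place of Marshmallow/Pydantic. The `CheckpointState` class and its fields are hypothetical; I have not checked what the library actually persists.

```python
import json
from dataclasses import asdict, dataclass, field, fields
from typing import Dict, List


@dataclass
class CheckpointState:
    """Hypothetical state object; the real fields will differ."""

    pending_urls: List[str] = field(default_factory=list)
    retry_counts: Dict[str, int] = field(default_factory=dict)


def dumps_state(state: CheckpointState) -> str:
    # asdict() turns the dataclass into plain dicts/lists that JSON can hold.
    return json.dumps(asdict(state), sort_keys=True)


def loads_state(text: str) -> CheckpointState:
    data = json.loads(text)  # parses data only; never calls user code
    if not isinstance(data, dict):
        raise ValueError("checkpoint must be a JSON object")
    allowed = {f.name for f in fields(CheckpointState)}
    unknown = set(data) - allowed
    if unknown:
        raise ValueError(f"unexpected keys in checkpoint: {sorted(unknown)}")
    return CheckpointState(**data)


original = CheckpointState(["https://example.com/a"], {"https://example.com/b": 2})
restored = loads_state(dumps_state(original))
```

One caveat with JSON: tuples come back as lists, sets are not supported, and non-string dictionary keys are converted to strings, so anything like that would need an explicit mapping step.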

Suggested Impact

  • Security: Eliminates the pickle-deserialization RCE vector, since JSON and Msgpack loaders only produce plain data.
  • Reliability: Better handling of version mismatches between saved states and the current library version.
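One possible way to get the reliability benefit is to store an explicit version number next to the state and migrate old files on load. This is only a sketch under assumptions: `CURRENT_VERSION`, the `urls` to `pending_urls` rename, and the function names are all made up for illustration.

```python
import json

CURRENT_VERSION = 2  # hypothetical schema version


def migrate_v1_to_v2(state):
    # Hypothetical change: v1 stored "urls", v2 calls it "pending_urls".
    migrated = dict(state)
    migrated["pending_urls"] = migrated.pop("urls", [])
    return migrated


# Maps a version number to the function that upgrades it by one step.
MIGRATIONS = {1: migrate_v1_to_v2}


def save_checkpoint(path, state):
    envelope = {"version": CURRENT_VERSION, "state": state}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(envelope, fh)


def load_checkpoint(path):
    with open(path, encoding="utf-8") as fh:
        envelope = json.load(fh)
    if not isinstance(envelope, dict) or not isinstance(envelope.get("version"), int):
        raise ValueError("not a recognised checkpoint file")
    version = envelope["version"]
    state = envelope.get("state", {})
    # Upgrade step by step until the state matches the current schema.
    while version < CURRENT_VERSION:
        if version not in MIGRATIONS:
            raise ValueError(f"no migration path from version {version}")
        state = MIGRATIONS[version](state)
        version += 1
    if version != CURRENT_VERSION:
        raise ValueError(f"checkpoint version {version} is newer than supported")
    return state
```

With this layout, a version mismatch produces a clear error or a migration instead of an AttributeError from deep inside the unpickler.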
