Open source toolset for managing Apache Spark resources and job orchestration.
- CLI tool for managing Spark clusters and related resources
- Python package for integrating setup into job code
- All resources managed with single configuration file
- Local development environment mirroring production environment
- Required libraries all stored in single location
NOTE: Still working on initial release
Modes
- local: Single-node cluster set up on the local machine
- standalone: Multi-node cluster using Spark's standalone setup
- yarn: Multi-node cluster built using Hadoop YARN setup (future release)
- kubernetes: Multi-node cluster built using Kubernetes setup (future release)
Services
- Cluster manager: Start/stop clusters, Spark UI for cluster
- JDBC access server: Integrated HIVE Thriftserver for JDBC calls to Spark warehouse
- Metastore: Central SQL server managing HIVE metastore (future release)
- Job orchestrator: Job scheduler via cron (future release)
- History server: SparkUI for past runs (future release)
Once installed, the simplespark command can be used to
create and switch between different Spark environments.
Each environment contains a single cluster and a collection
of resources to run on the cluster.
The simplespark library is written in Python and can be
installed in two different ways:
For machines with Python already installed, use pip to install
both the Python module and the CLI tool.
pip install simplespark
TODO
The configuration can be expressed in a single JSON file or be defined in multiple files which are merged on import.
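As an illustration of what merging several configuration files on import could look like, here is a minimal Python sketch. It is not simplespark's actual loader: the functions `deep_merge` and `load_config` are hypothetical names, and the rule that later files override earlier ones (with nested objects merged key by key) is an assumption, since the merge behavior is not specified here.

```python
import json
from pathlib import Path


def deep_merge(base, override):
    """Return a new dict with `override` merged into `base`.

    Nested dicts are merged key by key; any other value in `override`
    replaces the value in `base`. This precedence rule is an assumption.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(paths):
    """Load each JSON file in order and merge them into a single dict."""
    config = {}
    for path in paths:
        with Path(path).open(encoding="utf-8") as handle:
            config = deep_merge(config, json.load(handle))
    return config
```

For example, `load_config(["base.json", "local.json"])` (both file names are placeholders) would return one dictionary in which values from `local.json` take precedence over those in `base.json`.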
Each mode requires different configuration properties
to be defined, and within each mode there are optional settings
for specific add-ins.
The easiest way to start is to create a template for the specific mode by running the command below:
simplespark template <mode> <file-path>
Activating a specific environment sets the JAVA_HOME, SCALA_HOME and SPARK_HOME variables
for a shell session to point to that environment, and applies any other shell
updates required for the specific build.
This is done by calling an "activation" script generated during build:
source <environment-name>.spark
The environment is only activated for the shell session in which this command is called, so it is possible to interact with multiple environments at once, each in its own session.
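Because activation only affects the current session, it can be useful to check from job code whether the expected variables are present. The sketch below is a hypothetical helper, not part of the simplespark package; it only assumes the three variable names mentioned above.

```python
import os

# Variables the activation script is described as setting.
EXPECTED_VARS = ("JAVA_HOME", "SCALA_HOME", "SPARK_HOME")


def missing_spark_vars(environ=None):
    """Return the expected variables that are unset or empty.

    `environ` defaults to the current process environment; passing a
    plain dict makes the check easy to test.
    """
    env = os.environ if environ is None else environ
    return [name for name in EXPECTED_VARS if not env.get(name)]
```

Calling `missing_spark_vars()` in a session where no environment has been activated would typically return some or all of the three names; an empty list suggests the session is pointing at an environment.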
The following native Spark commands will automatically connect to the activated environment:
- spark-shell
- spark-submit
- pyspark
- spark-sql
An environment can be started and stopped, which spins up or shuts down the associated cluster and any additional resources defined in the configuration.
# Not required if already in desired environment
source <environment-name>.spark
simplespark start
simplespark stop
For all configurations, the following properties must be defined.
- name: Identifier of the simplespark "environment"
- simplespark_home: Full path to the directory used by simplespark to:
  - Store environment configurations
  - Store any required libraries
  - Store any scripts or custom modifications
- bash_profile_file: Full path to the bash profile file used to set the SIMPLESPARK_HOME environment variable
- packages: TODO
- driver: TODO
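A quick pre-flight check of a configuration against the required properties listed above might look like the following sketch. `validate_config` is a hypothetical helper, not a simplespark function. It only checks that each property is present and that the two path properties are full (absolute) paths; the expected contents of packages and driver are still marked TODO, so they are not inspected.

```python
import os

# Required for every configuration, per the list above.
REQUIRED_PROPERTIES = (
    "name",
    "simplespark_home",
    "bash_profile_file",
    "packages",
    "driver",
)

# Properties documented as full paths.
FULL_PATH_PROPERTIES = ("simplespark_home", "bash_profile_file")


def validate_config(config):
    """Return a list of human-readable problems; empty means none found."""
    problems = [
        f"missing property: {key}"
        for key in REQUIRED_PROPERTIES
        if key not in config
    ]
    for key in FULL_PATH_PROPERTIES:
        value = config.get(key)
        if isinstance(value, str) and not os.path.isabs(value):
            problems.append(f"{key} is not a full path: {value}")
    return problems
```

Returning a list rather than raising on the first problem lets a caller report every issue in a configuration file at once.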