Mercury Protocol Litepaper

Cost-efficient, scalable and easy-to-use compute network for training AI models

🧬 Mercury is a work in progress. Active research and development is underway; new versions of this paper will appear here and at mercuryprotocol.io. For comments and suggestions, contact us at research@mercuryprotocol.io.

Authors: Lajos Deme, Péter Berekvölgyi

Originally published September 2024. Republished here May 2026 for archival purposes — figures, roadmap, and references reflect the state of the project at the time of writing.

Introduction

Artificial intelligence is poised to revolutionize all societal constructs and facets of life. But as it stands today, there are various risks and problems with the way the development of this technology is progressing.

The most obvious one is rigid centralization. The exorbitant cost of training complex models is a moat that limits access to big tech giants. This monopoly on building limits the diversity of perspectives and approaches to solving problems, may result in biased algorithms and systems, and can lead to the exploitation of data and algorithms for purposes that may not align with the best interests of society, such as surveillance or manipulation.

Compute is expensive because the market is an oligopoly: a handful of key players control it and can charge whatever they like. Most AI startups already spend about 80% of their capital on computing power [1], and with the cost of training rising by an average of 3100% per year over the last decade [2], the pressure does not seem likely to ease any time soon.

The second largest barrier to entry is the sheer complexity of the AI training lifecycle. As it stands today, besides having significant expertise in artificial intelligence, DevOps expertise is also required. This is because the training process involves setting up complex workflows, configuring GPU clusters, and managing resources in a cost-efficient way.

If we want to see AI startups launching from someone's basement and changing the world, we need a paradigm shift in how we approach AI development. We need a novel way of accessing compute.

In this paper we present Mercury, a decentralized, cost-efficient, scalable, and easy-to-use compute network for training AI models.

Problem specification

Design goals

Affordable cost of compute

An essential requirement is that compute must cost significantly less compared to current centralized cloud compute providers.

The skyrocketing cost of compute is the most significant problem AI developers face today. The cost of training AI models has increased on average 3100% per year according to [2]. State-of-the-art AI models cost exorbitant amounts of money to train. According to [3], Google’s Gemini cost $191 million to train.

Figure 1: The training costs of state-of-the-art AI models

AI developers need an urgent solution to this problem.

At the same time, supplying computing power has to remain profitable compared to alternative utilization options (such as mining cryptocurrency). The pricing mechanism must take the interests of both sides into account.

Easy-to-use

We spoke with dozens of AI developers, all of whom lament that they have to be DevOps specialists on top of being great at developing AI models. They want to work on what they are passionate about: inventing and building new forms of AI systems, not managing cloud instances, implementing communication among GPU clusters, and overseeing training workflows.

Thus, a viable implementation must take this load off the shoulders of AI engineers, and make it easy to get from algorithm and dataset to trained model.

Privacy preserving

To be useful for businesses and not just hobbyists and researchers, a solution must maintain the privacy of the algorithm and dataset as well. These very often have big business value and form the most important competitive advantage of the business.

Highly scalable

Finally, the implementation must be highly scalable. It must efficiently and automatically distribute compute intensive workloads among GPUs, satisfying the hardware requirements of complex deep learning models with huge datasets.

Challenges

Estimating compute requirements

Any system that automatically partitions and distributes a deep learning task among a given number of nodes must be able to estimate the computational requirements of that task. To do this reliably, it must calculate the total number of floating-point operations (FLOPs) required to train the final version of the system and also know the computational capability of each node in the system, measured in floating-point operations per second (FLOPS).
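As a rough illustration of such an estimate, the sketch below uses the widely cited rule of thumb that training a dense transformer takes about 6 floating-point operations per parameter per training token, and divides the total by the combined sustained throughput of the nodes. The rule of thumb, the `utilization` factor, the function names, and the example job are assumptions made for illustration; the paper does not specify Mercury's estimator.

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training operations with the 6 * N * D rule of thumb."""
    return 6.0 * num_params * num_tokens


def estimated_hours(total_flops: float, node_flops_per_sec: list[float],
                    utilization: float = 0.4) -> float:
    """Wall-clock hours if node throughput adds up at the given utilization.

    `utilization` is an assumed fraction of peak throughput that is sustained.
    """
    if not node_flops_per_sec:
        raise ValueError("at least one node is required")
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    sustained = sum(node_flops_per_sec) * utilization
    return total_flops / sustained / 3600.0


# Hypothetical job: 1B parameters, 20B tokens, eight nodes at 100 TFLOPS peak each.
flops = training_flops(1e9, 20e9)
hours = estimated_hours(flops, [100e12] * 8)
print(f"{flops:.2e} operations, about {hours:.1f} hours")
```

An estimate like this ignores communication and stragglers, so a real scheduler would likely treat it as a lower bound on training time.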

Workload distribution and scheduling

The system must work with a topology of nodes and intelligently partition, schedule, and distribute incoming tasks among the nodes. The nodes must work on the given task utilizing techniques such as data and pipeline parallelism to achieve maximum scale.
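For the data-parallel case, one simple way to partition a dataset is to give each node a contiguous shard sized in proportion to its throughput, so that faster nodes receive more samples. The sketch below is a minimal illustration under that assumption; `partition_samples` and the node identifiers are hypothetical and not part of the protocol.

```python
def partition_samples(num_samples: int, node_flops: dict[str, float]) -> dict[str, range]:
    """Split sample indices into contiguous shards proportional to node throughput."""
    if not node_flops:
        raise ValueError("at least one node is required")
    total = sum(node_flops.values())
    shards: dict[str, range] = {}
    start = 0
    cumulative = 0.0
    for node_id, flops in node_flops.items():
        cumulative += flops
        # Rounding the cumulative boundary keeps the shard sizes summing to num_samples.
        end = round(num_samples * cumulative / total)
        shards[node_id] = range(start, end)
        start = end
    return shards


# Hypothetical nodes with 30, 10, and 10 TFLOPS of throughput.
shards = partition_samples(10_000, {"node-a": 30e12, "node-b": 10e12, "node-c": 10e12})
for node_id, shard in shards.items():
    print(node_id, len(shard))
```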

Failure recovery

Any distributed system with a set of independent nodes must have proper mechanisms in place to handle unexpected node failures during work execution. This must include checkpointing the current state of the model, detecting failed nodes, and having a recovery protocol in place to select a new node to continue that part of the task.
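The three ingredients named above (a checkpoint, failure detection, and a replacement protocol) can be sketched as follows. This is an illustration only: it assumes failures are detected through missed heartbeats and that idle standby nodes are available, neither of which the paper specifies, and `JobState`, `failed_nodes`, and `recover` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class JobState:
    """Bookkeeping a coordinator might keep for one training job (hypothetical)."""
    assignments: dict[str, int]      # node id -> shard index
    standby: list[str]               # idle nodes available as replacements
    last_heartbeat: dict[str, float] = field(default_factory=dict)
    checkpoint_step: int = 0         # last step with a saved model state


def failed_nodes(state: JobState, now: float, timeout: float) -> list[str]:
    """Nodes whose most recent heartbeat is older than `timeout` seconds."""
    return [n for n in state.assignments
            if now - state.last_heartbeat.get(n, float("-inf")) > timeout]


def recover(state: JobState, now: float, timeout: float = 30.0) -> dict[str, str]:
    """Hand each failed node's shard to a standby node; returns {failed: replacement}."""
    replaced = {}
    for node in failed_nodes(state, now, timeout):
        if not state.standby:
            raise RuntimeError("no standby node available")
        replacement = state.standby.pop(0)
        state.assignments[replacement] = state.assignments.pop(node)
        state.last_heartbeat[replacement] = now
        replaced[node] = replacement
    return replaced


state = JobState(assignments={"w1": 0, "w2": 1}, standby=["w3"],
                 last_heartbeat={"w1": 100.0, "w2": 60.0}, checkpoint_step=1200)
replaced = recover(state, now=105.0)  # w2 has been silent for 45 seconds
print(replaced, state.assignments, "resume from step", state.checkpoint_step)
```

The replacement node would load the model state saved at `checkpoint_step` and continue with the shard it inherited.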

Work verification

The implementation must be able to verify that the computation has been performed according to the requirements. This is the hardest problem to solve. There are various trade-offs to consider between generalizability, speed, trustworthiness, etc.

Solution

The Mercury Protocol is a decentralized peer-to-peer GPU rental network that taps into the dormant GPU power around the world. In this section we describe the main features and the core architecture of our protocol.

Features

Cost-efficient

Building our protocol as a set of independent nodes and smart contracts removes centralized chokepoints and points of oversized value extraction. By removing the margins currently enjoyed by cloud oligopolies and connecting the world's GPUs and CPUs into a compute supercluster, we can achieve drastically lower prices and highly increased scalability.

The electricity and HVAC costs for an individual running a V100 GPU are 50-100x lower than prices for the equivalent GPU on AWS [4].

We estimate that in an open marketplace such as Mercury, compute will be about 75% cheaper compared to AWS, while still being about 20% more profitable for the compute providers than mining cryptocurrency.
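These two targets bound the price from both sides: it has to be low enough to undercut AWS by the stated discount and high enough to beat mining by the stated premium. The sketch below computes that band per GPU-hour. The input prices are placeholders rather than measured values, `price_band` is a hypothetical helper, and comparing provider revenue with mining income is a simplification of "profitability".

```python
def price_band(aws_price: float, mining_income: float,
               discount: float = 0.75, premium: float = 0.20) -> tuple[float, float]:
    """Return (floor, ceiling) for a per-GPU-hour price.

    floor:   providers earn `premium` more than they would from mining.
    ceiling: developers pay `discount` less than the AWS price.
    """
    floor = mining_income * (1.0 + premium)
    ceiling = aws_price * (1.0 - discount)
    if floor > ceiling:
        raise ValueError("no price satisfies both sides")
    return floor, ceiling


# Placeholder inputs in USD per GPU-hour, not measured prices.
floor, ceiling = price_band(aws_price=3.00, mining_income=0.50)
print(f"viable price range: {floor:.2f} to {ceiling:.2f} USD per GPU-hour")
```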

Figure 2: MCY vs. AWS - Cost of training GPT-3 175B

Easy-to-use

AI moves at incredible speed, so simply keeping up with the newest developments is a laborious task in and of itself. That’s why the feedback we received most often from the AI developers we spoke with was to keep it simple and not make them learn a whole new technology just to train a model.

We took this seriously, and removed any unnecessary actions from the UI of our beta version.

Currently it takes 5 clicks to fine-tune an LLM and just a couple of clicks to deploy a PyTorch script and a dataset to our network. After that, all the developer has to do is wait for the training job to finish and then download the model.

Figure 3: MCY vs. AWS - Ease-of-use for AI

Fine-tuning an LLM can be done without writing a single line of code. Users can select any popular open-source LLM such as LLaMA 2, and they can also browse Hugging Face for datasets right from the UI. Below in User flow we provide a short demo that shows how to fine-tune an LLM with Mercury.

Verifiable computation

As stated in Challenges, one of the central problems in a distributed compute protocol is verifying that the computational task has actually been performed. There have been attempts at tackling this problem, but so far there haven’t been any scalable solutions.

A primitive but unworkable solution is simply to redo the whole work. This means doubling the amount of work on the one hand, and on the other, it presupposes trust in the verifier.

More advanced and fine-grained solutions rely on proof-of-learning [5]. However, this approach has been demonstrated to be vulnerable [6].

A third option is to use Trusted Execution Environments (TEEs) [7] to achieve privacy and verifiability. Until recently, many technical problems limited the effectiveness and general availability of this solution. The biggest issue is that TEEs like Intel SGX [8] and Arm TrustZone [9] are CPU technologies, while most ML computations need to be performed on the GPU. Even more problematic is the fact that computation executed inside a CPU TEE incurs significant overhead.

A solution to the above problems is to build a custom TEE for the GPU and execute the computation there. Most existing GPU TEE solutions require CPU and/or GPU hardware modifications, which prevents them from being widely adopted [10], [11]. Other solutions rely on untrusted system software, such as GPU device drivers, and on untrusted I/O paths.

Mercury offers an alternative approach to get around these problems. Our solution achieves verifiable computation, privacy, and security without modifications to existing hardware and with modest overheads.

Our solution is based on a handful of recent papers tackling GPU TEEs [12], [13], [14]. Based on [12] we use a custom, formally verified small hypervisor that collaborates with an Intel SGX CPU to enable verifiable computing on the GPU.

The unoptimized MVP version incurs ~13% overhead compared to unprotected GPU computation (with results confirmed by [13]). In the future, this overhead can be reduced further with deep-learning-specific modifications and code optimizations.

This approach provides verifiable computation, privacy, and scalability. Our current hardware requirements are: Intel CPU and NVIDIA GPU. In the future, with further developer effort we can extend our solution to any TEE-capable CPU and any GPU.

Permissionless

To achieve global adoption, the protocol must be open to anyone who wants to access or provision compute. This means that the system must be built in a way in which it is impossible to discriminate against any entity in any way, the cost of compute must not price out most participants, and to become a provider the investment in hardware must be modest.

By building our protocol as a peer-to-peer system where dealmaking and verification happen entirely on the blockchain, using automated, trustless smart contract systems, we achieve non-discrimination. By removing centralized value extractors we achieve cost reduction. By building our compute environment with commodity hardware in mind, we achieve low upfront investment in hardware.

Trustless

Blockchain technology guarantees that once the system is operational, the executives of the team responsible for the development of the protocol are unable to tamper with the code already deployed. This means that no party needs to trust any other party. The code itself guarantees correct behaviour.

Privacy preserving

As stated in Design Goals, the protocol must maintain the privacy of the algorithm and dataset. These are the intellectual property of the AI developer, so they must not be exposed.

Our GPU TEE solution explained in Verifiable Computation is private by default. Any model or data sent to the protocol is always kept private.

Architecture

Overview

The Mercury Protocol software stack consists of an off-chain node, a custom GPU TEE, and a suite of on-chain smart contracts.

Work request submission, price calculation, and work verification take place on-chain with our smart contracts.

The off-chain nodes form a peer-to-peer network of compute providers and collectively train machine learning models.

Participants

An off-chain node can be in one of three roles: watcher, leader, or worker.

Watcher

Watchers monitor the blockchain and pick up incoming work requests. Once they pick up a work request, they estimate the computing power needed to complete the request, given the size of the dataset, the number of layers in the neural network, the number of batches, epochs, etc.

Then they map the network to arrive at the optimal selection of nodes for the given work request. They consider the available GPU power of each node, their geographical location, etc. Finally, they distribute the work request to the selected nodes. If a node fails, the other nodes report it to the watcher, and it is the task of the watcher to select a new node.
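The selection step can be pictured with a simple greedy heuristic: prefer nodes in a given region, then take the fastest nodes until the required throughput is covered. This is a stand-in for illustration, not the watcher's actual algorithm (the paper does not describe how the optimal selection is computed), and the `Node` fields and `select_nodes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str
    flops_per_sec: float   # available GPU throughput
    region: str            # coarse geographical location


def select_nodes(nodes: list[Node], required_flops_per_sec: float,
                 preferred_region: str) -> list[Node]:
    """Greedy choice: preferred-region nodes first, fastest first within each group."""
    ranked = sorted(nodes, key=lambda n: (n.region != preferred_region, -n.flops_per_sec))
    chosen: list[Node] = []
    capacity = 0.0
    for node in ranked:
        if capacity >= required_flops_per_sec:
            break
        chosen.append(node)
        capacity += node.flops_per_sec
    if capacity < required_flops_per_sec:
        raise RuntimeError("not enough capacity in the network")
    return chosen


# Hypothetical network of four nodes.
nodes = [Node("a", 40e12, "eu"), Node("b", 80e12, "us"),
         Node("c", 60e12, "eu"), Node("d", 20e12, "eu")]
chosen = select_nodes(nodes, required_flops_per_sec=90e12, preferred_region="eu")
print([n.node_id for n in chosen])
```

A greedy pass like this favours locality over raw speed; here it picks two nodes in the preferred region even though a single faster node exists elsewhere.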

The workers and the leader send attestations to the watcher, which verifies them and uploads them to the blockchain. The watcher is the only node allowed to upload attestations to the blockchain and to give out work requests to the other nodes.

In the current MVP implementation, the watcher role is permissioned. In later versions, we will remove this limitation by implementing multiple watchers and Byzantine fault tolerance.

Leader

The leader is responsible for gathering the computed gradients from the workers, aggregating them, and then broadcasting the updates to all the workers. The leader listens for worker failures and reports them to the watcher. The leader is also a worker.
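A minimal sketch of the aggregation step, assuming the leader averages the workers' gradients element-wise and each worker then applies a plain gradient-descent update. The paper does not state the aggregation rule or the optimizer, so both are assumptions, and the function names are hypothetical.

```python
def aggregate_gradients(worker_grads: list[list[float]]) -> list[float]:
    """Element-wise mean of the workers' gradient vectors."""
    if not worker_grads:
        raise ValueError("no gradients received")
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("gradient vectors must have the same length")
    n = len(worker_grads)
    return [sum(column) / n for column in zip(*worker_grads)]


def apply_update(weights: list[float], grad: list[float], lr: float) -> list[float]:
    """Plain SGD step that every worker applies after the broadcast."""
    return [w - lr * g for w, g in zip(weights, grad)]


# Two workers report gradients for a two-parameter model.
mean_grad = aggregate_gradients([[1.0, 2.0], [3.0, 4.0]])
updated = apply_update([0.5, 0.5], mean_grad, lr=0.1)
print(mean_grad, updated)
```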

The leader role is present in the current MVP implementation, but in later versions it can be removed. In that case, the workers will share and aggregate gradients using all-reduce.
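To show how workers can aggregate gradients without a leader, the sketch below simulates a ring all-reduce in a single process. Each worker passes one chunk of its vector to its right-hand neighbour per step, first accumulating partial sums (reduce-scatter) and then circulating the finished chunks (all-gather). The ring variant is an assumption, since the paper does not say which all-reduce algorithm would be used, and `ring_all_reduce` is a hypothetical name.

```python
def ring_all_reduce(vectors: list[list[float]]) -> list[list[float]]:
    """Simulate a sum all-reduce over workers arranged in a ring.

    vectors[i] is worker i's gradient; every worker ends with the element-wise sum.
    """
    n = len(vectors)
    if n == 0:
        raise ValueError("at least one worker is required")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all vectors must have the same length")
    data = [list(v) for v in vectors]                 # each worker's local buffer
    bounds = [length * c // n for c in range(n + 1)]  # chunk c = bounds[c]:bounds[c+1]

    def chunk(c: int) -> range:
        c %= n
        return range(bounds[c], bounds[c + 1])

    def exchange(first_chunk_offset: int, step: int, accumulate: bool) -> None:
        # All workers send at once: snapshot the payloads before writing any of them.
        sends = []
        for i in range(n):
            c = i + first_chunk_offset - step
            sends.append((i, c, [data[i][k] for k in chunk(c)]))
        for i, c, payload in sends:
            dst = data[(i + 1) % n]
            for k, value in zip(chunk(c), payload):
                dst[k] = dst[k] + value if accumulate else value

    for step in range(n - 1):   # reduce-scatter: worker i ends owning chunk i + 1
        exchange(0, step, accumulate=True)
    for step in range(n - 1):   # all-gather: circulate the finished chunks
        exchange(1, step, accumulate=False)
    return data


result = ring_all_reduce([[1.0, 2.0, 3.0, 4.0],
                          [10.0, 20.0, 30.0, 40.0],
                          [100.0, 200.0, 300.0, 400.0]])
print(result[0])
```

Dividing the summed vector by the number of workers gives the same mean a central aggregator would compute, while each worker only ever talks to its neighbour.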

Worker

The worker runs the supplied AI training algorithm on a given portion of the supplied dataset. It sends the computed gradients to the leader at the interval configured by the AI developer (e.g., after each batch, after every n batches, or after every epoch).
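The configurable interval can be expressed as a small predicate that a worker checks after each batch. The sketch assumes the interval is given either as a positive batch count or as the string "epoch"; this encoding and the name `should_sync` are hypothetical, not the protocol's configuration format.

```python
from typing import Union


def should_sync(batch_index: int, batches_per_epoch: int,
                interval: Union[int, str]) -> bool:
    """True when gradients should be sent after the batch with this 0-based index.

    interval: a positive integer n (every n batches) or the string "epoch".
    """
    done = batch_index + 1
    if interval == "epoch":
        return done % batches_per_epoch == 0
    if isinstance(interval, int) and interval > 0:
        return done % interval == 0
    raise ValueError(f"unsupported interval: {interval!r}")


# Ten batches, five per epoch: sync points for two different settings.
every_two = [i for i in range(10) if should_sync(i, 5, 2)]
per_epoch = [i for i in range(10) if should_sync(i, 5, "epoch")]
print(every_two, per_epoch)
```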

User flow

In our MVP we’ve built an intuitive user interface for accessing our protocol. On our website, AI developers can purchase USDC through MoonPay with only a credit card and then use it to pay for compute. Currently the MVP requires MetaMask to be installed; this limitation can be removed later by using account abstraction.

Once users have enough balance, they select an open-source LLM like LLaMA 2 to fine-tune, or they can upload their custom model.

Then they can search Hugging Face datasets right from our UI, or they can upload their own custom dataset.

If they are fine-tuning an open-source LLM, they have the option to upload a custom config file or they can go with the default options.

After submitting their order, they wait for the training to be executed. They then receive a link where they can download the trained model.

https://www.youtube.com/watch?v=07Y_T2euB6U

Vision and future work

The current version of Mercury Protocol proves that providing cheap, scalable, and private compute to AI developers is indeed possible.

We will move from MVP to private beta in the coming months. Then our efforts will be focused on getting Mercury Protocol into production as soon as possible.

As stated above, future developer work will be dedicated to reducing the current overheads and extending the list of compatible CPUs and GPUs.

We will work towards making Mercury the default option for all businesses, developers, and researchers who work on training artificial intelligence globally.


[1] - Navigating the High Cost of AI Compute, a16z, 2023

[2] - Messari on Twitter, 2023

[3] - Artificial Intelligence Index Report, Stanford University, 2024

[4] - Yuan, B., He, Y., Davis, J. Q., Zhang, T., Dao, T., Chen, B., Liang, P., Re, C., & Zhang, C. (2022). Decentralized Training of Foundation Models in Heterogeneous Environments. arXiv:2206.01288

[5] - Jia, H., Yaghini, M., Choquette-Choo, C. A., Dullerud, N., Thudi, A., Chandrasekaran, V., & Papernot, N. (2021). Proof-of-Learning: Definitions and Practice. arXiv:2103.05633

[6] - Fang, C., Jia, H., Thudi, A., Yaghini, M., Choquette-Choo, C. A., Dullerud, N., Chandrasekaran, V., & Papernot, N. (2022). Proof-of-Learning is Currently More Broken Than You Think. arXiv:2208.03567

[7] - Trusted execution environment. (2024, April 24). In Wikipedia.

[8] - Software Guard Extensions. (2024, March 1). In Wikipedia.

[9] - ARM Security Technology, Building a Secure System using TrustZone Technology. (2009)

[10] - Stavros Volos, Kapil Vaswani, and Rodrigo Bruno. Graviton: Trusted execution environments on GPUs. In OSDI, 2018

[11] - Insu Jang, Adrian Tang, Taehoon Kim, Simha Sethumadhavan, and Jaehyuk Huh. 2019. Heterogeneous Isolated Execution for Commodity GPUs. In 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2019). ACM, Providence, RI, 455–468.

[12] - Xiaolong Wu, Dave (Jing) Tian, and Chung Hwan Kim. 2023. Building GPU TEEs using CPU Secure Enclaves with GEVisor. In SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing.

[13] - Ivanov, A., Rothenberger, B., Dethise, A., Canini, M., Hoefler, T., & Perrig, A. (2022). SAGE: Software-based Attestation for GPU Execution. arXiv:2209.03125

[14] - HaoHui Mai, Jiacheng Zhao, Hongren Zheng, Yiyang Zhao, Zibin Liu, Mingyu Gao, Cong Wang, Huimin Cui, Xiaobing Feng and Christos Kozyrakis. 2023. Honeycomb: Secure and Efficient GPU Executions via Static Validation. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23).
