vmsync: VM replication tool: Discussion #1

deajan · 2026-04-10T13:25:04Z

deajan
Apr 10, 2026
Collaborator

First of all, sorry if this isn't the right place to ask this question, and sorry if the question is quite broad.

I'm designing various KVM/qemu disaster recovery solutions, mostly based upon storage replication (zfs send/receive, lvm thin send/receive, drbd...).

So far, all of these solutions work at block level and are quiescing-agnostic (at the very least, they must trigger an intermediary freeze/thaw script for said VMs). I tend to use them because of the low RPO/RTO, which is difficult to achieve with backups.

Reading the technical design of virtnbdbackup, I noticed that it has almost everything needed to make a top-notch disaster recovery solution, e.g. differential backups with changed block tracking (including fstrim support), quiescing support out of the box, plus "multi" snapshots (i.e. backing chains, if I understood correctly).

Now I am wondering how much of a gap there would be to use virtnbdbackup as a disaster recovery solution with a low RTO/RPO, i.e. replicating changes every 5 minutes and being able to start a VM directly from the backup, while copying a flattened version of said VM's storage in the background and then pivoting once the copy is complete.

Is that a use case virtnbdbackup can handle?

Thank you for any insight.

abbbi · 2026-04-10T13:37:45Z

abbbi
Apr 10, 2026
Maintainer

hi,

It was mostly designed as a backup tool. It could serve as a replication tool too if this were implemented:

abbbi/virtnbdbackup#57

Currently the backup data is stored in a special format that does not allow booting a virtual machine directly from the backup. The data must be processed with the virtnbdrestore utility beforehand.

If it supported storing the backup data as qcow images, then yes, it could be used as a "replication tool".

At the time I started the implementation, I wanted a streamable format for the backup data (which is not quite possible with qcow images).

Besides that, I think the QEMU stack has its own mechanism for replicating virtual machines, called COLO.

https://wiki.qemu.org/Features/COLO

Not sure if this can be set up via libvirt.

FWIW, using the dirty bitmap mechanism to create block-level "backups" currently also requires a filesystem freeze/thaw in the virtual machine, so it is no different from a storage-level approach to replicating the data.

I have already thought about designing a replication tool around the dirty bitmap feature, but I think this would be a completely new implementation, probably in another language (golang).

10 replies

abbbi Apr 13, 2026
Maintainer

i would imagine something like this:

https://grinser.de/~abi/vmsync/

I still need to decide on a license and where to host the source, if I'm going to release it.
The first alpha, however, seems to work fine in my tests.

deajan Apr 14, 2026
Collaborator Author

Wow... Very fast for someone who didn't want to develop a thing ;)
I'll have a couple of test runs today and report back.

Btw, knowing your other projects, I kind of expected vmsync to be Python-based.
As far as I can tell from the binary, it's a Go implementation.
I would love to maintain vmsync and add the features that are currently missing, but I won't be able to do this in Go, being a Python dev myself.

This is subject to your licence decision anyway.
Again, if I can't maintain this for technical reasons, I am at least willing to sponsor it (or even sponsor your PoC).

Let me know if you want to discuss the matter.

abbbi Apr 14, 2026
Maintainer

hi,

Well, I've gotten kind of tired of the whole Python dependency management hell, and keeping projects like virtnbdbackup up and running across a diverse fleet of Linux distributions is quite a pain in the ass. Releasing rpm packages that work for all the Fedora releases alone is a struggle whenever the major Python version changes (it requires changes and testing each time).

At some point I even had a complete rewrite in golang in mind; the language has all the required bindings, provided by the same upstream projects (libvirt and guestfs/libnbd). So naturally, this time I went with golang.

deajan Apr 14, 2026
Collaborator Author

I have the same decision running through my head for my own backup project, which is written in Python and for which I maintain signed Nuitka-compiled binaries for Windows / Linux on x64, x86, arm and arm64, which is quite some hell to do ;)
I'm currently thinking about going the Rust way, so yes, I totally understand.

I've tried vmsync, but so far I couldn't get it to work because I'm trying to replicate VMs from EL9 to EL10.
On EL9, I'm stuck with libnbd-1.20.3-4.el9.x86_64, so I get the error ./vmsync_linux_amd64: /lib64/libnbd.so.0: version `LIBNBD_1.22' not found (required by ./vmsync_linux_amd64) when trying to run it.
Is that a hard dependency, or is it just a compile-time recommendation?

I'm dead serious: the tool you're setting up is the only one I am aware of (and yes, I've searched the internet for a while for this) that could actually solve a KVM disaster recovery scenario without a big overhead.
I'm currently running KVM DR with zfs replication (which has its own problems with write amplification and the performance penalty of running VMs on CoW), have run LVM thin replication (a pain to set up and maintain), and have tried various other solutions.

If you can spend a minimum amount of time on it and open the source code under a GPL/MIT/BSD/whatever licence, as long as the community gets to use it, I'll happily chip in some money.
I do of course understand that it would be a slow side project.

abbbi Apr 14, 2026
Maintainer

It's built on debian13, which is my primary dev platform. As other modules are cgo, it's linked against the shared libnbd library, thus requiring the version from debian13. You could run it within a docker container until I get around to creating other builds.

abbbi · 2026-04-14T11:45:35Z

abbbi
Apr 14, 2026
Maintainer

I've moved this discussion to the (maybe) future repository.

I'm learning as we go, so I've set up dockerized builds for all kinds of distributions now. The latest releases are available at:

https://grinser.de/~abi/vmsync/releases/

The Rocky Linux builds should be binary compatible with the corresponding RHEL releases.
Please note that a cross sync rhel9->rhel10 might not work, as virtual machines on rhel9 might use features that are no longer available on rhel10, so the define could fail.


deajan · 2026-04-14T11:50:46Z

deajan
Apr 14, 2026
Collaborator Author

I see... Yet I am not really willing to overcomplicate my setups at this point (I already have bridges + openvswitch instances, so adding docker on top of this would require me to add firewall rules and various network exceptions).

I decided to add another EL10 host so I could just "play" without any further setup requirements.
Yet I have not succeeded so far:

I'm connecting from hyper01p to hyper02p. So far my ssh setup seems to be working without any major issue (hosts files are filled with IPs, I allowed root ssh connections for this setup, and the authorized_keys files and ssh host config are done):

[root@hyper01p ~]# ssh -i /root/.ssh/hyper02p root@hyper02p
[BANNER / MOTD messages]
[root@hyper02p ~]#

When I try to use the same settings with vmsync I cannot get a connection yet:

[root@hyper01p ~]# ./vmsync_linux_amd64 --source-domain trilium01p.local --source-uri qemu:///system --target-uri qemu+ssh://hyper02p/system --output-dir /tmp --ssh-key /root/.ssh/hyper02p
2026/04/14 13:45:52 discovered 1 qcow2 disks
2026/04/14 13:45:52 source URI does not use SSH; qemu-img info will run locally
2026/04/14 13:45:52 running local qemu-img info for disk=vda path=/data/private_vm/trilium01p.local-disk0.qcow2
2026/04/14 13:45:52 disk vda: format=qcow2 virtual-size=26843545600 path=/data/private_vm/trilium01p.local-disk0.qcow2
2026/04/14 13:45:52 sync failed: connect ssh for target file/export execution: ssh dial hyper02p:22: ssh: handshake failed: knownhosts: key is unknown

I've removed my /root/.ssh/known_hosts file and made sure it's recreated. I use ed25519 keys, and my known_hosts file has 3 entries for the same host:

192.168.80.202 ssh-ed25519 AAAA
192.168.80.202 ssh-rsa AAAAB3Nza[...]/
192.168.80.202 ecdsa-sha2-nistp256 AAAAE2VjZ[...]

Any chance that the Go ssh library doesn't like that key format?
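One thing that may be worth ruling out (an assumption, not something confirmed in this thread): the three entries above are keyed by IP address, while vmsync is pointed at the name hyper02p. A strict matcher that looks up only the dialed name would then find no entry at all. A minimal Python sketch that lists which plain entries match a given name; `entries_for_host` is a hypothetical helper, and hashed (`|1|...`) entries, wildcards and `[host]:port` forms are deliberately ignored:

```python
def entries_for_host(known_hosts_text, host):
    """Return (hosts_field, key_type) pairs whose host field lists `host` literally."""
    matches = []
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0].startswith("@"):  # skip @cert-authority / @revoked marker
            fields = fields[1:]
        if len(fields) < 3:
            continue
        hosts, key_type = fields[0], fields[1]
        if hosts.startswith("|1|"):  # hashed entries cannot be compared as text
            continue
        if host in hosts.split(","):
            matches.append((hosts, key_type))
    return matches


# The (truncated) entries quoted in the post above, used as-is.
sample = """\
192.168.80.202 ssh-ed25519 AAAA
192.168.80.202 ssh-rsa AAAAB3Nza[...]/
192.168.80.202 ecdsa-sha2-nistp256 AAAAE2VjZ[...]
"""

by_name = entries_for_host(sample, "hyper02p")
by_ip = entries_for_host(sample, "192.168.80.202")
print("entries keyed by name:", by_name)
print("entries keyed by IP:", by_ip)
```

If the lookup by name comes back empty while the lookup by IP does not, comparing the known_hosts files of the working and the failing host pairs might be a quick next step.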

2 replies

abbbi Apr 14, 2026
Maintainer

-ssh-insecure-host-key should help here.

I have the same issue on my system. It's not yet very clear why; I think it's the golang crypto package being very strict.

deajan Apr 14, 2026
Collaborator Author

Indeed, got it to work with -ssh-insecure-host-key.
I have another setup (EL9 to EL10) which does work with ssh. I'll investigate further to find what the culprit could be, since I have two sets of machines, one working and one not working with ssh.

deajan · 2026-04-14T12:01:42Z

deajan
Apr 14, 2026
Collaborator Author

I've also grabbed the Rocky Linux 9.3 release to use on my AlmaLinux 9.7 to AlmaLinux 10.1 test setup.
Interestingly, on these machines (different from the EL10 to EL10 pair above), I also use ed25519 ssh keys but don't encounter that bug, so I'll investigate my issue further.
The machines I describe here already have replication in place via zfs.

I've tried to run the replica as below:

./vmsync --source-domain www01p.local --source-uri qemu:///system --target-uri qemu+ssh://hv01.local/system --output-dir /tmp --ssh-key /root/.ssh/hyper01p --target-domain www01p.repl.local --output-dir /opt
2026-04-14T13:56:06+02:00 INFO discovered source domain domain=www01p.local
2026/04/14 13:56:06 INFO skipping cdrom device device=sda
2026-04-14T13:56:06+02:00 INFO discovered qcow2 disks count=1
2026-04-14T13:56:06+02:00 INFO source URI does not use SSH; qemu-img info will run locally
2026-04-14T13:56:06+02:00 INFO running local qemu-img info disk=vda path=/private_vm/vm/www01p.local-disk0.qcow2
2026-04-14T13:56:06+02:00 INFO disk info disk=vda format=qcow2 virtual_size=26843545600 path=/private_vm/vm/www01p.local-disk0.qcow2
2026-04-14T13:56:07+02:00 ERROR sync failed error=target disk already exists on target host: /private_vm/npf/www01p.local-disk0.qcow2

My initial thought was that --output-dir is supposed to let us write the qcow2 file to some predefined path on the destination host, but vmsync complains that there is already a qcow2 image at the original path (which of course is the case because of the existing zfs replication).
Is that the desired behavior? If so, I'll just try to replicate a VM clone that doesn't exist on the target system yet.


abbbi · 2026-04-14T12:04:39Z

abbbi
Apr 14, 2026
Maintainer

-output-dir is something that I've used for local testing. It's of no use currently and I think I'll remove it.
The current behavior is:

on the first replication, it fails if the target image already exists, because I don't want to overwrite data in these cases, as I don't know the state of the target image with regard to fstrim etc.
on incrementals, this check is disabled
the current implementation wants a "green field" on the target system, as in: VM and files must not exist before the first full sync
it will always use the same target locations on the target system for the disks
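The rules above can be modelled in a few lines. This is only a sketch of the described behavior in Python, not vmsync's actual code: `plan_sync` and `TargetExistsError` are hypothetical names, and treating "no existing checkpoint" as the signal for a first replication is an assumption.

```python
class TargetExistsError(Exception):
    """Raised when a first (full) sync would overwrite an existing target image."""


def plan_sync(existing_checkpoints, target_image_exists):
    """Decide between a full and an incremental sync.

    existing_checkpoints: names of checkpoints already present on the source
    target_image_exists:  whether the disk image already exists on the target
    """
    if not existing_checkpoints:
        # First replication: insist on a "green field" on the target.
        if target_image_exists:
            raise TargetExistsError("target disk already exists on target host")
        return "full"
    # Incremental: the existence check is disabled.
    return "incremental"


print(plan_sync([], False))                     # full
print(plan_sync(["vmsync-cpt-000001"], True))   # incremental
```

Note that, in this model, an incremental run never looks at the state of the target image at all.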

1 reply

deajan Apr 14, 2026
Collaborator Author

Makes perfect sense. Currently doing my first VM replication on the same subnet.
Going to try a remote to remote replication in the next hours :)

deajan · 2026-04-14T12:16:42Z

deajan
Apr 14, 2026
Collaborator Author

Initial copy result:

[...]
2026/04/14 14:07:22 nbd: copy progress  100.00% (20513619968/20513619968 bytes) 75.07 MiB/s
2026/04/14 14:07:22 nbd: copied 20513619968 bytes to target
2026/04/14 14:07:22 backup abort requested (cleanup), stopping libvirt backup job

Second run result:

2026/04/14 14:13:01 target nbd export started for /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2 on hyper02p:20809
2026/04/14 14:13:01 copy 225 extents to remote target /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2 (disk-size=26843545600)
2026/04/14 14:13:01 nbd: copy progress  100.00% (10747904/10747904 bytes) 42.41 MiB/s
2026/04/14 14:13:01 nbd: copied 10747904 bytes to target
2026/04/14 14:13:01 backup abort requested (cleanup), stopping libvirt backup job

Perhaps the only thing I would request is to change the wording "backup abort requested", which feels kind of bad ;)
I'll continue my tests on my other systems.


deajan · 2026-04-14T12:50:40Z

deajan
Apr 14, 2026
Collaborator Author

Next tests:

I noticed that the VMs need to be running in order to be replicated. Of course I understand that an MVP does not include "side quest" options, but it would be a nice feature for the future ;)

2026-04-14T14:21:51+02:00 ERROR sync failed error=source domain www01p-test.npf.local is inactive require running state before sync

I've finally replicated a VM over a WAN link:

2026-04-14T14:38:04+02:00 INFO nbd: copy progress  100.00% (12883066880/12883066880 bytes) 15.62 MiB/s
2026-04-14T14:38:04+02:00 INFO nbd copy complete written_bytes=12883066880
2026-04-14T14:38:04+02:00 INFO backup abort requested, stopping libvirt backup job trigger=cleanup

Second run:


[root@hyper02p ]# ./vmsync --source-domain www01p-test.local --source-uri qemu:///system --target-uri qemu+ssh://hv01.netperfect.org/system --output-dir /tmp --ssh-key /root/.ssh/hyper01p --debug
2026-04-14T14:41:03+02:00 INFO discovered source domain domain=www01p-test.npf.local
2026/04/14 14:41:03 INFO skipping cdrom device device=sda
2026-04-14T14:41:03+02:00 INFO discovered qcow2 disks count=1
2026-04-14T14:41:03+02:00 INFO source URI does not use SSH; qemu-img info will run locally
2026-04-14T14:41:03+02:00 INFO running local qemu-img info disk=vda path=/private_vm/vm/www01p-test.local-disk0.qcow2
2026-04-14T14:41:03+02:00 INFO disk info disk=vda format=qcow2 virtual_size=26843545600 path=/private_vm/vm/www01p-test.local-disk0.qcow2
2026-04-14T14:41:04+02:00 INFO created incremental checkpoint checkpoint=vmsync-cpt-000002 parent=vmsync-cpt-000001
2026-04-14T14:41:04+02:00 INFO starting incremental pull backup parent_checkpoint=vmsync-cpt-000001 new_checkpoint=vmsync-cpt-000002
2026-04-14T14:41:04+02:00 INFO libvirt backup export started on source via tcp host=127.0.0.1 port=10809
2026-04-14T14:41:04+02:00 INFO reading disk via libvirt backup NBD tcp export disk=vda export=vda
2026-04-14T14:41:04+02:00 INFO nbd connect for extents host=127.0.0.1 port=10809 export=vda checkpoint=vmsync-cpt-000001 incremental=true
2026-04-14T14:41:04+02:00 INFO nbd connected for extent query export=vda
2026-04-14T14:41:04+02:00 INFO nbd export size export=vda bytes=26843545600
2026-04-14T14:41:04+02:00 INFO nbd extent scan progress export=vda offset=4294967295 size=26843545600
2026-04-14T14:41:04+02:00 INFO nbd extent scan progress export=vda offset=8589934590 size=26843545600
2026-04-14T14:41:04+02:00 INFO nbd extent scan progress export=vda offset=12884901885 size=26843545600
2026-04-14T14:41:04+02:00 INFO nbd extent scan progress export=vda offset=17179869180 size=26843545600
2026-04-14T14:41:04+02:00 INFO nbd extent scan progress export=vda offset=21474836475 size=26843545600
2026-04-14T14:41:04+02:00 INFO nbd extent scan progress export=vda offset=25769803770 size=26843545600
2026-04-14T14:41:04+02:00 INFO nbd extent scan progress export=vda offset=26843545600 size=26843545600
2026-04-14T14:41:04+02:00 INFO nbd extent scan complete export=vda extents=275 selected=134
2026-04-14T14:41:06+02:00 INFO target nbd export started path=/private_vm/vm/www01p-test.local-disk0.qcow2 host=myredactedhost.tld port=20809
2026-04-14T14:41:06+02:00 INFO copy extents to remote target extents=275 path=/private_vm/vm/www01p-test.local-disk0.qcow2 disk_size=26843545600
2026-04-14T14:41:07+02:00 INFO nbd: copy progress  16.33% (4194304/25690112 bytes) 3.90 MiB/s
2026-04-14T14:41:08+02:00 INFO nbd: copy progress  48.72% (12517376/25690112 bytes) 5.87 MiB/s
2026-04-14T14:41:09+02:00 INFO nbd: copy progress  71.94% (18481152/25690112 bytes) 5.80 MiB/s
2026-04-14T14:41:10+02:00 INFO nbd: copy progress  88.27% (22675456/25690112 bytes) 5.35 MiB/s
2026-04-14T14:41:11+02:00 INFO nbd: copy progress  98.98% (25427968/25690112 bytes) 4.80 MiB/s
2026-04-14T14:41:11+02:00 INFO nbd: copy progress  100.00% (25690112/25690112 bytes) 4.73 MiB/s
2026-04-14T14:41:11+02:00 INFO nbd copy complete written_bytes=25690112
2026-04-14T14:41:11+02:00 INFO sync successful cleaning up parent checkpoint parent=vmsync-cpt-000001
2026-04-14T14:41:11+02:00 INFO backup abort requested, stopping libvirt backup job trigger=cleanup

So far so good, everything works ;)
I'm just puzzled by the network performance over ssh.
I usually replicate zfs snapshots between the same hosts and get roughly 60 MiB/s (raw, not compressed) using the same ssh settings.
I've checked the qemu-nbd process, which didn't go past 30% CPU. I wonder where the culprit could be. I'll investigate too ;)
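As a side note on the log above: the extent scan offsets advance in steps of 4294967295 bytes, which is 2^32 - 1. That would be consistent with the scan being issued in chunks capped by a 32-bit request length, though this is an inference from the log rather than anything stated about vmsync's implementation. A minimal Python sketch that reproduces the logged offsets under that assumption (`scan_offsets` and `MAX_REQUEST` are hypothetical names):

```python
MAX_REQUEST = 2**32 - 1  # 4294967295, the step seen between logged offsets


def scan_offsets(export_size, step=MAX_REQUEST):
    """Return the offset reached after each chunk of a chunked scan."""
    offsets = []
    offset = 0
    while offset < export_size:
        # Advance by one chunk, but never past the end of the export.
        offset = min(offset + step, export_size)
        offsets.append(offset)
    return offsets


offsets = scan_offsets(26843545600)  # export size from the log
for value in offsets:
    print(value)
```

For the 26843545600-byte export this yields seven steps, matching the seven "extent scan progress" lines.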

3 replies

abbbi Apr 14, 2026
Maintainer

the actual data is not transferred via ssh but directly via NBD.
The ssh connection is only used for creating the target images using qemu-img.

Data is read from 127.0.0.1 port=10809 (Source NBD exposed by backup)
And sent to NBD endpoint on: host=myredactedhost.tld port=20809 (target NBD server)

As for the source VM needing to be active: there is the possibility to start the VM in paused mode (like the virtnbdbackup -S option does) to be able to synchronize. That's something for future development.

deajan Apr 14, 2026
Collaborator Author

> the actual data is not transferred via ssh but directly via NBD.
> The ssh connection is only used for creating the target images using qemu-img.

I think I am right to guess that NBD still talks via an ssh tunnel? I mean, the data cannot flow anywhere else but via ssh.

abbbi Apr 14, 2026
Maintainer

> I think I am right to guess that NBD still talks via an ssh tunnel? I mean, the data cannot flow anywhere else but via ssh.

No, it doesn't. On the remote system the qemu-nbd server is listening on port 20809. It's not tied to libvirt or anything else; it is just started via ssh by the utility and killed after the transfer finishes:

20526 ? Rsl 0:03 qemu-nbd --fork --persistent --discard=unmap --format=qcow2 --bind 0.0.0.0 --port 20809 --pid-file /tmp/vmsync-qemu-nbd-fstrim-sda.pid /tmp/tmp.we3yOvdptW/fstrim.qcow2

It exposes the target file /tmp/tmp.we3yOvdptW/fstrim.qcow2 via NBD on port 20809.

Synchronization happens by reading from the backup NBD server (nbd.pread(..buffer)) and writing to the qemu-nbd process on target port 20809 (nbd.pwrite(buffer..)).
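The read-then-write pattern can be illustrated without any NBD library. The following Python sketch copies a list of (offset, length) extents between two ordinary files with os.pread and os.pwrite, which merely stand in for the nbd.pread / nbd.pwrite calls mentioned above; `copy_extents`, the chunk size and the sample extents are all made up for illustration and say nothing about how vmsync is written.

```python
import os
import tempfile


def copy_extents(src_fd, dst_fd, extents, chunk=1024 * 1024):
    """Copy each (offset, length) range from src_fd to the same offset in dst_fd."""
    written = 0
    for offset, length in extents:
        end = offset + length
        pos = offset
        while pos < end:
            buf = os.pread(src_fd, min(chunk, end - pos), pos)
            if not buf:
                raise IOError(f"short read at offset {pos}")
            done = 0
            while done < len(buf):  # pwrite may write fewer bytes than asked
                done += os.pwrite(dst_fd, buf[done:], pos + done)
            pos += len(buf)
            written += len(buf)
    return written


data = bytes(range(256)) * 16  # 4096 bytes of recognizable source data
with tempfile.TemporaryFile() as src, tempfile.TemporaryFile() as dst:
    src.write(data)
    src.flush()
    dst.truncate(len(data))  # target starts out as all zeros
    copied = copy_extents(src.fileno(), dst.fileno(), [(0, 512), (2048, 1024)])
    result = os.pread(dst.fileno(), len(data), 0)

print("copied bytes:", copied)
```

Only the listed ranges are touched on the target; everything else keeps whatever content it had before, which is why an incremental run depends on the target image still holding the data of the previous runs.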

In my local test setup (a nested VM running on a nested VM) I get around 100 MiB/s sync speed, which is not that awesome, but I guess it's to be expected with nested virtualization.

2026-04-14T16:10:05+02:00 INFO nbd: copy progress  50.31% (20871577595/41489727488 bytes) 103.69 MiB/s
2026-04-14T16:10:06+02:00 INFO nbd: copy progress  50.59% (20989018107/41489727488 bytes) 103.70 MiB/s
2026-04-14T16:10:07+02:00 INFO nbd: copy progress  50.85% (21098070011/41489727488 bytes) 103.68 MiB/s
2026-04-14T16:10:08+02:00 INFO nbd: copy progress  51.13% (21215510523/41489727488 bytes) 103.69 MiB/s

deajan · 2026-04-14T16:04:31Z

deajan
Apr 14, 2026
Collaborator Author

> No, it doesn't. On the remote system the qemu-nbd server is listening on port 20809. It's not tied to libvirt or anything else; it is just started via ssh by the utility and killed after the transfer finishes:

My assumption that you would send the NBD stream via ssh was wrong; I only made it because I didn't need to open a firewall port, which was to be expected since I configured my firewall to accept any traffic between both hosts.
So basically, I need to create a VPN between both hosts in order to ensure encrypted communication.
I wonder whether putting some pipe options on both sides would be a good idea.
Something along the lines of --pipe-options="zstd --fast -T0 | openssl enc -aes-256-cbc -pbkdf2 -k MySuperSecretPassword | mbuffer -s 128k -m 1G" on the sender and --pipe-options="mbuffer -s 128k -m 1G | openssl enc -aes-256-cbc -pbkdf2 -d -k MySuperSecretPassword | zstd -d" on the receiver side, so a user gets to decide on optional buffering, compression and encryption.
Yet that's another feature which I guess is out of scope right now. Also, since you read/write directly via buffered Go functions, I wonder if that's doable at all, or if it would need to be implemented natively.

I'm currently investigating two things:

ssh: handshake failed: knownhosts: key is unknown on one set of hosts, whereas it works on another set where I used the exact same host key generation method and the exact same security settings
the performance issue between remote hosts that both have enterprise-grade WAN links, where zfs replication achieves 60 MiB/s. I will probably launch a qemu-nbd server manually and trigger a manual backup in order to measure raw performance.

Will report back once I get some answers.
For now, I really enjoy the tool, and I will set up a cron task on my non-prod servers just to play with it for a while.
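To put the performance gap into numbers, here is a small Python calculation using only figures quoted in this thread: the 12883066880-byte initial WAN copy, the 15.62 MiB/s reported by vmsync and the roughly 60 MiB/s seen with zfs replication. `transfer_seconds` is just an illustrative helper.

```python
MIB = 1024 * 1024


def transfer_seconds(num_bytes, mib_per_s):
    """Time needed to move num_bytes at a sustained rate given in MiB/s."""
    return num_bytes / (mib_per_s * MIB)


wan_bytes = 12883066880  # initial WAN copy size from the log

nbd_time = transfer_seconds(wan_bytes, 15.62)  # rate reported by vmsync
zfs_time = transfer_seconds(wan_bytes, 60.0)   # rate seen with zfs replication

print(f"at 15.62 MiB/s: {nbd_time / 60:.1f} min")
print(f"at 60.00 MiB/s: {zfs_time / 60:.1f} min")
```

That is roughly 13 minutes versus about 3.5 minutes for the same initial copy, so the difference matters mostly for full syncs; the incremental runs shown so far moved only tens of megabytes.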


deajan · 2026-04-14T16:38:54Z

deajan
Apr 14, 2026
Collaborator Author

While continuing my tests, I think I found a blocker.
This is my system state after a couple of replications:

On source system, disk is 20G:

[root@hyper01p ]# ls -alhs /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2
20G -rw-------. 1 qemu qemu 20G 14 avril 18:14 /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2

Launching incremental copy

[root@hyper01p ]# ./vmsync_linux_amd64 --source-domain trilium01p.badmin.local --source-uri qemu:///system --target-uri qemu+ssh://hyper02p/system --output-dir /tmp --ssh-key /root/.ssh/hyper02p --ssh-insecure-host-key
2026/04/14 18:10:26 discovered 1 qcow2 disks
2026/04/14 18:10:26 source URI does not use SSH; qemu-img info will run locally
2026/04/14 18:10:26 running local qemu-img info for disk=vda path=/data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2
2026/04/14 18:10:26 disk vda: format=qcow2 virtual-size=26843545600 path=/data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2
2026/04/14 18:10:26 created incremental checkpoint vmsync-cpt-000003 (parent=vmsync-cpt-000002)
2026/04/14 18:10:26 starting incremental pull backup using parent checkpoint=vmsync-cpt-000002 (new checkpoint=vmsync-cpt-000003)
2026/04/14 18:10:26 libvirt backup export started on source via tcp; connect host=127.0.0.1 port=10809
2026/04/14 18:10:26 reading disk vda via libvirt backup NBD tcp export=vda
2026/04/14 18:10:26 nbd: connect for extents host=127.0.0.1 port=10809 export=vda checkpoint=vmsync-cpt-000002 incremental=true
2026/04/14 18:10:26 nbd: connected for extent query export=vda
2026/04/14 18:10:26 nbd: export=vda size=26843545600 bytes
2026/04/14 18:10:26 nbd: extent scan export=vda progress=4294967295/26843545600
2026/04/14 18:10:26 nbd: extent scan export=vda progress=8589934590/26843545600
2026/04/14 18:10:26 nbd: extent scan export=vda progress=12884901885/26843545600
2026/04/14 18:10:26 nbd: extent scan export=vda progress=17179869180/26843545600
2026/04/14 18:10:26 nbd: extent scan export=vda progress=21474836475/26843545600
2026/04/14 18:10:26 nbd: extent scan export=vda progress=25769803770/26843545600
2026/04/14 18:10:26 nbd: extent scan export=vda progress=26843545600/26843545600
2026/04/14 18:10:26 nbd: extent scan complete export=vda extents=663 selected=329
2026/04/14 18:10:26 target nbd export started for /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2 on hyper02p:20809
2026/04/14 18:10:26 copy 663 extents to remote target /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2 (disk-size=26843545600)
2026/04/14 18:10:27 nbd: copy progress  30.45% (56688640/186187776 bytes) 53.13 MiB/s
2026/04/14 18:10:28 nbd: copy progress  73.18% (136249341/186187776 bytes) 63.24 MiB/s
2026/04/14 18:10:29 nbd: copy progress  100.00% (186187776/186187776 bytes) 63.77 MiB/s
2026/04/14 18:10:29 nbd: copied 186187776 bytes to target
2026/04/14 18:10:29 backup abort requested (cleanup), stopping libvirt backup job

On target system:

[root@hyper01p badmonitor]# ls -alhs /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2
179M -rw-------. 1 root root 179M 14 avril 18:10 /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2

Note that the source disk is 20G whereas the target disk is roughly the size of the incremental send.

I decided to run everything again and start from a fresh replica:
On target:

rm -f /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2
virsh undefine trilium01p.badmin.loca

On source:

virsh checkpoint-delete --checkpointname vmsync-cpt-000001 --domain trilium01p.badmin.local --children
./vmsync_linux_amd64 --source-domain trilium01p.badmin.local --source-uri qemu:///system --target-uri qemu+ssh://hyper02p/system --output-dir /tmp --ssh-key /root/.ssh/hyper02p --ssh-insecure-host-key

2026/04/14 18:20:04 discovered 1 qcow2 disks
2026/04/14 18:20:04 source URI does not use SSH; qemu-img info will run locally
2026/04/14 18:20:04 running local qemu-img info for disk=vda path=/data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2
2026/04/14 18:20:04 disk vda: format=qcow2 virtual-size=26843545600 path=/data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2
2026/04/14 18:20:04 created initial checkpoint vmsync-cpt-000001
2026/04/14 18:20:04 starting full pull backup (no incremental bitmap)
2026/04/14 18:20:04 libvirt backup export started on source via tcp; connect host=127.0.0.1 port=10809
2026/04/14 18:20:04 reading disk vda via libvirt backup NBD tcp export=vda
2026/04/14 18:20:04 nbd: connect for extents host=127.0.0.1 port=10809 export=vda checkpoint=vmsync-cpt-000001 incremental=false
2026/04/14 18:20:04 nbd: connected for extent query export=vda
2026/04/14 18:20:04 nbd: export=vda size=26843545600 bytes
2026/04/14 18:20:04 nbd: extent scan export=vda progress=4294967295/26843545600
2026/04/14 18:20:04 nbd: extent scan export=vda progress=8589934590/26843545600
2026/04/14 18:20:04 nbd: extent scan export=vda progress=12884901885/26843545600
2026/04/14 18:20:04 nbd: extent scan export=vda progress=17179869180/26843545600
2026/04/14 18:20:04 nbd: extent scan export=vda progress=21474836475/26843545600
2026/04/14 18:20:04 nbd: extent scan export=vda progress=25769803770/26843545600
2026/04/14 18:20:04 nbd: extent scan export=vda progress=26843545600/26843545600
2026/04/14 18:20:04 nbd: extent scan complete export=vda extents=80 selected=42
2026/04/14 18:20:04 target nbd export started for /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2 on hyper02p:20809
2026/04/14 18:20:04 copy 80 extents to remote target /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2 (disk-size=26843545600)
2026/04/14 18:20:05 nbd: copy progress  0.43% (88211456/20513619968 bytes) 83.92 MiB/s
[...]
2026/04/14 18:24:27 nbd: copy progress  99.81% (20475346939/20513619968 bytes) 74.35 MiB/s
2026/04/14 18:24:27 nbd: copy progress  100.00% (20513619968/20513619968 bytes) 74.36 MiB/s
2026/04/14 18:24:28 nbd: copied 20513619968 bytes to target
2026/04/14 18:24:28 backup abort requested (cleanup), stopping libvirt backup job

Still on source:

[root@hyper01p ]# ls -alhs /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2
20G -rw-------. 1 qemu qemu 20G 14 avril 18:30 /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2

On target, after initial vmsync run:

[root@hyper02p ]# ls -alhs /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2
20G -rw-------. 1 root root 20G 14 avril 18:24 /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2

Since the target file is not a byte-for-byte copy, checksumming won't do, but I guess that having 20G of data on both sides is "good enough(TM)" for my test. I could probably just fire up the machine to make sure.

Now when I run a second vmsync command on source (same invocation):

2026/04/14 18:32:50 discovered 1 qcow2 disks
2026/04/14 18:32:50 source URI does not use SSH; qemu-img info will run locally
2026/04/14 18:32:50 running local qemu-img info for disk=vda path=/data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2
2026/04/14 18:32:50 disk vda: format=qcow2 virtual-size=26843545600 path=/data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2
2026/04/14 18:32:50 created incremental checkpoint vmsync-cpt-000002 (parent=vmsync-cpt-000001)
2026/04/14 18:32:50 starting incremental pull backup using parent checkpoint=vmsync-cpt-000001 (new checkpoint=vmsync-cpt-000002)
2026/04/14 18:32:50 libvirt backup export started on source via tcp; connect host=127.0.0.1 port=10809
2026/04/14 18:32:50 reading disk vda via libvirt backup NBD tcp export=vda
2026/04/14 18:32:50 nbd: connect for extents host=127.0.0.1 port=10809 export=vda checkpoint=vmsync-cpt-000001 incremental=true
2026/04/14 18:32:50 nbd: connected for extent query export=vda
2026/04/14 18:32:50 nbd: export=vda size=26843545600 bytes
2026/04/14 18:32:50 nbd: extent scan export=vda progress=4294967295/26843545600
2026/04/14 18:32:50 nbd: extent scan export=vda progress=8589934590/26843545600
2026/04/14 18:32:50 nbd: extent scan export=vda progress=12884901885/26843545600
2026/04/14 18:32:50 nbd: extent scan export=vda progress=17179869180/26843545600
2026/04/14 18:32:50 nbd: extent scan export=vda progress=21474836475/26843545600
2026/04/14 18:32:50 nbd: extent scan export=vda progress=25769803770/26843545600
2026/04/14 18:32:50 nbd: extent scan export=vda progress=26843545600/26843545600
2026/04/14 18:32:50 nbd: extent scan complete export=vda extents=469 selected=231
2026/04/14 18:32:52 target nbd export started for /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2 on hyper02p:20809
2026/04/14 18:32:52 copy 469 extents to remote target /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2 (disk-size=26843545600)
2026/04/14 18:32:52 nbd: copy progress  100.00% (22544384/22544384 bytes) 44.88 MiB/s
2026/04/14 18:32:52 nbd: copied 22544384 bytes to target
2026/04/14 18:32:52 backup abort requested (cleanup), stopping libvirt backup job

on target:

[root@hyper02p ]# ls -alhs /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2
23M -rw-------. 1 root root 23M 14 avril 18:32 /data/private_vm/badmin/trilium01p.badmin.local-disk0.qcow2

Again, lost the parent backing checkpoint.

Is there something I missed?
I see the same behavior on my other remote hosts where I tried vmsync.
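While testing, a cheap guard could be to record the target image size before and after each incremental run and flag any shrink. The Python sketch below is only an illustration of that idea: `file_sizes` and `looks_truncated` are hypothetical helpers, the two sizes correspond to the size column and the first (allocated) column of `ls -alhs`, and the assumption that st_blocks counts 512-byte units holds on Linux.

```python
import os
import tempfile


def file_sizes(path):
    """Return (apparent_size, allocated_size) in bytes for path."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512  # st_blocks: 512-byte units on Linux


def looks_truncated(size_before, size_after):
    """An incremental sync is not expected to make the target image smaller."""
    return size_after < size_before


# Demo on a scratch file; in practice path would be the target qcow2 image.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"\0" * 8192)
    path = handle.name

before, _ = file_sizes(path)
os.truncate(path, 1024)  # simulate the image being replaced by a smaller one
after, _ = file_sizes(path)
os.unlink(path)

print("before:", before, "after:", after, "truncated:", looks_truncated(before, after))
```

With the numbers from this post, a target going from 20G after the full sync to 23M after the incremental would be flagged immediately.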

Some context:
Both source and destination run latest AlmaLinux 10.1 and have identical chassis (source has better cpu/ram/disks but I guess that's out of scope).

Source

uname -a
Linux hyper01p.badmin.local 6.12.0-124.47.1.el10_1.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 31 06:07:05 EDT 2026 x86_64 GNU/Linux

rpm -qa | grep -E "libvirt|nbd|qemu"

python3-libvirt-11.5.0-1.el10.x86_64
libvirt-glib-5.0.0-4.el10.x86_64
libvirt-dbus-1.4.1-6.el10.x86_64
ipxe-roms-qemu-20240119-5.gitde8a0821.el10.noarch
qemu-kvm-common-10.0.0-14.el10_1.5.alma.1.x86_64
nbdkit-server-1.44.1-4.el10_1.x86_64
qemu-img-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-ui-opengl-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-device-display-virtio-gpu-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-device-display-virtio-gpu-pci-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-ui-egl-headless-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-ui-spice-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-core-10.0.0-14.el10_1.5.alma.1.x86_64
nbdkit-basic-filters-1.44.1-4.el10_1.x86_64
nbdkit-basic-plugins-1.44.1-4.el10_1.x86_64
nbdkit-curl-plugin-1.44.1-4.el10_1.x86_64
nbdkit-ssh-plugin-1.44.1-4.el10_1.x86_64
qemu-kvm-audio-pa-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-block-blkio-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-block-rbd-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-device-display-virtio-vga-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-device-usb-host-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-device-usb-redirect-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-pr-helper-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-tools-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-docs-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-10.0.0-14.el10_1.5.alma.1.x86_64
nbdkit-selinux-1.44.1-4.el10_1.noarch
nbdkit-1.44.1-4.el10_1.x86_64
libnbd-1.22.2-3.el10_1.x86_64
libvirt-libs-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-lock-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-log-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-client-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-proxy-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-client-qemu-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-ssh-proxy-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-common-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-core-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-network-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-nwfilter-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-interface-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-nodedev-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-qemu-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-secret-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-plugin-lockd-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-config-nwfilter-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-config-network-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-disk-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-iscsi-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-logical-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-mpath-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-rbd-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-scsi-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-kvm-11.5.0-4.8.el10_1.alma.1.x86_64

Destination

uname -a
Linux hyper02p.badmin.local 6.12.0-124.47.1.el10_1.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 31 06:07:05 EDT 2026 x86_64 GNU/Linux

rpm -qa | grep -E "libvirt|nbd|qemu"

libvirt-libs-11.5.0-4.8.el10_1.alma.1.x86_64
nbdkit-server-1.44.1-4.el10_1.x86_64
libvirt-client-11.5.0-4.8.el10_1.alma.1.x86_64
qemu-img-10.0.0-14.el10_1.5.alma.1.x86_64
libvirt-daemon-lock-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-log-11.5.0-4.8.el10_1.alma.1.x86_64
python3-libvirt-11.5.0-1.el10.x86_64
qemu-pr-helper-10.0.0-14.el10_1.5.alma.1.x86_64
nbdkit-basic-filters-1.44.1-4.el10_1.x86_64
nbdkit-basic-plugins-1.44.1-4.el10_1.x86_64
nbdkit-curl-plugin-1.44.1-4.el10_1.x86_64
nbdkit-ssh-plugin-1.44.1-4.el10_1.x86_64
libvirt-glib-5.0.0-4.el10.x86_64
libvirt-dbus-1.4.1-6.el10.x86_64
libvirt-daemon-proxy-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-common-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-core-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-network-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-config-network-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-disk-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-nwfilter-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-secret-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-config-nwfilter-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-iscsi-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-logical-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-mpath-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-rbd-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-scsi-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-storage-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-driver-interface-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-plugin-lockd-11.5.0-4.8.el10_1.alma.1.x86_64
libvirt-daemon-11.5.0-4.8.el10_1.alma.1.x86_64
qemu-kvm-docs-10.0.0-14.el10_1.5.alma.1.x86_64
libvirt-client-qemu-11.5.0-4.8.el10_1.alma.1.x86_64
nbdkit-selinux-1.44.1-4.el10_1.noarch
nbdkit-1.44.1-4.el10_1.x86_64
libvirt-daemon-driver-nodedev-11.5.0-4.8.el10_1.alma.1.x86_64
libnbd-1.22.2-3.el10_1.x86_64
libvirt-daemon-driver-qemu-11.5.0-4.8.el10_1.alma.1.x86_64
ipxe-roms-qemu-20240119-5.gitde8a0821.el10.noarch
qemu-kvm-common-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-ui-opengl-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-device-display-virtio-gpu-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-device-display-virtio-gpu-pci-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-ui-egl-headless-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-block-blkio-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-block-rbd-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-device-display-virtio-vga-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-device-usb-host-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-device-usb-redirect-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-audio-pa-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-core-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-tools-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-ui-spice-10.0.0-14.el10_1.5.alma.1.x86_64
qemu-kvm-10.0.0-14.el10_1.5.alma.1.x86_64
libvirt-11.5.0-4.8.el10_1.alma.1.x86_64

0 replies

abbbi · 2026-04-14T17:56:28Z

abbbi
Apr 14, 2026
Maintainer

Hm, I was under the impression that it's possible to incrementally change a qcow image via NBD, but it seems I'm wrong. I think what needs to be done is to create a new, temporary qcow image on the target with its backing image pointing to the already existing one, and then rebase the contents, as described in the QEMU documentation: https://qemu-project.gitlab.io/qemu/interop/bitmaps.html#example-second-incremental-backup
and:
https://events19.linuxfoundation.org/wp-content/uploads/2017/12/Eric-Blake_2018-libvirt-incremental-backup.pdf
Or I have somehow messed up the offset calculation while querying the changed extents, but it doesn't look like it.

0 replies

abbbi · 2026-04-14T18:58:57Z

abbbi
Apr 14, 2026
Maintainer

Yeah, apparently what's required is a temporary image that points to the base image via the backing file option, followed by committing the changes.
I've pushed a new release and can now confirm the file isn't overwritten anymore; after the commit, I can actually see the latest changes in the synced volume using virt-ls.

so my test was:

full sync
create files in the VM, then run "sync" to flush changes to disk (not required if the qemu agent is running, as fsfreeze takes care of it)
incremental sync
on the target system, use virt-ls to verify that the changed files exist within the qcow image

 virt-ls -a /tmp/tmp.we3yOvdptW/fstrim.qcow2  /
NEW_FILES_FOR_INC

I've pushed a new release.
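A minimal sketch of the overlay-then-commit approach described above, expressed as the two qemu-img invocations involved. The function name and the paths are hypothetical; how vmsync actually creates the overlay and writes the incremental data into it (for example through qemu-nbd) is not shown in this thread, so this only illustrates the general technique.

```python
import shlex


def overlay_commit_commands(base, overlay):
    """Return the qemu-img command lines for a temporary overlay workflow.

    Step 1 creates a qcow2 overlay whose backing file is the existing image.
    Incremental data would then be written into the overlay (not shown here).
    Step 2 commits the overlay's contents back into its backing file.
    """
    create = ["qemu-img", "create", "-f", "qcow2",
              "-b", base, "-F", "qcow2", overlay]
    commit = ["qemu-img", "commit", overlay]
    return [create, commit]


def as_shell(commands):
    """Render the command lists as copy-pasteable shell lines."""
    return [shlex.join(argv) for argv in commands]
```

Calling `as_shell(overlay_commit_commands("vm.qcow2", "vm.inc.qcow2"))` gives the two lines to run on the target; the base image is left untouched until the commit step.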

4 replies

deajan Apr 14, 2026
Collaborator Author

I'll have a test ride right now ;)

deajan Apr 14, 2026
Collaborator Author

So far so good, incremental replication looks okay now ;)

Thinking of the messages I see while replicating, I was wondering whether committing the differential into the base image should become an optional step? As with Hyper-V and others, it's interesting to keep the last n snapshots around in disaster recovery software. Of course it would be the responsibility of a script to make the XML file point to the right backing chain.

I've also tried to simulate some catastrophic scenarios, e.g. booting the replicated VM and then redoing a sync, or worse, sending incremental snapshots with missing intermediary ones.
Of course, all those scenarios are possible (and will obviously be done by users), and I am thinking of an "easy and cheap" way to prevent that.
A very basic solution would be to keep a vm.qcow2.vmsync.status file on the target side, in which the last replicated checkpoint name gets written along with the crc32 sum of the current file. On the next run, vmsync could check whether the last checkpoint name corresponds to the existing checkpoint, and whether the crc32 sum matches the current file.
crc32 is quite cheap on CPU and should be as fast as SSDs can be, but it would still be an optional check, since on very big qcow2 images it could quickly become a problem.
Perhaps another solution would be to just store the last checkpoint name and the mtime of the file, and if either of them changed, then trigger a full resync. The catch with mtime is that both writing the mtime to the status file and checking it must be done on the target side, in order to avoid clock drift issues.
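The status-file idea could be sketched roughly as follows. This is only an illustration of the proposal, not something vmsync implements: the JSON layout, the function names and the `.vmsync.status` suffix handling are assumptions, and the code is meant to run on the target side so that the mtime is written and compared on the same host.

```python
import json
import os
import zlib


def crc32_of(path, chunk=1 << 20):
    """Compute the crc32 of a file in 1 MiB chunks to keep memory use flat."""
    crc = 0
    with open(path, "rb") as fh:
        while True:
            block = fh.read(chunk)
            if not block:
                break
            crc = zlib.crc32(block, crc)
    return crc


def write_status(image, checkpoint, with_crc=False):
    """Record the last replicated checkpoint plus mtime (and optionally crc32)."""
    status = {"checkpoint": checkpoint, "mtime_ns": os.stat(image).st_mtime_ns}
    if with_crc:
        status["crc32"] = crc32_of(image)
    with open(image + ".vmsync.status", "w") as fh:
        json.dump(status, fh)


def target_is_consistent(image, expected_checkpoint):
    """Return True only if the image still matches the recorded state."""
    try:
        with open(image + ".vmsync.status") as fh:
            status = json.load(fh)
        mtime_ns = os.stat(image).st_mtime_ns
    except (OSError, ValueError):
        return False  # missing image, missing or unreadable status file
    if status.get("checkpoint") != expected_checkpoint:
        return False
    if status.get("mtime_ns") != mtime_ns:
        return False
    # The crc32 check is optional: it is only done if a sum was recorded.
    if "crc32" in status and crc32_of(image) != status["crc32"]:
        return False
    return True
```

A caller would run `target_is_consistent()` before an incremental sync and fall back to a full resync when it returns False, then call `write_status()` after each successful run.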

I'm not sure whether I'm bugging you with all this "feature brainstorming", just let me know if that's the case.

deajan Apr 14, 2026
Collaborator Author

Also, again, sorry if this is some kind of notepad, I'm just writing down some ideas:

Possible features:

optional quiescing (recommended, since "raw" copying would probably corrupt all kinds of log files / journals / databases...)
target state check before replication (checkpoint name + [mtime|crc32] of file)
optional keeping of n checkpoints as separate files in a backing chain on the target (hard effort unless the "current" file always has the base name and older backing files have numbered extensions; in that case, vmsync wouldn't need to care how many backing files there are when syncing. Flattening could be done via qemu-img commit if n checkpoints exist on the target after the sync process)
optional clobber option to overwrite existing VM
custom nbd-server port (in case of firewall)
nbd-server over ssh support (hard effort ? or opening a ssh session and making a port forwarding means keeping an ssh session open)
optional pipe filters (zstd/mbuffer/openssl) that would make stream compressed/buffered/encrypted (also maybe hard effort since go is used to actually read / write. Perhaps using a "dumb shell" command that does the read write would allow easier plugging of stream filters ? Just my evening thought)
basic prometheus metric output (eg a textcollector file with replication success, throughput and elapsed time, with labels being source and target hosts, and of course vm name)
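As an illustration of the last item, a wrapper script could render the results in the Prometheus text format and drop them into a file that node_exporter's textfile collector reads. The metric names and function names below are made up for this sketch; vmsync does not emit them.

```python
import os


def _escape(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(source, target, vm, success, written_bytes, seconds):
    """Build the metric lines; all metric names here are hypothetical."""
    labels = 'source="{}",target="{}",vm="{}"'.format(
        _escape(source), _escape(target), _escape(vm))
    lines = [
        f"vmsync_replication_success{{{labels}}} {1 if success else 0}",
        f"vmsync_written_bytes{{{labels}}} {written_bytes}",
        f"vmsync_duration_seconds{{{labels}}} {seconds}",
    ]
    return "\n".join(lines) + "\n"


def write_metrics(path, text):
    """Write via a temporary file and rename, so readers never see a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        fh.write(text)
    os.replace(tmp, path)
```

Throughput can be derived on the monitoring side from the written bytes and the duration, so it is not stored separately here.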

I am aware that you did say you don't want to spend time on this. These are just some random ideas.
Perhaps we could have a chat (I speak fluent german btw... at least I did sometime ago).

abbbi Apr 16, 2026
Maintainer

optional quiescing (recommended, since "raw" copying would probably corrupt all kinds of log files / journals / databases...)

It uses the qemu agent to freeze filesystems by default while creating the checkpoints, and issues a warning if this fails.

target state check before replication (checkpoint name + [mtime|crc32] of file)

not sure what the best approach is here..

optional keeping of n checkpoints as separate files in a backing chain on the target (hard effort unless the "current" file always has the base name and older backing files have numbered extensions; in that case, vmsync wouldn't need to care how many backing files there are when syncing. Flattening could be done via qemu-img commit if n checkpoints exist on the target after the sync process)

Hm, I don't want to create another backup utility :)

custom nbd-server port (in case of firewall)

That's already implemented (-target-nbd-port); it iterates up from this port for each disk.

nbd-server over ssh support (hard effort ? or opening a ssh session and making a port forwarding means keeping an ssh session open)

hm..

optional pipe filters (zstd/mbuffer/openssl) that would make stream compressed/buffered/encrypted (also maybe hard effort since go is used to actually read / write. Perhaps using a "dumb shell" command that does the read write would allow easier plugging of stream filters ? Just my evening thought)

It's one application reading from NBD A and writing to NBD B, so this doesn't quite make sense to me; adding compression doesn't help here, it's not a server -> client type of situation. Security-wise, TLS could be used for the target NBD service, but that requires some setup beforehand.

I've pushed some new releases with more features you may want to check out (such as parallel processing of disks, etc.).

deajan · 2026-04-17T12:03:05Z

deajan
Apr 17, 2026
Collaborator Author

Currently making some tests.
Quick very low effort remark: a --version option would be helpful so I am sure on which version I actually run the tests ;)

not sure what the best approach is here..

I'd say it depends on what level of "correctness" you want to achieve. At least an mtime check is fairly easy to implement and covers 99% of all problems, the same way rsync does when not running with --checksum.

Hm, I don't want to create another backup utility :)

I totally see your point of course. It's just that in disaster recovery, being able to roll back to n-X snapshots/checkpoints is a very common and useful option. When one replicates, let's say, every 15 minutes and a disaster happened 2 hours ago or so, it's a real plus to have those checkpoints/snapshots ready, without the actual need to restore from a backup.
Anyway, that might be a "maybe later" feature.

That's already implemented (-target-nbd-port); it iterates up from this port for each disk.

Cool, didn't properly read all the options ;)

It's one application reading from NBD A and writing to NBD B, so this doesn't quite make sense to me; adding compression doesn't help here, it's not a server -> client type of situation. Security-wise, TLS could be used for the target NBD service, but that requires some setup beforehand.

As for the security concerns, I've set up a VPN between some test servers to secure transfers.
But I don't really understand why compression wouldn't be a big improvement.
I mean the NBD stream data is basically the contents of a qcow image, which might be compressible or not depending on the actual data, and compression is always welcome when transferring data over a WAN link.

As for the actual tests:
I've played today with v0.16, and while trying to find the ssh known hosts file issue, found another (maybe) issue:

I have two hosts: A and B
Those hosts have public IPs and a VPN between them.

Both hosts are setup so I can ssh from A into B as root and from B into A, using the public IPs or the VPN IPs, with the same ssh key.
Example:

Using public IP

[root@hyper02p ]# ssh -i /root/.ssh/hyper01p hv01.publicip.fqdn
[root@hyper01p ~]#

Using VPN IP

[root@hyper02p ]# ssh -i /root/.ssh/hyper01p 10.11.12.13
[root@hyper01p ~]#

I can use vmsync via the public IPs

[root@hyper02p ]# ./vmsync -source-domain test01p.npf.local --source-uri qemu:///system --target-uri qemu+ssh://hv01.publicip.fqdn/system --ssh-key /root/.ssh/hyper01p --debug
2026/04/17 13:42:47 INFO discovered source domain domain=test01p.local
2026/04/17 13:42:47 INFO skipping cdrom device device=sda
2026/04/17 13:42:47 INFO discovered qcow2 disks count=1
[...]

Whereas the same command via the VPN asks me for the password.

[root@hyper02p ]# ./vmsync -source-domain test01p.npf.local --source-uri qemu:///system --target-uri qemu+ssh://10.11.12.13/system --ssh-key /root/.ssh/hyper01p --debug
root@10.11.12.13's password:

On the hyper01p side, there's only one line I'm getting in the sshd logs when I use vmsync to connect via VPN:

avril 17 13:47:44 hyper01p.local sshd-session[2170]: Connection closed by authenticating user root 10.11.13.13 port 63220 [preauth]

I have checked this multiple times, and have no idea why ssh would work via my VPN but vmsync wouldn't.
I've disabled my firewall on both sides just to rule out any further problem, even though it wouldn't make sense.
The only difference I see would be the MTU, which of course is lower on the VPN link (actual MTU is 1380).

Is there any chance that the ssh implementation can't properly detect the TCP MSS?
I'll set up another VPN link later with an MTU of 1500 (I need to do this locally, of course, with jumbo frames) in order to confirm this diagnosis.

[EDIT]
Both VPN endpoint ifaces have a proper MTU announced

15: wg_vmsync0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1380 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none
    inet 10.11.12.13/24 scope global wg_npf0
       valid_lft forever preferred_lft forever

I could probably use TCP MSS clamping, which would probably resolve the issue, but that's rather a big decision on servers, one that shouldn't be taken just for one program.
[/EDIT]

0 replies

abbbi · 2026-04-19T09:34:53Z

abbbi
Apr 19, 2026
Maintainer

The qemu URI is not part of the regular ssh communication but of the libvirt layer. I guess virsh -c qemu+ssh://10.11.12.13/system may ask for the password, too.

0 replies

deajan · 2026-04-19T10:10:51Z

deajan
Apr 19, 2026
Collaborator Author

Thanks for the insight, that helped me diag the issue.
My initial MTU assumption was wrong, and after having updated my /root/.ssh/config file as well as my /etc/hosts file, I can now successfully work with vmsync over the VPN layer.
My problem came from the fact that specifying the ssh key for your binary works, whereas virsh expects a plain ssh myhost command to work, with the key pre-configured in /root/.ssh/config or similar. What made me realise this was learning that your program uses both a (probably Go) ssh implementation that honours --ssh-key and a virsh invocation that just expects the ssh link to work without --ssh-key.

Perhaps, in order to be consistent, you should remove --ssh-key and just expect the user to have a /home/user/.ssh/config file like:

Host hv02.local
     HostName 10.11.12.13
     Port     22
     User     root
     IdentityFile       /root/.ssh/hyper02p

This way, both implementations would use the same authentication method, as defined by the system.
Of course, I have no idea if you're actually invoking ssh as shell command or if you use a go implementation for your ssh control channel, so that might not make sense at all.

Apart from the ssh stuff:
Interestingly enough, I found that vmsync left a qemu-nbd server running while doing some test syncs:

2026/04/19 12:01:19 INFO removed checkpoint after sync failure checkpoint=vmsync-cpt-000001
2026/04/19 12:01:19 ERROR sync failed error=start target qemu-nbd for /var/lib/libvirt/images/test01p.local-disk0.qcow2: ssh run command "qemu-nbd --fork --persistent --format=qcow2 --bind '0.0.0.0' --port 20809 --pid-file '/tmp/vmsync-qemu-nbd-test01p.local-vda.pid' '/var/lib/libvirt/images/test01p.local-disk0.qcow2'": Process exited with status 1: qemu-nbd: Failed to find an available port: Address already in use

I've logged into the remote host to check:

ps aux | grep nbd
root     28037  0.0  0.0   6400  2276 pts/1    S+   12:07   0:00 grep --color=auto nbd
root     61739  0.0  0.0 210276  7196 ?        Ssl  avril17   0:00 qemu-nbd --fork --persistent --format=qcow2 --bind 0.0.0.0 --port 20809 --pid-file /tmp/vmsync-qemu-nbd-test01p.local-vda.pid /var/lib/libvirt/images/test01p.local-disk0.qcow2

I've killed the process with kill 61739, but the pid file stays (and yes, the content of the pid file was the corresponding pid). In Python, I use atexit to make sure I get to clean up child processes before shutting down (unless kill -9 is involved, obviously).
Perhaps using the Go equivalent would be a good solution for vmsync as well.
I've only killed the process (and left the pid file behind), but could start vmsync again.
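For reference, the Python atexit pattern mentioned above looks roughly like the sketch below. The function names are hypothetical and it only covers a locally started helper process; vmsync starts qemu-nbd on the remote host over ssh with --fork, so the real fix would look different. Go has no direct atexit counterpart; the usual pattern there is defer combined with signal handling.

```python
import atexit
import os
import subprocess
import sys


def stop_child(proc, pid_file):
    """Terminate a helper process and remove its pid file; safe to call twice."""
    if proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()  # last resort if the helper ignores SIGTERM
            proc.wait()
    try:
        os.remove(pid_file)
    except FileNotFoundError:
        pass


def start_helper(argv, pid_file):
    """Start a helper, record its pid, and register cleanup for interpreter exit."""
    proc = subprocess.Popen(argv)
    with open(pid_file, "w") as fh:
        fh.write(str(proc.pid))
    atexit.register(stop_child, proc, pid_file)
    return proc
```

The atexit handler runs on normal exit and on unhandled exceptions, but not after SIGKILL, which matches the kill -9 caveat above.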

Anyway, I'm back at making vmsync experiments now ;)

0 replies

deajan · 2026-04-19T11:00:14Z

deajan
Apr 19, 2026
Collaborator Author

Just some quick feedback:

Found the issue with the sticking qemu-nbd server:
I've run a couple of benchmarks inside and outside of the VPN, with versions 0.12 and 0.16.

While working outside of the VPN, I didn't have the default qemu-nbd port open on target system, so vmsync ended with error:

[...]
2026/04/19 12:36:03 INFO removed checkpoint after sync failure checkpoint=vmsync-cpt-000001
2026/04/19 12:36:03 ERROR sync failed error=wait for target nbd export hv01.somdomain.tld:20809: nbd export not ready on hv01.somdomain.tld:20809

This ended vmsync but left qemu-nbd active on the target system, which explains the error report above.

I've made tests with both vmsync 0.16 and 0.12:
Performance-wise they're almost equal. Surprisingly, I get much better speeds today, but there can be a lot of explanations (WAN routes may have changed, there's less traffic on Sundays, whatever).

At least these numbers are way closer to my zfs replications.

There is a minor feature I'd like to request:

a --json parameter that would, at the end of the vmsync run, print something like {"result": true, "vm": "myvmname", "transfer_duration": 121, "written_bytes": 5871763456, "checkpoint": "vmsync-cpt-000004"}

This would allow me to parse the info and integrate it into a monitoring system.
I would be able to set up vmsync on my test servers as a cron task once I can get those JSON metrics.
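On the consuming side, a monitoring wrapper could pick up such a summary as sketched below. This assumes the requested (not yet existing) --json option prints one JSON object as the last line of output, with the field names from the example above; the function names are hypothetical.

```python
import json


def parse_summary(output):
    """Return the JSON object on the last non-empty output line, or None."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return None
    try:
        data = json.loads(lines[-1])
    except ValueError:
        return None  # last line was a regular log line, not JSON
    return data if isinstance(data, dict) else None


def replication_ok(summary):
    """Treat a missing summary or a false 'result' as a failed replication."""
    return bool(summary) and summary.get("result") is True
```

Treating "no summary at all" as a failure means a crashed or killed run is reported the same way as an explicit error.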

0 replies

deajan · 2026-04-20T10:40:27Z

deajan
Apr 20, 2026
Collaborator Author

Thanks for the version number in logs as well as the nbd server fix.
I've setup a cron to replicate a VM from host A to B and another VM from host B to A.
Currently redirecting all the logs to a file in order to check later that everything went smooth.
In a day or so, I'll have a "sanity" check done for both replicas.

3 replies

deajan Apr 20, 2026
Collaborator Author

Quick side note, I am still replicating a VM, and have setup a 5 minute cron job with the exact same command:

2026/04/20 12:45:01 INFO [/root/vmsync --source-domain b03i..local --source-uri qemu:///system --target-uri qemu+ssh://hv01.local/system --ssh-key /root/.ssh/hyper01p -ssh-insecure-host-key], Version: 0.18
2026/04/20 12:45:02 INFO discovered source domain domain=bgp03i.val.npf.local
2026/04/20 12:45:02 INFO skipping cdrom device device=sda
2026/04/20 12:45:02 INFO discovered qcow2 disks count=1
2026/04/20 12:45:02 INFO source URI does not use SSH; qemu-img info will run locally
2026/04/20 12:45:02 INFO running local qemu-img info disk=vda path=/var/lib/libvirt/images/b03i.local-disk0.qcow2
2026/04/20 12:45:02 INFO disk info disk=vda format=qcow2 virtual_size=0 path=/var/lib/libvirt/images/b03i.local-disk0.qcow2 discard=
2026/04/20 12:45:02 ERROR sync failed error=Incremental sync attempted but target domain does not exist: domain=03i.local

Obviously this fails because the initial replication is not yet done, so there is no target domain defined yet.
Perhaps there should be some kind of concurrency check, pidfile-style, so that no more than one command can run at a time for the same VM?

The design itself prevented errors here, but if vmsync is set up to run every 5 minutes and a replication needs more than 5 minutes, I think the second run should abort.
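One way to get such a per-VM concurrency check from a wrapper script is an advisory file lock rather than a pid file, since the kernel releases the lock when the process exits, even after a crash. A minimal sketch follows; the lock file naming and the default directory are assumptions, and vmsync itself has no such mechanism.

```python
import fcntl
import os


def acquire_vm_lock(vm, lock_dir="/run/lock"):
    """Try to take an exclusive, non-blocking lock for one VM.

    Returns an open file descriptor on success (keep it open for the whole
    run), or None if another run already holds the lock.
    """
    path = os.path.join(lock_dir, f"vmsync-{vm}.lock")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A cron wrapper would call `acquire_vm_lock()` first and exit quietly when it returns None. For plain cron entries, prefixing the command with `flock -n <lockfile>` from util-linux achieves the same effect without any code.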

abbbi Apr 20, 2026
Maintainer

Ofc, there's no locking mechanism implemented currently.

deajan Apr 20, 2026
Collaborator Author

I'm already quite happy to be able to test this tool TBH ;)
I'm just using this discussion as some kind of "whiteboard" for all things I find about the tool. Just let me know if that makes too much noise ;)

Btw, since my last post, I've had 37 consecutive runs of vmsync without any error. Guess I'll find what happens once vmsync-cpt-000037 becomes vmsync-cpt-999999 in roughly 2 years at this pace ;)

This evening I'll shut down both source and target VMs, make another sync, clone both machines and then compare the clones while the original sync continues to work...
For the comparison I'll use something along the lines of mounting the qcow2 file, mounting the main FS and md5summing all files on both copies. That's a bit "lazy", but since it's not a byte-for-byte copy, that's the quickest way I can think of to check for coherent replication.
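The file-level comparison could be sketched like this, assuming both images are already mounted read-only somewhere (the mounting itself is not shown). The function names are hypothetical; symlinks and special files are skipped, so only regular file contents are compared.

```python
import hashlib
import os


def tree_digests(root):
    """Map each regular file's path (relative to root) to its md5 hex digest."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            md5 = hashlib.md5()
            with open(full, "rb") as fh:
                for block in iter(lambda: fh.read(1 << 20), b""):
                    md5.update(block)
            digests[os.path.relpath(full, root)] = md5.hexdigest()
    return digests


def diff_trees(source_root, replica_root):
    """Return the sorted relative paths that are missing or differ on either side."""
    src = tree_digests(source_root)
    dst = tree_digests(replica_root)
    return sorted(p for p in src.keys() | dst.keys() if src.get(p) != dst.get(p))
```

An empty list from `diff_trees()` means every regular file has identical content on both sides; metadata such as ownership and timestamps is not checked.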

deajan · 2026-04-20T22:19:58Z

deajan
Apr 20, 2026
Collaborator Author

So far, one replica has failed.
I got a 7.5GB qcow2 file which I can mount (the boot partition seems fine, but the main partition is unreadable).
Since I got the above "nbd export not ready" error, that might be the explanation.

I've decided to discard both replicas and start over again with vmsync 0.19.
I've setup again a 5 minute cron file from A to B and from B to A.

I'll report back.

0 replies

abbbi · 2026-04-21T06:16:22Z

abbbi
Apr 21, 2026
Maintainer

For testing, I've been using something like this:

vm is running qemu agent and accessible via ssh
create file in source vm, store md5sum
vmsync
use virt-cat on target hv to cat file in synced vm image, compute md5sum
compare

So far I've not seen any issues.

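# Abort on errors, unset variables, and failures inside pipelines.
set -e
set -u
set -o pipefail

# Create 5 MiB of random data inside the source VM and record its md5sum.
SUM=$(ssh -l arch 192.168.121.35 "dd if=/dev/urandom bs=1M count=5 of=data2 status=none; md5sum data2" | awk '{print $1}')
# Replicate the domain to the target hypervisor.
sudo -E /home/abi/go/bin/go run ./cmd/vmsync/ -source-domain fstrim -source-uri qemu+ssh://root@localhost/system --ssh-user root --target-uri qemu+ssh://192.168.161.196/system -ssh-insecure-host-key

# Read the same file out of the synced image on the target and hash it.
TSUM=$(ssh -l root 192.168.161.196 virt-cat -a /tmp/tmp.NU7Aaf6bVg/fstrim.qcow2 /home/arch/data2 | md5sum | awk '{print $1}')

# Report a mismatch and exit non-zero so callers can detect the failure.
if [ "$TSUM" != "$SUM" ]; then
    echo "fail" >&2
    exit 1
fi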

3 replies

deajan Apr 21, 2026
Collaborator Author

Looks like a good testing strategy ;)
I will still make my tests, ie let my vmsync crons run (restarted fresh yesterday evening), and at some point, stop the sync, stop the original VM and compare FSes of the original and replica. I'm trying to validate the long run.
I also log the vmsync stdout/stderr output to make sure I get to see where possible errors could be introduced.

My main concerns are:

Aborted replica actions, i.e. what happens on a WAN link failure while vmsync is running (I suppose that the qcow2 overlay is half written, and also suppose that it will be overwritten by the next run after a previous failure)
How to ensure that the replica is "in good shape" to receive the next replica

PS: I will be offline for 9 days starting Friday. I'll try to report back before then.

Again, and since you decided to take some time to implement this software, I am really willing to sponsor your work, regardless of the outcome, as long as the source gets released under GPL/BSD/MIT at some point.
Heck, I am even willing to learn to code in Go just to add some features at some point, if you're comfortable with that idea.
I'm really looking forward to three major features in the future:

Check if replica is consistent before sending next diff (via target disk files mtime or checksum comparison before and after each run + last checkpoint comparison with current new checkpoint which should be n+1)
Restart full replica if not consistent (automatically or via --force-full option which would clobber destination)
Allow to change target disk path (would allow me to run vmsync in production side-by-side with my current zfs replication until I feel comfortable for a switch)

Anyway, for now I can only say this:
Thank you for spending some time on this.

abbbi Apr 21, 2026
Maintainer

We could agree on a source release with a proper license in exchange for sponsoring; you can contact me privately via abi@grinser.de

deajan Apr 21, 2026
Collaborator Author

We could agree on a source release with a proper license in exchange for sponsoring; you can contact me privately via abi@grinser.de

Reply done ;)
