- 1. Overview
- 2. Common Recovery Scenarios
- 3. Data Verification
- 4. Segment Recovery
- 5. Node Failure Recovery
- 6. Cluster Recovery
- 7. Backup and Restore
- 8. Emergency Procedures
- Appendix: CLI Command Reference
Runbook for common LANCE recovery scenarios including data corruption, node failure, and disaster recovery.
This document provides step-by-step procedures for recovering LANCE from various failure scenarios. Always verify backups exist before attempting recovery operations.
- Access to LANCE CLI (`lnc`)
- Access to data directory
- Sufficient disk space for recovery operations
- Network access to cluster nodes (for cluster recovery)
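The prerequisites above can be checked before starting any recovery work. The following is a minimal sketch, not part of the LANCE tooling: the function names and the free-space threshold are hypothetical, and only the `lnc` binary name and the `/data/lance` path come from this runbook.

```shell
#!/bin/sh
# Pre-flight checks before a recovery operation (illustrative sketch).
# All function names here are hypothetical helpers, not LANCE commands.

# Succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if the directory exists and is readable.
dir_accessible() {
    [ -d "$1" ] && [ -r "$1" ]
}

# Prints the free space, in KiB, on the filesystem holding the given path.
free_kib() {
    df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print $4 }'
}

# Succeeds if at least $2 KiB are free on the filesystem holding $1.
has_free_space() {
    avail=$(free_kib "$1")
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Runs every check and reports each failure; returns non-zero if any failed.
preflight() {
    ok=0
    have_cmd lnc || { echo "lnc not found on PATH" >&2; ok=1; }
    dir_accessible "$1" || { echo "cannot read $1" >&2; ok=1; }
    has_free_space "$1" "$2" || { echo "less than $2 KiB free for $1" >&2; ok=1; }
    return "$ok"
}
```

For example, `preflight /data/lance 10485760` would require roughly 10 GiB free; pick a threshold that fits the size of the data being recovered.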
| Priority | Scenario | Impact |
|---|---|---|
| P1 | Total data loss | Critical - immediate action |
| P2 | Leader node failure | High - automatic failover should occur |
| P3 | Follower node failure | Medium - reduced redundancy |
| P4 | Corrupt segment | Low - single segment affected |
Symptoms:
- CRC validation failures during reads
- Error logs showing "checksum mismatch"
- Consumer unable to read specific offset ranges
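To confirm the log symptom quickly, the number of "checksum mismatch" lines can be counted. This is a rough sketch: the helper name is hypothetical, and the log location is an assumption because this runbook does not say where LANCE writes its logs.

```shell
#!/bin/sh
# Count "checksum mismatch" lines in a log file (illustrative sketch).
# The caller supplies the log path; this runbook does not specify one.
count_checksum_errors() {
    if [ ! -r "$1" ]; then
        echo "cannot read log: $1" >&2
        return 2
    fi
    # grep -c prints 0 and exits 1 when nothing matches, so mask that status.
    grep -c -F "checksum mismatch" "$1" || true
}
```

A call such as `count_checksum_errors /var/log/lance/server.log` (hypothetical path) prints the count; a non-zero result is a cue to run `lnc verify-data` on the affected topic.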
Recovery Steps:
# 1. Identify the corrupt segment
lnc verify-data --path /data/lance --topic <topic_name>
# 2. If recoverable, repair the segment
lnc repair --path /data/lance/segments/<topic_name>/<segment>.lnc
# 3. If not recoverable, truncate to last valid offset
lnc scan --path /data/lance/segments/<topic_name>/<segment>.lnc
# Note the last valid offset, then truncate
# 4. Rebuild the index
lnc rebuild-index --path /data/lance/segments/<topic_name>

Symptoms:
- Fast lookups fail
- Fallback to linear scan (slow reads)
- Missing `.idx` file in topic directory
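The missing-index symptom can be checked across all topics at once. The sketch below assumes the layout used in this runbook's examples (one directory per topic under `/data/lance/segments`, with the index stored as an `.idx` file inside it); the helper name is hypothetical.

```shell
#!/bin/sh
# List topic directories that contain no .idx file (illustrative sketch).
# $1 is the segments root, e.g. /data/lance/segments in this runbook's examples.
topics_missing_index() {
    for dir in "$1"/*/; do
        # Skip the unexpanded pattern when the root has no subdirectories.
        [ -d "$dir" ] || continue
        # ls fails when the glob matches nothing, i.e. no index file exists.
        if ! ls "$dir"*.idx >/dev/null 2>&1; then
            basename "$dir"
        fi
    done
}
```

Each topic name printed is a candidate for `lnc rebuild-index`.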
Recovery Steps:
# Rebuild index from segment data
lnc rebuild-index --path /data/lance/segments/<topic_name>
# Verify the rebuilt index
lnc inspect-index --path /data/lance/segments/<topic_name>/index.idx

Symptoms:
- Server fails to start
- Error: "WAL replay failed"
- Incomplete writes before crash
Recovery Steps:
# 1. Back up current state (-a preserves ownership, permissions, and timestamps)
cp -a /data/lance /data/lance.backup
# 2. Attempt WAL repair
lnc repair --path /data/lance/wal
# 3. If repair fails, skip corrupt entries
# WARNING: This may lose recent uncommitted data
lnc repair --path /data/lance/wal --skip-corrupt
# 4. Restart the server
systemctl start lance

# Verify all data for a topic
lnc verify-data --path /data/lance --topic <topic_name>
# Stop on first error (for scripting)
lnc verify-data --path /data/lance --topic <topic_name> --fail-fast

# Scan individual segment
lnc scan --path /data/lance/segments/<topic_name>/<segment>.lnc
# Output includes:
# - Record count
# - Offset range
# - CRC status for each record

For production, run periodic verification:
# Cron job (daily at 3 AM) — replace <topic_name> with your topic name
0 3 * * * /usr/local/bin/lnc verify-data --path /data/lance --topic <topic_name> >> /var/log/lance-verify.log 2>&1

When a segment has corruption at the end (common after a crash):
# 1. Find last valid record
lnc scan --path /data/lance/segments/<topic_name>/<segment>.lnc
# 2. Note the offset of last valid record
# 3. Truncate using repair
lnc repair --path /data/lance/segments/<topic_name>/<segment>.lnc

If a segment is completely unrecoverable:
# 1. Move the corrupt segment into quarantine (create the directory if needed)
mkdir -p /data/lance/quarantine
mv /data/lance/segments/<topic_name>/<segment>.lnc /data/lance/quarantine/
# 2. Update metadata to skip the segment
# This creates a gap in offsets - consumers must handle this
# 3. Rebuild index to reflect changes
lnc rebuild-index --path /data/lance/segments/<topic_name>

Expected Behavior: Automatic failover within the election timeout (150-300 ms)
If Automatic Failover Fails:
# 1. Check cluster status from any healthy node
lnc cluster-status --server <healthy_node>:1992
# 2. If no leader elected, check quorum
# Need majority of nodes online for election
# 3. If split-brain suspected, isolate the partition
# Stop the minority partition nodes
# 4. Force election on majority partition
# (Automatic - nodes will elect after timeout)

# 1. Start the failed node
systemctl start lance
# 2. Node will automatically:
# - Rejoin cluster
# - Receive missed replication data
# - Become available for reads
# 3. Verify node health
lnc cluster-status --server <recovered_node>:1992

When adding a new node to replace a failed one:
# 1. Update cluster configuration with new node address
# Edit lance.toml: cluster.peers
# 2. Start the new node
systemctl start lance
# 3. Node will receive full state via replication
# 4. Verify cluster health
lnc cluster-status --server <any_node>:1992

Symptoms:
- Two leaders accepting writes
- Divergent data between partitions
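Because an election needs a majority of nodes online, it helps to work out which side of a partition can legitimately hold a leader. The sketch below shows only the majority arithmetic; the function names are hypothetical, and the node counts have to be gathered manually (for example from `lnc cluster-status` on each side).

```shell
#!/bin/sh
# Majority arithmetic for a cluster of N nodes (illustrative sketch).

# Prints the smallest number of nodes that forms a majority of $1 nodes.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds if $2 reachable nodes form a majority of a $1-node cluster.
has_quorum() {
    [ "$2" -ge "$(quorum_size "$1")" ]
}
```

In a 5-node cluster split 3/2, `has_quorum 5 3` succeeds and `has_quorum 5 2` fails, so the 2-node side is the minority. With an even split (for example 2/2 in a 4-node cluster), neither side has a majority.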
Recovery Steps:
# 1. Identify which partition has more recent data
# Check term numbers - higher term is authoritative
# 2. Stop nodes in the stale partition
systemctl stop lance # on stale nodes
# 3. Export any unique data from stale partition
lnc export-topic --path /data/lance --topic <topic_name> --output /backup/stale_data.lance
# 4. Wipe stale partition data
rm -rf /data/lance/*
# 5. Restart nodes - they will sync from authoritative partition
systemctl start lance
# 6. Manually merge exported data if needed

Recovery from backup:
# 1. Start one node as standalone
lance --standalone --data-dir /data/lance
# 2. Import backup data
lnc import-topic --path /data/lance --name <topic_name> --input /backup/topic.lance
# 3. Repeat for all topics
# 4. Stop standalone mode, restart in cluster mode
systemctl restart lance
# 5. Add remaining nodes to cluster

# Export single topic
lnc export-topic --path /data/lance --topic <topic_name> --output /backup/<topic_name>.lance
# Export all topics (script)
for topic in $(lnc list-topics --server localhost:1992 | jq -r '.[].name'); do
    lnc export-topic --path /data/lance --topic "$topic" --output "/backup/${topic}.lance"
done

# Restore single topic
lnc import-topic --path /data/lance --name <topic_name> --input /backup/<topic_name>.lance
# Verify restored data
lnc verify-data --path /data/lance --topic <topic_name>

- Frequency: Daily full backup, continuous replication for DR
- Retention: 7 days minimum, 30 days recommended
- Testing: Monthly restore test to verify backup integrity
- Storage: Off-site backup for disaster recovery
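The retention guideline can be enforced with a small pruning step after each backup run. This sketch assumes each daily export is kept as a separate `.lance` file in one flat directory (a hypothetical layout, since the export examples in this runbook write one file per topic); the function name is also hypothetical.

```shell
#!/bin/sh
# Delete .lance backup files older than the retention window (illustrative sketch).
# $1 = backup directory, $2 = retention in days.
prune_backups() {
    # -mtime +N matches files last modified more than N whole days ago.
    # -maxdepth 1 keeps the search to the backup directory itself.
    # Each deleted path is printed so the run can be logged.
    find "$1" -maxdepth 1 -type f -name '*.lance' -mtime +"$2" -print -delete
}
```

For the recommended 30-day retention this would be called as `prune_backups /backup 30`. Run it only after the latest backup has been verified, so that a failed backup never leaves the directory empty.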
If data corruption is spreading:
# Immediately stop all nodes
for node in node1 node2 node3; do
    ssh "$node" "systemctl stop lance"
done

To prevent further writes while investigating:
# Set read-only flag (requires restart)
echo "read_only = true" >> /etc/lance/lance.toml
systemctl restart lance

| Role | Contact |
|---|---|
| On-Call Engineer | [PagerDuty/OpsGenie] |
| Database Lead | [Email/Slack] |
| Infrastructure | [Email/Slack] |
| Command | Purpose |
|---|---|
| `lnc verify-data` | Check data integrity |
| `lnc repair` | Repair corrupt segments |
| `lnc rebuild-index` | Rebuild topic index |
| `lnc scan` | Scan segment contents |
| `lnc export-topic` | Export topic for backup |
| `lnc import-topic` | Import topic from backup |
| `lnc cluster-status` | Check cluster health |