Skip to content

Latest commit

 

History

History
365 lines (249 loc) · 8.45 KB

File metadata and controls

365 lines (249 loc) · 8.45 KB

← Back to Docs Index

RecoveryProcedures

Table of Contents

Runbook for common LANCE recovery scenarios including data corruption, node failure, and disaster recovery.


1. Overview

This document provides step-by-step procedures for recovering LANCE from various failure scenarios. Always verify backups exist before attempting recovery operations.

1.1 Prerequisites

  • Access to LANCE CLI (lnc)
  • Access to data directory
  • Sufficient disk space for recovery operations
  • Network access to cluster nodes (for cluster recovery)

1.2 Recovery Priority

Priority Scenario Impact
P1 Total data loss Critical - immediate action
P2 Leader node failure High - automatic failover should occur
P3 Follower node failure Medium - reduced redundancy
P4 Corrupt segment Low - single segment affected

2. Common Recovery Scenarios

2.1 Corrupt Segment

Symptoms:

  • CRC validation failures during reads
  • Error logs showing "checksum mismatch"
  • Consumer unable to read specific offset ranges

Recovery Steps:

# 1. Identify the corrupt segment
lnc verify-data --path /data/lance --topic <topic_name>

# 2. If recoverable, repair the segment
lnc repair --path /data/lance/segments/<topic_name>/<segment>.lnc

# 3. If not recoverable, truncate to last valid offset
lnc scan --path /data/lance/segments/<topic_name>/<segment>.lnc
# Note the last valid offset, then truncate

# 4. Rebuild the index
lnc rebuild-index --path /data/lance/segments/<topic_name>

2.2 Lost Index File

Symptoms:

  • Fast lookups fail
  • Fallback to linear scan (slow reads)
  • Missing .idx file in topic directory

Recovery Steps:

# Rebuild index from segment data
lnc rebuild-index --path /data/lance/segments/<topic_name>

# Verify the rebuilt index
lnc inspect-index --path /data/lance/segments/<topic_name>/index.idx

2.3 WAL Corruption

Symptoms:

  • Server fails to start
  • Error: "WAL replay failed"
  • Incomplete writes before crash

Recovery Steps:

# 1. Backup current state
cp -r /data/lance /data/lance.backup

# 2. Attempt WAL repair
lnc repair --path /data/lance/wal

# 3. If repair fails, skip corrupt entries
# WARNING: This may lose recent uncommitted data
lnc repair --path /data/lance/wal --skip-corrupt

# 4. Restart the server
systemctl start lance

3. Data Verification

3.1 Full Topic Verification

# Verify all data for a topic
lnc verify-data --path /data/lance --topic <topic_name>

# Stop on first error (for scripting)
lnc verify-data --path /data/lance --topic <topic_name> --fail-fast

3.2 Segment-Level Verification

# Scan individual segment
lnc scan --path /data/lance/segments/<topic_name>/<segment>.lnc

# Output includes:
# - Record count
# - Offset range
# - CRC status for each record

3.3 Continuous Verification

For production, run periodic verification:

# Cron job (daily at 3 AM) — replace <topic_name> with your topic name
0 3 * * * /usr/local/bin/lnc verify-data --path /data/lance --topic <topic_name> >> /var/log/lance-verify.log 2>&1

4. Segment Recovery

4.1 Truncate Corrupt Tail

When a segment has corruption at the end (common after crash):

# 1. Find last valid record
lnc scan --path /data/lance/segments/<topic_name>/<segment>.lnc

# 2. Note the offset of last valid record

# 3. Truncate using repair
lnc repair --path /data/lance/segments/<topic_name>/<segment>.lnc

4.2 Remove Corrupt Segment

If a segment is completely unrecoverable:

# 1. Backup the corrupt segment
mv /data/lance/segments/<topic_name>/<segment>.lnc /data/lance/quarantine/

# 2. Update metadata to skip the segment
# This creates a gap in offsets - consumers must handle this

# 3. Rebuild index to reflect changes
lnc rebuild-index --path /data/lance/segments/<topic_name>

5. Node Failure Recovery

5.1 Leader Node Failure

Expected Behavior: Automatic failover within election timeout (150-300ms)

If Automatic Failover Fails:

# 1. Check cluster status from any healthy node
lnc cluster-status --server <healthy_node>:1992

# 2. If no leader elected, check quorum
# Need majority of nodes online for election

# 3. If split-brain suspected, isolate the partition
# Stop the minority partition nodes

# 4. Force election on majority partition
# (Automatic - nodes will elect after timeout)

5.2 Follower Node Recovery

# 1. Start the failed node
systemctl start lance

# 2. Node will automatically:
#    - Rejoin cluster
#    - Receive missed replication data
#    - Become available for reads

# 3. Verify node health
lnc cluster-status --server <recovered_node>:1992

5.3 Complete Node Replacement

When adding a new node to replace a failed one:

# 1. Update cluster configuration with new node address
# Edit lance.toml: cluster.peers

# 2. Start the new node
systemctl start lance

# 3. Node will receive full state via replication

# 4. Verify cluster health
lnc cluster-status --server <any_node>:1992

6. Cluster Recovery

6.1 Split-Brain Recovery

Symptoms:

  • Two leaders accepting writes
  • Divergent data between partitions

Recovery Steps:

# 1. Identify which partition has more recent data
# Check term numbers - higher term is authoritative

# 2. Stop nodes in the stale partition
systemctl stop lance  # on stale nodes

# 3. Export any unique data from stale partition
lnc export-topic --path /data/lance --topic <topic_name> --output /backup/stale_data.lance

# 4. Wipe stale partition data
rm -rf /data/lance/*

# 5. Restart nodes - they will sync from authoritative partition
systemctl start lance

# 6. Manually merge exported data if needed

6.2 Total Cluster Loss

Recovery from backup:

# 1. Start one node as standalone
lance --standalone --data-dir /data/lance

# 2. Import backup data
lnc import-topic --path /data/lance --name <topic_name> --input /backup/topic.lance

# 3. Repeat for all topics

# 4. Stop standalone mode, restart in cluster mode
systemctl restart lance

# 5. Add remaining nodes to cluster

7. Backup and Restore

7.1 Create Backup

# Export single topic
lnc export-topic --path /data/lance --topic <topic_name> --output /backup/<topic_name>.lance

# Export all topics (script)
for topic in $(lnc list-topics --server localhost:1992 | jq -r '.[].name'); do
    lnc export-topic --path /data/lance --topic "$topic" --output "/backup/${topic}.lance"
done

7.2 Restore from Backup

# Restore single topic
lnc import-topic --path /data/lance --name <topic_name> --input /backup/<topic_name>.lance

# Verify restored data
lnc verify-data --path /data/lance --topic <topic_name>

7.3 Backup Best Practices

  • Frequency: Daily full backup, continuous replication for DR
  • Retention: 7 days minimum, 30 days recommended
  • Testing: Monthly restore test to verify backup integrity
  • Storage: Off-site backup for disaster recovery

8. Emergency Procedures

8.1 Emergency Stop

If data corruption is spreading:

# Immediate stop all nodes
for node in node1 node2 node3; do
    ssh $node "systemctl stop lance"
done

8.2 Read-Only Mode

To prevent further writes while investigating:

# Set read-only flag (requires restart)
echo "read_only = true" >> /etc/lance/lance.toml
systemctl restart lance

8.3 Emergency Contacts

Role Contact
On-Call Engineer [PagerDuty/OpsGenie]
Database Lead [Email/Slack]
Infrastructure [Email/Slack]

Appendix: CLI Command Reference

Command Purpose
lnc verify-data Check data integrity
lnc repair Repair corrupt segments
lnc rebuild-index Rebuild topic index
lnc scan Scan segment contents
lnc export-topic Export topic for backup
lnc import-topic Import topic from backup
lnc cluster-status Check cluster health

↑ Back to Top | ← Back to Docs Index