Data loading

# Description

## Background

### ETL & Data Synchronization Strategies

#### 1. Synchronization Types
*   **Full Sync**: A complete data reload from the source to the target DB. This is the most straightforward approach to guarantee 100% data consistency.
*   **Full Scheduled Sync**: Automated full reloads triggered by a scheduler at specific intervals.
*   **Delta Sync (Incremental)**: Synchronizes data based on a high-water mark (e.g., `LastModifiedDate` or `AutoIncrementID`). 
    *   *Note*: This strategy acts as an **Audit Log**. If a record is deleted in the source, the target preserves the history, ensuring no data loss for analytical purposes.
*   **Delta Scheduled Sync**: Periodic incremental updates to keep the target DB near-current with minimal overhead.

#### 2. Scheduling Architecture
*   The **Worker Microservice** manages execution. Scheduled jobs are persisted in the service's own database. The worker polls these jobs and executes the sync logic, ensuring decoupling from the main application API.

#### 3. Data Transformation Pipeline (Join/Union/Column)
To avoid the common bugs associated with in-place mutations, we use a **Functional/Snapshot approach**:

*   **Immutable Source Datasets**: The raw data from the source remains untouched (Source of Truth).
*   **Transformed Data Snapshots**: Instead of modifying the buffer, we create a "View" or "Snapshot" that references the source but contains the applied transformations (Join, Union, or Column mapping).
*   **Stream Processing via Channels**: 
    *   Transformations are performed during the transfer process using `System.Threading.Channels`.
    *   This allows for **Batch Overloading**: as soon as a batch is read from the source, it's pushed into a transformation channel and then immediately written to the target. This minimizes memory pressure and maximizes throughput.
*   **Schema Evolution**: Supports adding, modifying, or removing columns and type casting on-the-fly within the pipeline.

### High-Performance "On-the-Fly" Transformation Engine

Processing massive datasets involves heavy CPU and Memory overhead. To achieve near-native performance, the engine utilizes a **Zero-copy Pipeline** and **SIMD-accelerated** transformations.

#### 1. Native Memory Management & GC Avoidance
*   **Off-Heap Hash Tables**: Performing In-Memory Joins on large streams usually triggers massive GC pressure. By implementing custom Hash Tables in **Native Memory**, we can process tens of gigabytes in RAM without a single "Stop-the-World" pause.
*   **Pre-allocated Buffers**: Instead of allocating millions of `string` or `DateTime` objects, the Reader writes raw data into pre-allocated memory segments (`NativeMemory` or `ArrayPool`).

#### 2. SIMD & C++ Interop (P/Invoke)
For computationally expensive tasks, we offload processing to a C++ Engine:
*   **Batch Processing**: Instead of individual calls, we pass a pointer to an entire **Memory Segment (Batch)** to the C++ Transformer.
*   **SIMD Vectorization**: The C++ engine uses vector instructions to process multiple rows or columns simultaneously (e.g., parsing 8-16 dates or numeric values in a single CPU cycle).
*   **Zero-copy Handover**: C# (Reader) fills the buffer -> C++ (Transformer) processes it in-place -> C# (Writer) performs a BulkInsert. Only pointers are passed between layers, ensuring zero data copying overhead.

#### 3. Asynchronous Pipeline (System.Threading.Channels)
The architecture follows a "Producer-Consumer" pattern, decoupled via Channels to maximize hardware utilization:

*   **Thread 1 (Reader)**: Pulls "Chunk A" from the source DB and pushes its pointer to the first Channel. It immediately proceeds to "Chunk B," ensuring the DB connection is never idle.
*   **Thread 2 (Transformer)**: Consumes the pointer, calls the C++ Engine via P/Invoke, and awaits completion. While C++ is crunching numbers using all available CPU cores, the Reader continues filling the next buffer.
*   **Thread 3 (Writer)**: Takes the processed pointer from the second Channel and executes a BulkInsert into the target DB.

**Result**: A fully non-blocking pipeline where I/O (Reading/Writing) and CPU-bound tasks (Transforming) run in parallel, saturating both Network and CPU bandwidth.

```csharp
// Rent a piece of memory (via NativeMemory)
IntPtr buffer = NativeMemory.Alloc(1024 * 1024 * 10); // 10 MB

// Pass it to C++ for transformation
// One P/Invoke call for the entire batch!
CppEngine.TransformBatch(buffer, rowCount, transformType); 

// In C++ (inside DLL)
void TransformBatch(void* data, int count, int type) {
    // Work with raw memory as quickly as possible
    // Use C++ multithreading (std::parallel_policy)
}
```

### Data Ingestion & I/O Optimization

To feed the high-performance pipeline, we minimize overhead at the very first stage — reading from the source database.

#### 1. Direct Buffer Loading (ADO.NET)
Instead of using high-level ORMs (like Entity Framework) that create thousands of objects, we use low-level **DbDataReader**. 
*   **Mechanics**: Methods like `reader.GetBytes()` or `reader.GetValues()` allow us to stream data directly into buffers rented from `ArrayPool` or allocated via `NativeMemory`.
*   **Benefit**: This bypasses unnecessary string allocations and intermediate object mapping, keeping the data in a binary-friendly format for the C++ engine.

#### 2. Native Binary Streaming (Bulk Connectors)
For maximum throughput, we utilize database-specific streaming APIs:
*   **Specific APIs**: Using tools like **NpgsqlBinaryExporter** for PostgreSQL or **SqlBulkCopy** (for writes) in SQL Server.
*   **Binary Pass-through**: This is the fastest possible method. Data arrives from the socket in its native binary format with almost zero parsing on the .NET side. We simply "pipe" the data from the network socket directly into the buffer for the C++ Transformer.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data loading #29

Description

Background

ETL & Data Synchronization Strategies

1. Synchronization Types

2. Scheduling Architecture

3. Data Transformation Pipeline (Join/Union/Column)

High-Performance "On-the-Fly" Transformation Engine

1. Native Memory Management & GC Avoidance

2. SIMD & C++ Interop (P/Invoke)

3. Asynchronous Pipeline (System.Threading.Channels)

Data Ingestion & I/O Optimization

1. Direct Buffer Loading (ADO.NET)

2. Native Binary Streaming (Bulk Connectors)

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Data loading #29

Description

Description

Background

ETL & Data Synchronization Strategies

1. Synchronization Types

2. Scheduling Architecture

3. Data Transformation Pipeline (Join/Union/Column)

High-Performance "On-the-Fly" Transformation Engine

1. Native Memory Management & GC Avoidance

2. SIMD & C++ Interop (P/Invoke)

3. Asynchronous Pipeline (System.Threading.Channels)

Data Ingestion & I/O Optimization

1. Direct Buffer Loading (ADO.NET)

2. Native Binary Streaming (Bulk Connectors)

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions