feat: filter + cosine-similarity core allocation with Redis accounting #62
Replace First-Fit core selection with a hard availability filter followed
by cosine-similarity weighting, and move resource accounting into Redis so
CPU becomes a real constraint and load spreads across cores.
- selection (service/core_allocation.go): hard filter (alive + cpu/mem/disk
AND, with overcommit and reserve) -> cosine similarity between a core's
remaining-ratio vector and the request-ratio vector; tie-break by
remaining magnitude then lowest index. Pure logic (chooseBest/feasible/
cosineSimilarity/ratio) is split out and unit-tested.
- accounting (service/alloc_redis.go): per-core HASH core:{ip}:{port}:alloc
on the existing Redis instance (6379), namespaced apart from VM-status
keys. GetCoreAlloc / IncrCoreAlloc (TxPipeline) / SetCoreAlloc /
RebuildCoreAllocFromDB (idempotent rebuild from DB sums on startup).
- double-booking: Lock -> SelectCore -> IncrCoreAlloc(+) -> Unlock as one
critical section; slow I/O stays outside the lock; cleanup() rolls the
reservation back on failure. Single-process assumption documented.
- CPU capacity: measure logical CPUs via /getStatusHost (vcpu_status.total)
and cache per core; drop the FreeCPU=9999 hardcode.
- healthcheck (service/healthcheck.go): 30s goroutine refreshing core
liveness/capacity; IsAlive is now bidirectional (recovered cores re-enable).
- config: cpu_overcommit / mem_reserve_pct / disk_reserve_pct in config.yaml
with env overrides and safe defaults.
- DeleteVM: release alloc and clean up VMInfoIdx/VMLocation/AliveVM
(previously leaked).
Free* fields are demoted to a display-only cache; availability is derived
from CoreInfoIdx - Redis alloc - reserve.
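
The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the code in service/core_allocation.go: the `Vec3`, `Core`, `Tuning`, and `remaining` names are hypothetical, the function signatures are guesses, and normalizing both vectors by the core's capacity is an assumption about how the ratio vectors are built. Returning 0 for a zero vector is also a choice made here; the PR's own handling may differ.

```go
package main

import (
	"fmt"
	"math"
)

// Vec3 holds {cpu, mem, disk}. Hypothetical type for this sketch.
type Vec3 [3]float64

// Core is a simplified, hypothetical view of one core's accounting state.
type Core struct {
	Alive     bool
	Capacity  Vec3 // logical CPUs, total memory, total disk
	Allocated Vec3 // amounts already reserved
}

// Tuning mirrors cpu_overcommit / mem_reserve_pct / disk_reserve_pct.
type Tuning struct {
	CPUOvercommit, MemReservePct, DiskReservePct float64
}

// remaining returns what can still be allocated on c in each dimension.
func remaining(c Core, t Tuning) Vec3 {
	return Vec3{
		c.Capacity[0]*t.CPUOvercommit - c.Allocated[0],
		c.Capacity[1] - (c.Allocated[1] + c.Capacity[1]*t.MemReservePct),
		c.Capacity[2] - (c.Allocated[2] + c.Capacity[2]*t.DiskReservePct),
	}
}

// feasible is the hard filter: the core is alive AND every dimension fits.
func feasible(c Core, req Vec3, t Tuning) bool {
	if !c.Alive {
		return false
	}
	rem := remaining(c, t)
	for i := range req {
		if req[i] > rem[i] {
			return false
		}
	}
	return true
}

// ratio normalizes v by capacity; a zero capacity yields 0 (assumption).
func ratio(v, capacity Vec3) Vec3 {
	var out Vec3
	for i := range v {
		if capacity[i] > 0 {
			out[i] = v[i] / capacity[i]
		}
	}
	return out
}

// cosineSimilarity returns 0 for a zero vector instead of NaN (assumption).
func cosineSimilarity(a, b Vec3) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// chooseBest returns the index of the best feasible core, or -1 if none fits.
// Ties go to the larger remaining magnitude, then to the lowest index.
func chooseBest(cores []Core, req Vec3, t Tuning) int {
	best, bestScore, bestMag := -1, 0.0, 0.0
	for i, c := range cores {
		if !feasible(c, req, t) {
			continue
		}
		rem := ratio(remaining(c, t), c.Capacity)
		score := cosineSimilarity(rem, ratio(req, c.Capacity))
		mag := math.Sqrt(rem[0]*rem[0] + rem[1]*rem[1] + rem[2]*rem[2])
		if best == -1 || score > bestScore || (score == bestScore && mag > bestMag) {
			best, bestScore, bestMag = i, score, mag
		}
	}
	return best
}

func main() {
	t := Tuning{CPUOvercommit: 4.0, MemReservePct: 0.1, DiskReservePct: 0.1}
	cores := []Core{
		{Alive: false, Capacity: Vec3{8, 32, 500}},
		{Alive: true, Capacity: Vec3{8, 32, 500}, Allocated: Vec3{0, 24, 0}},
		{Alive: true, Capacity: Vec3{8, 32, 500}, Allocated: Vec3{28, 0, 0}},
	}
	fmt.Println("chosen core index:", chooseBest(cores, Vec3{4, 2, 20}, t))
}
```

Because the comparison uses strict `>`, an earlier core keeps its place when score and magnitude are both equal, which gives the lowest-index tie-break without extra bookkeeping.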
@ga111o
Pull request overview
This PR refactors VM core allocation from a simple first-fit approach to a hard-filter + cosine-similarity scoring model, and moves resource accounting to Redis-backed per-core allocation totals. It also introduces periodic health checks to refresh core availability/capacity and removes older repository/resource-manager abstractions in favor of methods directly on ControlContext.
Changes:
- Add new core selection algorithm (hard feasibility filter + cosine-similarity scoring, with ties broken by remaining magnitude and then lowest index) and unit tests for the pure decision logic.
- Introduce Redis HASH-based per-core allocation accounting and rebuild-on-startup reconciliation from DB.
- Add a periodic healthcheck goroutine to refresh core liveness and capacity, and update VM create/delete flows to reserve/rollback allocations.
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tests/docker-compose.test.yml | Adjust MySQL healthcheck and container restart behavior for tests. |
| structure/vm.go | Add allocation-tuning parameters to config (overcommit/reserve). |
| structure/resource_manager.go | Remove legacy in-memory resource manager implementation. |
| structure/repository.go | Remove legacy VM repository interface. |
| structure/mysql_vm_repository.go | Remove legacy MySQL repository implementation (moved into ControlContext). |
| structure/control_infra.go | Move VM DB persistence methods into ControlContext; add context mutex helpers. |
| startup/init.go | Replace old repo/resource-manager initialization; add CPU total measurement via core API. |
| startup/core_ip_config.go | Apply allocation defaults and env overrides for new config parameters. |
| service/vm.go | Switch CreateVM/DeleteVM to new SelectCore + Redis reservation/rollback + in-memory tracking updates. |
| service/redis.go | Store/read VM Redis records using client/model.VMRedisInfo directly. |
| service/network.go | Update CMS subnet allocation calls to new CMS client API; remove CMS delete helper. |
| service/healthcheck.go | New periodic core healthcheck to refresh liveness/capacity. |
| service/guacamole.go | Update locking to use ControlContext mutex instead of removed Resources. |
| service/dto.go | Remove service-layer DTOs; API now passes client model structs directly. |
| service/core_allocation.go | New hard-filter + cosine similarity core selection logic. |
| service/core_allocation_test.go | Unit tests for selection math/feasibility/tie-breaking. |
| service/cleanup.go | Remove old cleanup-chain helper (replaced by inline cleanup closure). |
| service/alloc_redis.go | New Redis HASH accounting for per-core alloc + DB-based rebuild on startup. |
| resources/config.yaml | Add defaults for cpu_overcommit/mem_reserve_pct/disk_reserve_pct. |
| main.go | Rebuild alloc-Redis on startup and start healthcheck goroutine. |
| client/vm.go | Document and use /getStatusHost CPU-total retrieval for capacity. |
| client/model/vm.go | Add vcpu_status.total support to host CPU status response; define VMRedisInfo/constants. |
| client/cms.go | Rename/reshape CMS subnet request/response; remove CMS delete API. |
| api/get_vm_status.go | Return client model status structs directly (removing API-specific wrappers). |
| api/create_vm.go | Decode request directly into client/model.CreateVMRequest and call service with it. |
| .env.example | Add env vars for allocation tuning parameters. |
| fmt.Printf("%s\n", subnetReq.IP) | ||
| fmt.Printf("%s\n", subnetReq.MacAddr) | ||
| fmt.Printf("%s\n", subnetReq.SdnUUID) | ||
|
|
| req.SdnUUID = subnetReq.SdnUUID | ||
| req.MacAddr = subnetReq.MacAddr | ||
| req.NetConf.NetType = 0 | ||
| req.Users[0].SSHAuthorizedKeys = []string{publicKeyOpenSSH} |
| cleanup.run() | ||
| log.Error("CreateVM: at least one user is required for Guacamole configuration", true) | ||
| return fmt.Errorf("CreateVM: at least one user is required") | ||
| uuid := vms.UUID(req.UUID.String().(string)) |
| coreClient := client.NewCoreClient(core) | ||
| if _, err := coreClient.DeleteVM(context.Background(), model.DeleteVMRequest{ | ||
| _, err := coreClient.DeleteVM(ctx, model.DeleteVMRequest{ | ||
| UUID: uuid, | ||
| Type: model.HardDelete, | ||
| }); err != nil { | ||
| }) | ||
| if err != nil { | ||
| log.Error("error deleting VM %s on core %s: %v", uuid, core.IP, err) | ||
| return fmt.Errorf("DeleteVM: failed to delete VM %s on core %s: %w", uuid, core.IP, err) | ||
| } | ||
|
|
||
| cmsClient := client.NewCmsClient() | ||
| if err := DeleteCmsSubnet(cmsClient, contextStruct, uuid); err != nil { | ||
| log.Error("DeleteVM: failed to delete CMS subnet for VM %s: %v", uuid, err) | ||
| // A CMS deletion failure is only logged; the delete itself is treated as successful (manual cleanup is needed afterwards) | ||
| err = contextStruct.DeleteInstance(uuid) | ||
| if err != nil { | ||
| log.Error("error deleting instance %s from ControlContext: %v", uuid, err) | ||
| return fmt.Errorf("DeleteVM: failed to delete instance %s: %w", uuid, err) | ||
| } | ||
| if cleanupErr := guacamole.Cleanup(string(uuid), contextStruct.GuacDB); cleanupErr != nil { | ||
| log.Error("Failed to cleanup Guacamole config during rollback: %v", cleanupErr) | ||
| } | ||
|
|
| func (c *ControlContext) FindCoreByVmUUID(uuid UUID) *Core { | ||
| log := util.GetLogger() | ||
|
|
||
| // Searching in-memory cache | ||
| c.Resources.RLock() | ||
| if core, ok := c.Resources.VMLocation[uuid]; ok { | ||
| c.Resources.RUnlock() | ||
| return core | ||
| } | ||
| c.Resources.RUnlock() | ||
|
|
||
| // If not found in cache, query the repository | ||
| coreIdx, err := c.VMRepo.GetInstanceLocation(uuid) | ||
| coreIdx, err := c.GetInstanceLocation(uuid) | ||
| if err != nil { | ||
| log.Error("Core not found for VM UUID %s", uuid, true) | ||
| return nil | ||
| } | ||
| c.Resources.Lock() | ||
| defer c.Resources.Unlock() | ||
| if coreIdx < 0 || coreIdx >= len(c.Resources.Cores) { | ||
| log.Error("Core index %d out of range for VM UUID %s", coreIdx, uuid, true) | ||
| return nil | ||
| return &c.Cores[coreIdx] | ||
| } |
| last_subnet := ctx.Last_subnet | ||
| next_last_subnet := pkgnetwork.FindSubnet(last_subnet) | ||
| log.Info("NewCmsSubnet : next_last_subnet: %s", next_last_subnet) | ||
|
|
||
| // Reserve the next subnet before calling CMS to prevent duplicate allocation under concurrent calls | ||
| _, err := ctx.DB.Exec("UPDATE subnet SET last_subnet = ? WHERE id = 1", nextLastSubnet) | ||
| // Update the DB first to reserve the subnet. | ||
| // Reserving before the CMS call prevents the same subnet from being allocated twice on failure. | ||
| _, err := ctx.DB.Exec("UPDATE subnet SET last_subnet = ? WHERE id = 1", next_last_subnet) | ||
| if err != nil { | ||
| log.Error("Failed to update last_subnet in database: %v", err) | ||
| return nil, fmt.Errorf("NewCmsSubnet: failed to update last_subnet in DB: %w", err) | ||
| } | ||
| ctx.Last_subnet = nextLastSubnet | ||
| ctx.Last_subnet = next_last_subnet | ||
|
|
| util.RespondError(w, http.StatusBadRequest, "Memory, CPU, and Disk must be non-zero") | ||
| return | ||
| } | ||
|
|
| var rows *sql.Rows | ||
| rows, err = tx.QueryContext(ctx, "SELECT info.uuid, loc.core, info.inst_ip, info.guac_pass, info.inst_vcpu, info.inst_mem, info.inst_disk FROM inst_loc loc JOIN inst_info info ON loc.uuid = info.uuid") | ||
| if err != nil { | ||
| log.Error("Failed to get joined instance info: %v", err) | ||
| return nil, nil, err | ||
| } | ||
|
|
| Usage float64 `json:"usage_percent"` | ||
| // Desc exists only in the host /getStatusHost (host_dataType=0) response (runtime.NumCPU()). | ||
| // The per-VM /getStatusUUID response does not carry it, so it is a pointer and absence is detected as nil. | ||
| Desc *VCPUStatus `json:"vcpu_status"` |
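Summary
Replaces the VM core allocation logic, moving from First-Fit to a hard filter (availability AND) plus cosine-similarity weighting, and reworks resource accounting on top of Redis.
- CPU capacity is measured via /getStatusHost (vcpu_status.total) and cached (removes the existing FreeCPU=9999 hardcode, so CPU acts as a real constraint).
- Allocations are aggregated in a per-core Redis HASH (core:{ip}:{port}:alloc), sharing the existing Redis 6379 instance.
- Overcommit and reserve parameters are introduced via config.yaml/env.
- A periodic healthcheck refreshes core liveness (resolves the stale IsAlive problem).
- The reserve, rollback, and cleanup paths of CreateVM/DeleteVM are reinforced.
Motivation
The current core selection (First-Fit in service/vm.go) had the following limitations.
- FreeCPU was hardcoded to 9999, so vCPU overcommit was unlimited.
- Accounting was scattered across single Free* counters and the DB, with no notion of overcommit or reserve.
- IsAlive was set only once at startup, so dead or recovered cores were not reflected.
The goal is better load spreading and a CPU constraint that actually binds, through (1) measured CPU capacity, (2) Redis-based allocation accounting, and (3) filter + cosine selection of the core that matches the shape of the request.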
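Approach
Selection algorithm (service/core_allocation.go, new)
- Hard filter (all conditions must hold):
  - req.cpu ≤ logical_cpu*overcommit − allocated_cpu
  - req.mem ≤ total_mem − (allocated_mem + total_mem*mem_reserve_pct)
  - req.disk ≤ total_disk − (allocated_disk + total_disk*disk_reserve_pct)
- Pure logic (chooseBest, feasible, cosineSimilarity, ratio) is split out so it can be unit-tested without Redis.
Resource accounting (shared Redis) (service/alloc_redis.go, new)
- Key: core:{ip}:{port}:alloc (HASH with cpu/mem/disk fields). Its prefix differs from the VM-status keys (UUID), so the existing Redis (6379) instance is shared as-is (no separate instance is created).
- API: GetCoreAlloc / IncrCoreAlloc (TxPipeline, 3×HIncrBy) / SetCoreAlloc / RebuildCoreAllocFromDB (idempotent rebuild from DB sums on startup; stale cores are reset to 0).
- Double-booking prevention: Lock() → SelectCore → IncrCoreAlloc(+) → Unlock() forms a single critical section that serializes read-then-decide. Slow I/O (CMS/Guacamole/Core/createVM) stays outside the lock. On failure, cleanup() rolls the reservation back with IncrCoreAlloc(-).
- CPU measurement (startup/init.go, client/model/vm.go): a vcpu_status response field is added so each core is measured once and the value cached. On failure, IsAlive=false.
- Healthcheck (service/healthcheck.go, new): every 30s, /getStatusHost is queried outside the lock and the result applied inside the lock (same discipline as CreateVM). IsAlive is bidirectional (recovered cores are re-enabled).
- Alternative considered: a separate alloc-Redis (6380) was also evaluated, but sharing the existing instance was chosen to keep operations simple (key namespacing avoids collisions).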
Type of Change
Related Issue
Testing
- Unit tests: cosineSimilarity (direction / orthogonal / zero-vector NaN), ratio (boundaries), feasible (per-dimension boundaries + overcommit/reserve), chooseBest (guard exclusion / best shape / tie-breaker).
Checklist
Changed files
New:
service/core_allocation.go, service/alloc_redis.go, service/healthcheck.go, service/core_allocation_test.go
Modified:
structure/vm.go, resources/config.yaml, startup/core_ip_config.go, startup/init.go, client/model/vm.go, client/vm.go, service/vm.go, main.go, .env.example
Review points
- config.yaml defaults: cpu_overcommit: 4.0, mem_reserve_pct: 0.1, disk_reserve_pct: 0.1 (overridden by env CPU_OVERCOMMIT / MEM_RESERVE_PCT / DISK_RESERVE_PCT).
- Free* fields are demoted to a display-only cache (allocation decisions use CoreInfoIdx − Redis alloc − reserve). Removing them entirely is a possible follow-up.
- DeleteVM now also cleans up VMInfoIdx/VMLocation/AliveVM, which was previously missing.