Three sync callsites (endpoints/sync.py, sync_watcher.py, and both
branches of tasks/sync_push_pull_task.py) ran get_saves(...) and then
discarded archival null-slot rows in a Python list comprehension. On
libraries with many archival/web-UI uploads that's a strict waste:
those rows are pulled from MariaDB, hydrated into Save model instances,
and then immediately filtered out.
Add a slot_not_null bool kwarg to DBSavesHandler.get_saves and apply
the filter in the SQL query. Update all four callsites to use it and
drop the Python-side comprehension. Default stays False so unrelated
callers keep the current behavior.
get_all_saves() materialized every Save row across all users into a
single .all() list. On instances with very large libraries that's a
real RAM ceiling and pins every row for the lifetime of the recompute
run.
Replace it with get_saves_after_id(after_id, limit) and have the
recompute task drive keyset pagination in PAGE_SIZE-row chunks. SQLAlchemy
streaming via .execution_options(yield_per=...) is incompatible with the
per-call session lifetime that @begin_session enforces (the session
exits before the consumer iterates), so keyset paging from the caller is
the cleanest fit.
Behavior is unchanged: same row coverage, same idempotency, same
counters. Memory usage drops from O(all saves) to O(PAGE_SIZE).
Cleanup pass on save-sync addressing three independent failure modes
that interact in production data: content_hash drift between client
and server, null-slot archival saves leaking into sync flows, and
content-hash dedupe collapsing legitimately-distinct slots.
Bug fixes
- compute_content_hash dispatched on zipfile.is_zipfile(relative_path),
which silently returned False whenever the process's CWD wasn't
ASSETS_BASE_PATH. Every zip save fell through to the raw-MD5 branch,
persisting hashes that disagreed with clients computing the intended
per-entry zip-hash. Resolve to a full path before the dispatch.
- _build_negotiate_plan, sync_push_pull_task, and sync_watcher all
treated null-slot saves as sync-eligible. Null-slot saves represent
web-UI / archival uploads; including them in negotiate plans matched
them against device pushes by filename and overwrote archival data.
Filter null-slot saves at all three call sites.
- get_save_by_content_hash matched on (rom_id, user_id, content_hash)
only, so identical bytes uploaded to different slots collapsed into
one record. Scope the lookup by slot when provided so clone-save-
to-new-slot creates a distinct row per slot.
- get_save_by_filename matched on (rom_id, user_id, file_name) only.
When two uploads to different slots happened in the same wall-clock
second (the datetime tag is per-second), the second upload UPDATED
the first record's slot instead of creating a distinct row. Scope
the filename lookup by slot too.
One-shot recovery
- New recompute_save_content_hashes manual task walks every Save row,
recomputes via the fixed dispatch, and updates rows whose values
differ. Idempotent; safe to re-run.
- Backend startup runs a COUNT(content_hash IS NULL) query and, if
any rows exist, enqueues the recompute task on the low-priority
RQ queue. The API process moves on; the worker handles the
recompute out-of-band. Subsequent restarts find zero NULL hashes
and skip. Admins can also trigger the task manually.
Test infrastructure
- Added tests/_zipfile_shim.reload_zipfile() mirroring the pattern
from utils/zip_cache.py for the same zipfile-inflate64 + CPython
3.13.5 incompatibility. Test fixtures that build ZIPs call it
immediately before opening the archive.
Apply the same lazy-factory pattern to FSLaunchboxHandler and FSSyncHandler
that ssh_sync_handler now uses. With both opt-in features deferred to
first-use, the tolerate_missing_base escape hatch on FSHandler is no longer
needed — every handler now fails loudly on mkdir failure, which is the
right behavior for the always-on core paths (assets, library, resources).
Touched call sites:
- resources_handler._resolve_local_file_uri (launchbox)
- sync_watcher.py, endpoints/device.py, tasks/manual/sync_folder_scan.py
(fs_sync)
Net effect:
- Default installs never poke /romm/launchbox or /romm/sync at startup.
- Misconfigured opt-in users get a clear, actionable PermissionError at
the call site instead of a silent warning followed by mystery failures.
- tolerate_missing_base, its tests, and one stale log import are gone.
The module-level SSHSyncHandler() singleton ran filesystem side effects
(mkdir on SYNC_SSH_KEYS_PATH) at import time, which meant even default
installs with push-pull sync disabled would touch /romm/sync — and the
previous tolerate-and-warn fallback could leave users wondering why
sync silently does nothing.
Replace the eager singleton with a functools.cache'd factory. The
handler is now constructed on first use, so:
- Default-install users (ENABLE_SYNC_PUSH_PULL=false, no manual sync
triggered) never touch /romm/sync.
- Users opting in get a clear, actionable RuntimeError pointing at
the unwritable path and the env var to override, at the call site
rather than buried in a startup stack trace.
Also document in env.template that enabling either sync mode requires
a writable volume at $ROMM_BASE_PATH/sync.
- Restore NoResultFound behavior on update_session, complete_session,
fail_session when row is missing (scalar returns None, old .one()
raised -- silent None is a semantic regression)
- Remove redundant get_session call from _increment_session_counter;
the atomic SQL increment is already a no-op on missing rows
- Log warning when passed session_id is not found in _sync_device
instead of silently creating an orphan session
- Fix broken path construction in FSSyncHandler: build_* methods now
return relative paths; sync_watcher uses paths relative to sync base
instead of CWD (was completely non-functional in production)
- Fix SSH connection leak in push-pull task: conn.close() now in finally
- Add log.warning for disabled SSH host key verification
- Fix race condition in session operation counter: use atomic SQL
increment instead of read-then-write
- Extract _increment_session_counter helper, add exc_info to warnings
- Replace legacy session.query() with select() in sync_sessions_handler
- Fix orphaned session: trigger_push_pull now passes session_id to job
- Fix wasteful SSH download when no matched_save exists
- Fix BaseModel import collision in sync.py (pydantic -> project base)
- Fix ORM mutation in UserSchema.from_orm_with_request: set field on
schema instance instead of mutating live ORM object
- Mask ssh_password and ssh_key_path in DeviceSchema API response
- Fix migration PostgreSQL compatibility: condition ON UPDATE clause
on MySQL, drop enum in downgrade
- Rename copy-paste artifact rom_user_status_enum