When sending a hash lookup to ScreenScraper, romnom was always set to the
archive filename on disk (e.g. Mario.zip). For single-file archives, the hash
is computed from the internal file (e.g. mario.n64), so sending the archive
name sends slightly incorrect info to ss.fr during a KO scrape.
When archive_members has exactly one entry, romnom now uses that member's
name. Multi-file archives and non-archive files continue to use the filesystem
filename unchanged.
Closes#3444
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
FSAssetsHandler.compute_content_hash and _compute_zip_hash were
building full paths via f"{self.base_path}/{file_path}". self.base_path
is already a pathlib.Path (resolved by FSHandler.__init__), so the
f-string forced it to str, hard-coded the separator, and re-parsed --
fine on Linux but a footgun if a caller ever sneaks a leading slash or
the path needs Path semantics elsewhere.
Switch both spots to self.base_path / file_path, which is what every
other FSHandler subclass in this module already does (e.g.
FSRomsHandler, FSResourcesHandler, FSSyncHandler all join Path objects
directly).
Three sync callsites (endpoints/sync.py, sync_watcher.py, and both
branches of tasks/sync_push_pull_task.py) ran get_saves(...) and then
discarded archival null-slot rows in a Python list comprehension. On
libraries with many archival/web-UI uploads that's a strict waste:
those rows are pulled from MariaDB, hydrated into Save model instances,
and then immediately filtered out.
Add a slot_not_null bool kwarg to DBSavesHandler.get_saves and apply
the filter in the SQL query. Update all four callsites to use it and
drop the Python-side comprehension. Default stays False so unrelated
callers keep the current behavior.
get_all_saves() materialized every Save row across all users into a
single .all() list. On instances with very large libraries that's a
real RAM ceiling and pins every row for the lifetime of the recompute
run.
Replace it with get_saves_after_id(after_id, limit) and have the
recompute task drive keyset pagination in PAGE_SIZE-row chunks. SQLAlchemy
streaming via .execution_options(yield_per=...) is incompatible with the
per-call session lifetime that @begin_session enforces (the session
exits before the consumer iterates), so keyset paging from the caller is
the cleanest fit.
Behavior is unchanged: same row coverage, same idempotency, same
counters. Memory usage drops from O(all saves) to O(PAGE_SIZE).
Cleanup pass on save-sync addressing three independent failure modes
that interact in production data: content_hash drift between client
and server, null-slot archival saves leaking into sync flows, and
content-hash dedupe collapsing legitimately-distinct slots.
Bug fixes
- compute_content_hash dispatched on zipfile.is_zipfile(relative_path),
which silently returned False whenever the process's CWD wasn't
ASSETS_BASE_PATH. Every zip save fell through to the raw-MD5 branch,
persisting hashes that disagreed with clients computing the intended
per-entry zip-hash. Resolve to a full path before the dispatch.
- _build_negotiate_plan, sync_push_pull_task, and sync_watcher all
treated null-slot saves as sync-eligible. Null-slot saves represent
web-UI / archival uploads; including them in negotiate plans matched
them against device pushes by filename and overwrote archival data.
Filter null-slot saves at all three call sites.
- get_save_by_content_hash matched on (rom_id, user_id, content_hash)
only, so identical bytes uploaded to different slots collapsed into
one record. Scope the lookup by slot when provided so clone-save-
to-new-slot creates a distinct row per slot.
- get_save_by_filename matched on (rom_id, user_id, file_name) only.
When two uploads to different slots happened in the same wall-clock
second (the datetime tag is per-second), the second upload UPDATED
the first record's slot instead of creating a distinct row. Scope
the filename lookup by slot too.
One-shot recovery
- New recompute_save_content_hashes manual task walks every Save row,
recomputes via the fixed dispatch, and updates rows whose values
differ. Idempotent; safe to re-run.
- Backend startup runs a COUNT(content_hash IS NULL) query and, if
any rows exist, enqueues the recompute task on the low-priority
RQ queue. The API process moves on; the worker handles the
recompute out-of-band. Subsequent restarts find zero NULL hashes
and skip. Admins can also trigger the task manually.
Test infrastructure
- Added tests/_zipfile_shim.reload_zipfile() mirroring the pattern
from utils/zip_cache.py for the same zipfile-inflate64 + CPython
3.13.5 incompatibility. Test fixtures that build ZIPs call it
immediately before opening the archive.
Internal members of multi-file archives (zip/tar/7z/rar) are now hashed
individually (crc/md5/sha1) and stored in a new `archive_members` JSON
column on the archive's RomFile, alongside the existing composite hash
used for hash-database matching. Only the archive itself is surfaced as
a RomFile so full_path keeps pointing at a file that exists on disk,
which is the constraint that previously forced us to choose between
composite-only or broken downloads.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Consolidate all archive readers (zip/tar/7z/rar) and 7z-internal helpers
into a single utils/archives.py module to keep the archive surface area
in one place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-internal-member RomFiles produced full_paths that didn't exist on
disk, breaking downloads and zip-building. Stream entries into the
composite hash only and emit one RomFile pointing at the archive itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous validator did a preflight `socket.getaddrinfo` before each
httpx request. Two problems:
* DNS rebinding / TOCTOU: httpx re-resolves at connect time, so a
hostname can answer with a public IP for the validator and a
private IP for the real request. The preflight check did not
constrain the connection.
* Event-loop blocking: `socket.getaddrinfo` is synchronous, and the
media-download callers are async. Slow resolvers stalled
unrelated requests.
Replace it with two layers, both wired automatically onto every httpx
client built by `utils.context`:
1. A request event hook running `validate_url_for_http_request`
(syntactic checks only: scheme, reserved hostnames, literal IPs,
internal TLDs). No DNS, no call-site responsibility.
2. `SSRFProtectedAsyncBackend` / `SSRFProtectedSyncBackend`, custom
httpcore network backends that resolve the hostname inside
`connect_tcp`, reject any address in a forbidden range, then
connect to that *same* validated address. The async variant uses
`loop.getaddrinfo` so it doesn't block the loop. httpcore calls
`start_tls(server_hostname=<URL host>)` after `connect_tcp`, so
TLS SNI and cert verification still use the original hostname
even though the TCP layer connects by IP.
Drop the explicit `validate_url_for_http_request(...)` calls from
`resources_handler.py` — the event hook covers them. Consolidate the
URL validator and its tests under `utils/ssrf.py` /
`tests/utils/test_ssrf.py` so the SSRF surface lives in one module.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply the helper to the three other per-file FileHash constructions
(folder-walk hash, empty-archive fallback, single-file hash). The
all-empty FileHash literals are left alone since the helper would be
strictly more obscure for that case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add read_rar_archive_files via the existing 7zz binary (which natively
handles RAR3/RAR5 read), and collapse the per-extension reader dispatch
into an ARCHIVE_READERS dict so future formats are one entry away. Also
extract a small _make_file_hash helper to remove the repeated nested
ternaries in the inner loop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull file/archive readers (zip/tar/gz/bz2/7z), CHD parsing, and the
shared libmagic MIME detector out of roms_handler.py into a new
utils/archives.py. Rename the previously underscore-prefixed
read_zip_archive_files / read_tar_archive_files to match the existing
read_7z_archive_files convention, and consolidate the duplicated
"with lock: detector.from_file()" pattern into a detect_mime_type helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standard media fields (url_cover, url_manual, url_screenshots) were downloaded
using the stored credential-less URLs, causing them to count against the anonymous
IP quota instead of the user's SS account. Apply add_ss_auth_to_url() at each
download call site in the scan and ROM update paths.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(screenscraper): guard add_ss_auth_to_url against non-SS URLs
Only inject ssid/sspassword into screenscraper.fr URLs to prevent
leaking user credentials to third-party sources (IGDB, LaunchBox, etc.)
when url_cover/url_manual/url_screenshots originate from other providers.
Add tests for the non-SS no-op and empty-string edge cases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test(screenscraper): verify SS credentials injected for all media download paths
- TestAddSsAuthToUrl: add guards for non-SS URLs (IGDB, LaunchBox) and
empty string inputs
- test_update_rom: verify ssid/sspassword appear in url_cover and
url_manual args passed to get_cover/get_manual for screenscraper.fr
URLs; verify IGDB URLs are NOT decorated with SS credentials
- TestScanCredentialInjection: verify the scan-path ternary pattern
correctly applies add_ss_auth_to_url to cover and screenshot URLs,
and that a None cover URL passes through without error
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test(screenscraper): empirical audit — every SS request carries ssid/sspassword
Intercepts both HTTP clients at the transport/session level to verify
that every outgoing screenscraper.fr request is decorated with the user's
ssid and sspassword credentials:
aiohttp (API calls via auth_middleware):
- jeuInfos.php, jeuRecherche.php, ssinfraInfos.php, ssuserInfos.php
httpx (media downloads via FSResourcesHandler):
- get_cover → url_cover
- get_manual → url_manual
- get_rom_screenshots → url_screenshots (each URL)
- store_media_file → extra media (fanart, bezel, etc.)
Also verifies the domain guard: IGDB URLs passed through add_ss_auth_to_url
are NOT decorated with SS credentials.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The paginated ROM list eager-loaded sibling_roms via selectinload, which
hydrated full Rom ORM instances (including heavy JSON metadata columns)
for every sibling even though only an existence/count check was needed
on the frontend. On large collections this dominated request latency.
Split sibling handling by response shape:
- SimpleRomSchema (list): siblings is now list[int]; populated per page
by a single SELECT against the sibling_roms view projecting only
(rom_id, sibling_rom_id) — no Rom row hydration.
- DetailedRomSchema (detail): keeps full SiblingRomSchema objects, with
load_only on (id, name, fs_name_no_tags, fs_name_no_ext) so sibling
rows stop dragging in JSON metadata.
Frontend usage already only consumes siblings.length on list views; the
detail-page VersionSwitcher continues to receive the richer schema.
Drop the migration and the multi_file / top_level_file_count columns on
roms; express both as deferred column_property correlated subqueries
against rom_files instead. The gallery list and detail queries opt in
via undefer, so they get the values computed in the same SELECT via
indexed subqueries (rom_id index already in place); other code paths
that don't read the flags pay nothing.
This keeps the gallery perf win (no rom_files load for cards) without
introducing schema state that has to stay in sync with rom_files at
write time.
ASGI spec only allows headers on the http.response.start message;
appending Set-Cookie to body messages is out-of-spec and may break on
some servers. Early-return for non-start messages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The gallery list endpoint was eager-loading every rom_file row for each
paginated ROM via selectinload, then re-joining each row back to its
parent rom for the is_top_level computation. For platforms with extracted
multi-file ROMs (Xbox 360 ~1394 files/ROM, Switch ~199 files/ROM), this
made /api/roms time out at 120s even with a rom_id index.
Cards never displayed individual files — only the has_simple_single_file
/ has_nested_single_file / has_multiple_files booleans that derive from
the file list. Denormalize the underlying state onto roms as multi_file
(folder-based vs single-file) and top_level_file_count, recompute the
booleans from those columns, drop the selectinload from filter_roms, and
move the files field from SimpleRomSchema to DetailedRomSchema so the
gallery payload no longer ships file rows.
Also drop the redundant joinedload(RomFile.rom) and switch the relation
to lazy="select" so subsequent file.rom accesses resolve from the
session identity map instead of re-JOINing the parent rom per file row.
ShowQRCode.vue's folder-based DS/3DS fallback now fetches the detailed
rom on demand, since SimpleRom no longer carries files.
Apply the same lazy-factory pattern to FSLaunchboxHandler and FSSyncHandler
that ssh_sync_handler now uses. With both opt-in features deferred to
first-use, the tolerate_missing_base escape hatch on FSHandler is no longer
needed — every handler now fails loudly on mkdir failure, which is the
right behavior for the always-on core paths (assets, library, resources).
Touched call sites:
- resources_handler._resolve_local_file_uri (launchbox)
- sync_watcher.py, endpoints/device.py, tasks/manual/sync_folder_scan.py
(fs_sync)
Net effect:
- Default installs never poke /romm/launchbox or /romm/sync at startup.
- Misconfigured opt-in users get a clear, actionable PermissionError at
the call site instead of a silent warning followed by mystery failures.
- tolerate_missing_base, its tests, and one stale log import are gone.