Add transcript annotation schema

This commit is contained in:
rishikanthc
2026-04-28 09:53:25 -07:00
parent a066dc71a8
commit 4e13bb9d15
10 changed files with 487 additions and 266 deletions

View File

@@ -25,6 +25,7 @@ This folder contains the planning notes, architecture rules, sprint trackers, an
| `devnotes/automatic-summarization-sprint.md` | `devnotes/v2.0.0/sprint-plans/automatic-summarization-sprint.md` |
| `devnotes/summary-widgets-sprint.md` | `devnotes/v2.0.0/sprint-plans/summary-widgets-sprint.md` |
| `devnotes/llm-provider-settings-sprint.md` | `devnotes/v2.0.0/sprint-plans/llm-provider-settings-sprint.md` |
| highlights and notes backend sprint plan | `devnotes/v2.0.0/sprint-plans/highlights-notes-backend-sprint-plan.md` |
| `devnotes/sprint-tracker.md` | `devnotes/v2.0.0/sprint-trackers/api-revamp-sprint-tracker.md` |
| `devnotes/engine-worker-sprint-tracker.md` | `devnotes/v2.0.0/sprint-trackers/engine-worker-sprint-tracker.md` |
| `devnotes/api-sprint-0-inventory.md` | `devnotes/v2.0.0/status-updates/api-sprint-00-inventory-and-removal-plan.md` |

View File

@@ -0,0 +1,238 @@
# Highlights and Notes Backend Sprint Plan
## Current Assessment
The backend does not currently have a clean notes/highlights implementation ready for use.
What exists now:
- Notes and highlights are listed as deferred modules in `devnotes/v2.0.0/specs/api-v1-master-spec.md`.
- The canonical route family is reserved under `/api/v1/transcriptions/{id}/notes`.
- The current transcript API returns transcript text, segments, and words from `GET /api/v1/transcriptions/{id}/transcript`.
- Transcription records already carry `user_id`, so future ownership checks have a clear parent boundary.
What was removed as legacy backend code:
- The old `models.Note` persistence record.
- The generic `NoteRepository`.
- Old note table schema registration and indexes.
- Legacy note migration/backfill code.
- Database tests that asserted the old note shape.
Why the old note schema was not a good foundation:
- It only modeled notes, not highlights as a first-class resource.
- It mixed compatibility fields into the persistence model.
- It had weak anchoring for transcript text changes and retranscription.
- It used internal transcription IDs directly without a clean public response contract.
- It did not give enough room for future multi-user behavior such as ownership, sharing, or audit-safe mutation.
## Target Backend Model
Use a new annotation-oriented schema rather than resurrecting the old notes table.
Recommended table: `transcript_annotations`
Core columns:
```txt
id string primary key
user_id uint not null indexed
transcription_id string not null indexed
kind string not null enum-like: highlight | note
content text null
color string null
quote text not null
anchor_start_ms integer not null
anchor_end_ms integer not null
anchor_start_word integer null
anchor_end_word integer null
anchor_start_char integer null
anchor_end_char integer null
anchor_text_hash string null
status string not null default active
metadata_json json not null default {}
created_at timestamp
updated_at timestamp
deleted_at soft delete
```
Indexes and constraints:
- `idx_transcript_annotations_user_transcription_created_at` on `(user_id, transcription_id, created_at DESC)`.
- `idx_transcript_annotations_user_kind_updated_at` on `(user_id, kind, updated_at DESC)`.
- `idx_transcript_annotations_transcription_time` on `(transcription_id, anchor_start_ms, anchor_end_ms)`.
- Foreign key from `transcription_id` to `transcriptions(id)` with cascade delete.
- Check constraint or service validation for `kind IN ('highlight', 'note')`.
- Service validation that `anchor_end_ms >= anchor_start_ms`.
Multi-user readiness:
- Every query must include `user_id`; never fetch by annotation ID alone.
- Authorization should be inherited from transcription ownership for now.
- Keep `user_id` on the annotation even though the transcription also has it. This supports future shared transcripts where annotation ownership may differ from transcript ownership.
- Public IDs should be opaque and prefixed, for example `ann_...`.
- API responses should never expose raw database IDs.
## Target REST API
Use thin handlers and route through a notes/highlights service.
Recommended canonical routes:
```http
GET /api/v1/transcriptions/{id}/annotations
POST /api/v1/transcriptions/{id}/annotations
GET /api/v1/transcriptions/{id}/annotations/{annotation_id}
PATCH /api/v1/transcriptions/{id}/annotations/{annotation_id}
DELETE /api/v1/transcriptions/{id}/annotations/{annotation_id}
```
Supported filters:
```txt
kind=highlight|note
updated_after=RFC3339 timestamp
cursor=...
limit=...
```
Convenience aliases may be added only if the frontend needs resource-specific URLs:
```http
GET /api/v1/transcriptions/{id}/notes
GET /api/v1/transcriptions/{id}/highlights
```
Those aliases should call the same annotation service with a fixed `kind` filter. Do not create separate persistence paths.
Example create request:
```json
{
"kind": "note",
"content": "Follow up on this decision",
"color": "yellow",
"quote": "We should ship the smaller model first",
"anchor": {
"start_ms": 12400,
"end_ms": 18900,
"start_word": 42,
"end_word": 51,
"start_char": 280,
"end_char": 336,
"text_hash": "sha256:..."
}
}
```
Collection response:
```json
{
"items": [],
"next_cursor": null
}
```
## Sprint 1: Schema and Migration
Goal: add a clean annotations persistence model.
Tasks:
- Add `models.TranscriptAnnotation` with persistence fields only.
- Add schema migration for `transcript_annotations`.
- Add indexes for per-transcription list, per-kind list, and time-range lookup.
- Add database tests for fresh schema, foreign key behavior, soft delete, and user-scoped indexes.
- Keep legacy `notes` migration removed; old note data is intentionally not imported into the new model.
Acceptance criteria:
- Fresh databases create `transcript_annotations`.
- Deleting a transcription deletes its annotations.
- Invalid annotation kinds and invalid time ranges are rejected by service/model validation.
## Sprint 2: Repository and Service
Goal: keep HTTP thin and put annotation decisions behind a domain service.
Tasks:
- Add an annotation repository with domain methods:
- `CreateAnnotation`
- `FindAnnotationForUser`
- `ListAnnotationsForTranscription`
- `UpdateAnnotation`
- `SoftDeleteAnnotation`
- Add an annotation service that:
- verifies transcription ownership,
- parses public IDs,
- validates anchors,
- normalizes content for highlights vs notes,
- emits small annotation events.
- Add service tests for ownership, invalid IDs, kind validation, time-range validation, and soft-delete behavior.
Acceptance criteria:
- No new annotation handler reads `database.DB` directly.
- Annotation ID lookup is always scoped by `user_id` and `transcription_id`.
- Service methods return explicit not-found, validation, and conflict errors.
## Sprint 3: REST API
Goal: expose annotations through canonical v1 REST endpoints.
Tasks:
- Add request/response types separate from GORM models.
- Register routes under `/api/v1/transcriptions/{id}/annotations`.
- Implement list/create/get/update/delete handlers.
- Add pagination using the existing list-query helpers where appropriate.
- Add route contract tests and error-envelope tests.
- Add OpenAPI documentation for the new routes.
Acceptance criteria:
- Authenticated users can create, list, update, and delete their own transcript annotations.
- Other users cannot access annotations even if they know the public annotation ID.
- Responses use public IDs: `tr_...` and `ann_...`.
- API responses do not expose local paths or internal database-only fields.
## Sprint 4: Events and Transcript Integrity
Goal: make annotations live-update friendly and robust against transcript changes.
Tasks:
- Publish `annotation.created`, `annotation.updated`, and `annotation.deleted` SSE events.
- Include only public IDs and small cache-invalidation payloads.
- Add optional anchor verification against current transcript words/segments.
- Decide behavior on retranscription:
- keep annotations but mark anchors as stale when text hash no longer matches, or
- soft-delete annotations on destructive transcript replacement.
- Add tests for event payload shape and stale-anchor behavior.
Acceptance criteria:
- Annotation events are usable for cache invalidation.
- Full page refresh can reconstruct the correct state from persisted annotations.
- Retranscription behavior is explicit, tested, and documented.
## Sprint 5: Future Multi-User Hardening
Goal: avoid another schema rewrite when multi-user support arrives.
Tasks:
- Add tests that simulate two users with different transcriptions and annotations.
- Add service seams for future shared transcript permissions.
- Review whether annotation ownership should remain independent from transcription ownership.
- Add audit-friendly fields only if product requirements need them, such as `created_by_user_id` and `updated_by_user_id`.
Acceptance criteria:
- All annotation reads and writes are user-scoped.
- No repository method can accidentally return annotations across users.
- Future shared-transcript access can be added by changing authorization policy without changing the core annotation table.