Stabilize search pipeline and improve preview diagnostics

2026-03-13 18:32:54 +09:00
parent 6f3149a443
commit 7dfb1ad2de
8 changed files with 463 additions and 45 deletions
@@ -23,6 +23,33 @@
  - `go build ./backend` from repo root conflicts with the existing `backend/` directory name
  - the verified build command is now `go build -o /tmp/... ./backend`

+## Current Session Update (2026-03-13, Search/Preview Follow-up)
+- Investigated a production search failure using downloaded frontend logs.
+- Identified the main timeout causes:
+  - too many search results were being collected
+  - too many Gemini Vision batches were being evaluated sequentially
+  - backend debug messages were broadcasting oversized result payloads
+- Applied search pipeline optimization:
+  - reduced per-source result caps
+  - reduced query fan-out for Google Video
+  - reduced enrichment cap
+  - limited Gemini Vision evaluation to top-ranked candidates only
+- Improved Google Video filtering:
+  - added filters that ban music/BGM/trailer-style noise results
+- Improved Envato enrichment fidelity:
+  - source page metadata is now preferred over search-engine proxy thumbnails
+  - source snippet/title are now taken from page metadata when available
+  - preview mp4 extraction now works via HTML/JSON-LD parsing
+  - added a Python HTML fetch fallback for Cloudflare-challenged Envato pages, because the Go HTTP client alone received 403 challenge pages in testing
+- Improved Artgrid fidelity:
+  - source page title/description/thumbnail are now preferred over search-engine snippets when available
+  - preview extraction is not yet solved for all Artgrid clips, because the public HTML tested here did not expose a stable mp4/m3u8 URL
+- Improved logging:
+  - backend search debug events now emit summaries, timings, source counts, preview counts, and Gemini batch stats instead of giant raw arrays
+  - frontend now logs raw non-JSON error bodies instead of collapsing them to `{}` on gateway/proxy failures
+- Improved result rendering:
+  - search cards now show source snippet/description separately from AI reason to reduce confusion between asset metadata and Gemini commentary
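The "top-ranked candidates only" step in the pipeline optimization above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `Candidate` type, the `topBatches` helper, and the cap and batch-size values are all hypothetical, and the sketch only assumes that each result already carries a ranking score before Gemini Vision is called.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is a hypothetical search result with a pre-computed ranking score.
type Candidate struct {
	URL   string
	Score float64
}

// topBatches keeps the maxCandidates highest-scoring candidates and splits
// them into batches of at most batchSize, so that the number of vision
// batches is bounded regardless of how many results were collected.
func topBatches(cands []Candidate, maxCandidates, batchSize int) [][]Candidate {
	if maxCandidates <= 0 || batchSize <= 0 {
		return nil
	}
	// Sort a copy so the caller's slice is left untouched.
	ranked := append([]Candidate(nil), cands...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return ranked[i].Score > ranked[j].Score
	})
	if len(ranked) > maxCandidates {
		ranked = ranked[:maxCandidates]
	}
	var batches [][]Candidate
	for start := 0; start < len(ranked); start += batchSize {
		end := start + batchSize
		if end > len(ranked) {
			end = len(ranked)
		}
		batches = append(batches, ranked[start:end])
	}
	return batches
}

func main() {
	// Illustrative data only.
	cands := []Candidate{
		{"https://example.com/a", 0.1},
		{"https://example.com/b", 0.9},
		{"https://example.com/c", 0.5},
		{"https://example.com/d", 0.7},
		{"https://example.com/e", 0.3},
	}
	for i, batch := range topBatches(cands, 3, 2) {
		fmt.Printf("batch %d: %d candidates\n", i, len(batch))
	}
}
```

Capping before batching bounds the worst-case number of sequential vision calls at `ceil(maxCandidates / batchSize)`, which is the property the timeout analysis above depends on.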
+
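For the Envato preview extraction noted above, one possible shape of the HTML/JSON-LD parsing is sketched below. It is an assumption-laden illustration rather than the actual parser: `extractPreviewMP4`, `findMP4`, and `isMP4` are hypothetical names, and the sketch assumes the page embeds a schema.org-style `contentUrl` field inside a `<script type="application/ld+json">` block, which may not match every real page.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// ldJSONBlock captures the body of <script type="application/ld+json"> tags.
// A regexp keeps the sketch short; a real HTML parser would be more robust.
var ldJSONBlock = regexp.MustCompile(
	`(?is)<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>`)

// isMP4 reports whether the URL path (ignoring any query string) ends in .mp4.
func isMP4(rawURL string) bool {
	path, _, _ := strings.Cut(rawURL, "?")
	return strings.HasSuffix(strings.ToLower(path), ".mp4")
}

// findMP4 walks a decoded JSON value and returns the first "contentUrl"
// string that points at an mp4 file, or "" if none is found.
func findMP4(v any) string {
	switch node := v.(type) {
	case map[string]any:
		if s, ok := node["contentUrl"].(string); ok && isMP4(s) {
			return s
		}
		for _, child := range node {
			if u := findMP4(child); u != "" {
				return u
			}
		}
	case []any:
		for _, child := range node {
			if u := findMP4(child); u != "" {
				return u
			}
		}
	}
	return ""
}

// extractPreviewMP4 scans every JSON-LD block in the page HTML and returns
// the first mp4 URL found. Malformed JSON blocks are skipped.
func extractPreviewMP4(html string) string {
	for _, m := range ldJSONBlock.FindAllStringSubmatch(html, -1) {
		var doc any
		if err := json.Unmarshal([]byte(m[1]), &doc); err != nil {
			continue
		}
		if u := findMP4(doc); u != "" {
			return u
		}
	}
	return ""
}

func main() {
	// Illustrative page fragment, not real Envato markup.
	page := `<html><head><script type="application/ld+json">
	{"@type":"VideoObject","name":"Sample clip","contentUrl":"https://example.com/preview.mp4?x=1"}
	</script></head></html>`
	fmt.Println(extractPreviewMP4(page))
}
```

An empty return value lets the caller fall back to search-engine proxy data. The same approach would not help for Artgrid as described above, since the public HTML tested there exposed no stable mp4/m3u8 URL to find.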
 ## Local Self-Test Workflow
 - Primary command:
  - `bash scripts/selftest.sh`
@@ -145,7 +172,8 @@
 - Gemini batch evaluation exists, but search quality can still degrade if upstream SearXNG results are noisy.
 - Frontend JavaScript was not linted with Node tooling in this environment because `node` is not installed here.
 - Full browser-level preview validation is still not covered by the local self-test script.
- Search cards still render recommendation reason text, not a robust asset description/snippet mapping.
+- Search cards now separate source snippet from AI reason, but metadata fidelity still depends on source enrichment quality.
+- Artgrid public pages inspected from this environment did not expose a stable public preview video URL in their HTML, so Artgrid hover-video support may remain partial until a browser-captured HTML/HAR sample reveals the real preview source pattern.

 ## Frontend Debug Logger
 - UI button: bottom-right `Logs`
@@ -215,6 +243,7 @@
 - [ ] Better matching between rendered description and actual linked asset
 - [ ] Add browser-level verification for preview/HLS behavior
 - [ ] Add more automated coverage for search ranking / filtering logic
+- [ ] If Artgrid hover preview is still required, collect one real clip HTML/HAR from a browser session and derive a stable preview URL parser
 - [ ] Add proper frontend build/lint step if Node becomes available

 ## Verified Locally In This Environment