kingstonTime
10 hours ago
Most teams treat skills, MDC rules, and system prompts as write-once artifacts, refined by vibes. The post looks at two practical approaches to measuring whether they actually work: deterministic rubric testing and paired comparisons borrowed from RLHF.
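For concreteness, here is a minimal sketch of what those two approaches can look like in code. It is not the post's harness; the rubric entries, the judge labels, and helper names like `score_against_rubric` and `win_rate` are illustrative assumptions.

```python
# Illustrative sketch: deterministic rubric checks plus RLHF-style pairwise
# comparison of two prompt variants. All names and checks here are made up.
from collections import Counter

# --- deterministic rubric testing ---------------------------------------
# A rubric is a list of (name, predicate) pairs; each predicate is a pure
# function of the model output, so re-running the suite is reproducible.
RUBRIC = [
    ("mentions_tradeoffs", lambda out: "tradeoff" in out.lower()),
    ("under_200_words",    lambda out: len(out.split()) <= 200),
    ("no_apologies",       lambda out: "sorry" not in out.lower()),
]

def score_against_rubric(output: str) -> dict:
    """Return a pass/fail map for one model output."""
    return {name: check(output) for name, check in RUBRIC}

# --- paired comparisons ---------------------------------------------------
# Given preference labels ("a", "b", or "tie") from a judge model or a human
# on the same task set, a simple win rate is often enough to decide which
# prompt variant to keep.
def win_rate(preferences: list[str]) -> dict:
    counts = Counter(preferences)
    decided = counts["a"] + counts["b"]
    return {
        "a_win_rate": counts["a"] / decided if decided else 0.0,
        "ties": counts["tie"],
        "n": len(preferences),
    }

if __name__ == "__main__":
    print(score_against_rubric("Here are the tradeoffs, in under 200 words."))
    print(win_rate(["a", "a", "b", "tie", "a"]))
```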
It also covers token cost as a forcing function for justifying what context stays.
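On the token-cost angle, the forcing function can be as simple as a script that prices each standing context file per request. A sketch assuming tiktoken and a made-up per-token price:

```python
# Illustrative sketch: put a per-request price on each piece of standing
# context so every rule/skill file has to justify its token cost.
# The price constant is an assumption; pass your own files on the command line.
import sys
import tiktoken

PRICE_PER_MTOK = 3.00  # assumed input price, USD per million tokens
enc = tiktoken.get_encoding("cl100k_base")

for path in sys.argv[1:]:
    with open(path) as f:
        tokens = len(enc.encode(f.read()))
    cost_per_request = tokens / 1_000_000 * PRICE_PER_MTOK
    print(f"{path}: {tokens} tokens, ~${cost_per_request:.4f} per request")
```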
Curious whether others have built eval pipelines for their prompt context.