Magic Words Need Measuring Sticks

1 point, posted 10 hours ago
by kingstonTime

1 Comment

kingstonTime

10 hours ago

Most teams treat skills, MDC rules, and system prompts as write-once artifacts, tuned by vibes if they are revisited at all. The post looks at two practical ways to measure whether they actually work: deterministic rubric testing and paired comparisons borrowed from RLHF.
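
For anyone who wants something concrete, here is a minimal Python sketch of both ideas. It is not the post's code: the rubric checks, the function names, and the "A"/"B"/"tie" preference labels are all assumptions made up for illustration. The first part scores one model output against deterministic pass/fail checks; the second turns a list of paired-comparison verdicts into a win rate for one prompt variant.

```python
import re
from typing import Callable

# Hypothetical rubric: each check is a deterministic predicate over one model output.
Rubric = dict[str, Callable[[str], bool]]

RUBRIC: Rubric = {
    "cites_a_file_path": lambda out: re.search(r"\b[\w/]+\.\w+\b", out) is not None,
    "no_apology": lambda out: "sorry" not in out.lower(),
    "under_50_words": lambda out: len(out.split()) <= 50,
}


def rubric_score(output: str, rubric: Rubric = RUBRIC) -> float:
    """Return the fraction of rubric checks that the output passes."""
    results = [check(output) for check in rubric.values()]
    return sum(results) / len(results)


def win_rate(preferences: list[str], candidate: str = "B") -> float:
    """Share of paired comparisons won by `candidate`; a "tie" counts as half a win."""
    if not preferences:
        raise ValueError("need at least one comparison")
    wins = sum(
        1.0 if p == candidate else 0.5 if p == "tie" else 0.0
        for p in preferences
    )
    return wins / len(preferences)
```

In use, you would run the same task set with and without a given rule or skill, compare the average `rubric_score`, and have a judge (a person or a model) label each output pair so `win_rate` can summarize which variant is preferred.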

It also covers token cost as a forcing function: every piece of context has to justify the tokens it takes up.

Curious whether others have built eval pipelines for their prompt context.