Submissions from surgehq.ai

		EnterpriseBench: CoreCraft – Measuring AI Agents in Chaotic RL Environments (surgehq.ai)
		1 point by Olshansky 4 days ago \| past \| discuss
		Hemingway bench AI writing leaderboard (surgehq.ai)
		1 point by gervwyk 34 days ago \| past
		Evolving Instruction Following Beyond IFEval and "Avoid the Letter C" (surgehq.ai)
		1 point by gk1 46 days ago \| past \| 1 comment
		LMArena is a cancer on AI (surgehq.ai)
		246 points by jumploops 63 days ago \| past \| 100 comments
		LMArena Is a Cancer on AI (surgehq.ai)
		6 points by gk1 64 days ago \| past \| 1 comment
		LMArena Is a Cancer on AI (surgehq.ai)
		5 points by EvgeniyZh 89 days ago \| past \| 1 comment
		LMArena is a cancer on AI (surgehq.ai)
		4 points by holdingunsteady 3 months ago \| past \| 1 comment
		LMArena Is a Plague on AI (surgehq.ai)
		3 points by cui 3 months ago \| past \| 1 comment
		RL Environments and the Hierarchy of Agentic Capabilities (surgehq.ai)
		4 points by echen 4 months ago \| past
		Wall Street Experts Tested GPT-5 and Claude. Both Struggled – Even with Excel (surgehq.ai)
		5 points by holdingunsteady 4 months ago \| past \| 1 comment
		Is Sonnet 4.5 the best coding model in the world? (surgehq.ai)
		2 points by egillie 4 months ago \| past
		A Product Take on Sonnet 4.5 (surgehq.ai)
		1 point by gk1 4 months ago \| past
		Unsexy AI Failures: The PDF That Broke ChatGPT (surgehq.ai)
		4 points by gk1 5 months ago \| past
		The Human/AI Frontier: A Conversation with Bogdan Grechuk (surgehq.ai)
		1 point by gk1 5 months ago \| past
		Unsexy AI Failures: Still Confidently Hallucinating Image Text (surgehq.ai)
		2 points by gk1 5 months ago \| past
		AI agents still can't solve 1/3 of SWE-Bench problems. Why not? (A Case Study) (surgehq.ai)
		1 point by egilliehhc 5 months ago \| past
		SWE-Bench Failures: When Coding Agents Spiral into 693 Lines of Hallucinations (surgehq.ai)
		22 points by landonxi 5 months ago \| past \| 1 comment
		Extracting text from a pdf broke ChatGPT (surgehq.ai)
		7 points by landonxi 5 months ago \| past \| 2 comments
		The PDF That Broke ChatGPT (surgehq.ai)
		2 points by jasong 5 months ago \| past
		SurgeAI Blog: Human Evals vs. Academic Benchmarks (surgehq.ai)
		1 point by Olshansky 6 months ago \| past
		Unsexy AI Failures: The PDF That Broke ChatGPT (surgehq.ai)
		1 point by pr337h4m 6 months ago \| past
		Introduction to Reinforcement Learning with Human Feedback (surgehq.ai)
		1 point by CarrieLab on Jan 25, 2023 \| past
		Explaining Reinforcement Learning with Human Feedback (RLHF) (surgehq.ai)
		11 points by echen on Jan 5, 2023 \| past
		We Evaluated ChatGPT vs. Google on 500 Search Queries (surgehq.ai)
		25 points by amrrs on Dec 26, 2022 \| past \| 11 comments
		We Evaluated ChatGPT vs. Google on 500 Search Queries (surgehq.ai)
		5 points by holdingunsteady on Dec 23, 2022 \| past
		ChatGPT vs. Google Search (surgehq.ai)
		3 points by antman on Dec 22, 2022 \| past
		ChatGPT Crushes Google on Coding Queries, and Matches It at General Information (surgehq.ai)
		11 points by echen on Dec 21, 2022 \| past \| 1 comment
		AI Red Teams for Adversarial Training: Making ChatGPT and LLMs More Robust (surgehq.ai)
		9 points by echen on Dec 13, 2022 \| past
		HellaSwag: 36% of this popular large language model benchmark contains errors (surgehq.ai)
		49 points by echen on Dec 6, 2022 \| past \| 8 comments
		The Violence, Racism, & Sexism Uncaught by Twitter's Content Moderation Systems (surgehq.ai)
		3 points by echen on Nov 17, 2022 \| past