Self-Consistency
Intent
Generate multiple independent reasoning paths for the same problem and select the most consistent answer through majority voting.
Problem
A single chain-of-thought can lead to wrong answers because the model happened to take a bad reasoning path. Different reasoning paths might lead to different answers, and you have no way to know which one is correct from a single sample.
Solution
Sample multiple reasoning chains independently (using temperature > 0) and take a majority vote on the final answers. The intuition is that correct reasoning paths tend to converge on the same answer, while incorrect paths tend to scatter across different wrong answers. The most common answer across many samples is therefore the most likely to be correct. This is essentially an ensemble method applied to LLM reasoning.
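The voting step itself is small. A minimal sketch, assuming the final answers have already been extracted as strings and that equal answers are spelled identically; `majority_vote` is a hypothetical helper name used only for illustration:

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and its share of the votes.

    Hypothetical helper; assumes `answers` is non-empty.
    """
    votes = Counter(answers)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(answers)


# Four sampled chains, three of which agree.
print(majority_vote(["42", "42", "37", "42"]))  # ('42', 0.75)
```

Ties go to whichever answer was seen first, since `Counter.most_common` orders equal counts by first occurrence.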
Diagram
                             ┌→ [Reasoning Path 1] → Answer: 42
                             │
Question → [Sample N paths] ─┼→ [Reasoning Path 2] → Answer: 42
                             │
                             ├→ [Reasoning Path 3] → Answer: 37
                             │
                             └→ [Reasoning Path 4] → Answer: 42

                                        Majority Vote → 42 ✓
When to Use
- Mathematical reasoning where there's a single correct answer
- When you need high confidence in the result
- Tasks where multiple reasoning approaches exist
- Critical decisions where errors are costly
When NOT to Use
- Open-ended creative tasks with no single correct answer
- When cost per query is a constraint (requires N× calls)
- Simple tasks where the model rarely makes errors
Pros & Cons
Pros
- Significantly higher accuracy than single-sample CoT
- Simple to implement — just sample and vote
- No additional training or fine-tuning needed
- Confidence correlates with vote margin
Cons
- N× cost increase (typically 5-40 samples needed)
- Only works for tasks with definitive answers
- Higher latency if samples aren't parallelized
- Diminishing returns beyond a certain sample count
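On the latency point: the N samples are independent, so they can be issued concurrently. A rough sketch using a thread pool follows. `sample_answer` is a hypothetical stand-in for one model call (it returns canned answers so the example runs offline), and the worker count is an assumption to tune against your provider's rate limits.

```python
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def sample_answer(i: int) -> str:
    """Hypothetical stand-in for one LLM call; replace with a real request."""
    time.sleep(0.05)  # simulate network latency
    canned = ["156", "156", "142", "156", "156.5"]
    return canned[i % len(canned)]


def sample_in_parallel(n_samples: int, max_workers: int = 5) -> list[str]:
    # Samples are independent, so they can run side by side; wall-clock
    # time is roughly one call per batch of `max_workers` requests.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(sample_answer, range(n_samples)))


answers = sample_in_parallel(5)
print(Counter(answers).most_common(1)[0])  # ('156', 3)
```

The total number of calls, and therefore the cost, is unchanged; only the waiting time shrinks.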
Implementation Steps
1. Identify tasks where the model gives inconsistent answers
2. Generate N reasoning chains with temperature > 0 (typically N=5 to 40)
3. Extract the final answer from each chain
4. Apply majority voting (or weighted voting) to select the answer
5. Use vote margin as a confidence signal
6. Tune N based on your accuracy/cost tradeoff
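Steps 3 to 5 can be sketched as follows. This assumes each chain marks its result with `ANSWER: <value>` and that answers are mostly numeric; the regular expression, the normalization rule, and the helper names are illustrative assumptions, not a fixed recipe. Vote margin is taken here as the gap between the top two vote shares, and only plain (unweighted) voting is shown.

```python
import re
from collections import Counter
from typing import Optional

ANSWER_RE = re.compile(r"ANSWER:\s*(.+)")


def extract_answer(chain: str) -> Optional[str]:
    """Step 3: return the text after the last 'ANSWER:' marker, if any."""
    matches = ANSWER_RE.findall(chain)
    return matches[-1].strip() if matches else None


def normalize(answer: str) -> str:
    """Map equivalent numeric spellings ('156', '156.0', '$156.') to one key."""
    cleaned = answer.replace(",", "").lstrip("$ ").rstrip(". ")
    try:
        value = float(cleaned)
    except ValueError:
        return cleaned.lower()  # non-numeric answers: compare case-insensitively
    return str(int(value)) if value.is_integer() else str(value)


def vote(chains: list[str]) -> dict:
    """Steps 4-5: majority vote, with vote margin as a confidence signal."""
    answers = [normalize(a) for c in chains if (a := extract_answer(c)) is not None]
    if not answers:
        return {"answer": None, "margin": 0.0, "votes": {}}
    counts = Counter(answers)
    ranked = counts.most_common(2)
    top = ranked[0][1]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return {
        "answer": ranked[0][0],
        "margin": (top - runner_up) / len(answers),
        "votes": dict(counts),
    }


chains = [
    "12 * 13 = 156. ANSWER: 156",
    "Adding it up gives ANSWER: 156.0",
    "ANSWER: $156.",
    "I think it is ANSWER: 142",
    "ANSWER: 156.5",
]
print(vote(chains))
```

Without normalization, '156', '156.0' and '$156.' would be counted as three different answers and split the vote, so this step matters more than its size suggests.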
Real-World Example
Arithmetic Word Problem
For a complex word problem, 5 reasoning chains are generated. Three arrive at '156', one at '142', one at '156.5'. Majority vote selects '156' with 60% confidence (3 of 5 votes). The two incorrect paths made different mistakes, while the three correct paths converged.
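from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def self_consistency(question: str, n_samples: int = 5) -> dict:
    answers = []
    for _ in range(n_samples):
        # temperature > 0 so that each call samples a different reasoning path
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=0.7,
            messages=[{
                "role": "user",
                "content": f"{question}\n\nThink step by step. State your final answer after ANSWER:",
            }],
        )
        text = response.choices[0].message.content or ""
        # Keep only chains that followed the requested output format
        if "ANSWER:" in text:
            answers.append(text.split("ANSWER:")[-1].strip())

    # If no chain produced a parseable answer, return early instead of
    # hitting an IndexError or a division by zero below
    if not answers:
        return {"answer": None, "confidence": 0.0, "votes": {}}

    votes = Counter(answers)
    winner, count = votes.most_common(1)[0]
    # Confidence is the winner's share of the parsed answers
    return {"answer": winner, "confidence": count / len(answers), "votes": dict(votes)}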