Fine-tuning vs Prompting

NLP
LLM
Published

February 18, 2025

Problem

Large Language Models (LLMs) have become incredibly powerful, but they often need adaptation to perform specific tasks effectively. While fine-tuning model weights was traditionally the default approach, the massive size of modern LLMs makes it less practical. With LLMs’ strong general capabilities, prompt engineering has become a preferred method for adaptation. But how do these approaches compare in practice?

This post explores this question through a case study on joint entity-relation extraction, a fundamental NLP task used in constructing knowledge graphs. The task involves identifying both entities and their relationships from unstructured text.

Consider the following input text:

“Buzz Aldrin was born in Glen Ridge, New Jersey on January 20th, 1930.”

The extracted triples could be:

  • Buzz Aldrin | birth place | Glen Ridge, New Jersey

  • Buzz Aldrin | birth date | January 20th, 1930

Method and Experimental Setup

For the analysis, the WebNLG dataset was used, where each example consists of a natural language text and its corresponding entity-relation triples. The dataset was preprocessed by converting relations to plain English (e.g., “birthPlace” to “birth place”) and by randomly concatenating 1–7 samples to create texts of varying lengths. This approach was intended to simulate real-world documents, where context and content vary in length and complexity. However, this method may not fully capture the natural coherence and structural characteristics of actual documents. A conversation-format version with user-assistant message pairs was also created for fine-tuning. The preprocessed dataset is available on Hugging Face. The train split was used for model adaptation and the dev split for benchmarking.
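
For concreteness, here is a minimal sketch of this preprocessing step. It assumes a list of WebNLG samples with `text` and `triples` fields; the function and field names are illustrative rather than the exact script used for the experiments.

```python
import random
import re

def relation_to_plain_english(relation: str) -> str:
    """Convert a camelCase relation such as 'birthPlace' into 'birth place'."""
    return re.sub(r"(?<!^)(?=[A-Z])", " ", relation).lower()

def make_example(samples: list) -> dict:
    """Merge several WebNLG samples into one longer text with the union of their triples."""
    text = " ".join(s["text"] for s in samples)
    triples = [
        f"{subj} | {relation_to_plain_english(rel)} | {obj}"
        for s in samples
        for subj, rel, obj in s["triples"]
    ]
    return {"text": text, "triples": triples}

def build_dataset(webnlg: list, n_examples: int, seed: int = 0) -> list:
    """Create n_examples texts, each concatenating 1-7 randomly chosen samples."""
    rng = random.Random(seed)
    return [make_example(rng.sample(webnlg, rng.randint(1, 7))) for _ in range(n_examples)]
```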

Experiments were conducted using Llama-3 8B, comparing fine-tuning and prompting approaches.

For fine-tuning, LoRA (Hu et al. 2021) with rank 64 was used to efficiently adapt the model weights with varying amounts of training data (16 to 8,870 examples). The configuration, sketched in code after the list, included:

  • Learning rate: 1.0e-4

  • Batch size: 8

  • Training epochs: 1

  • Scheduler: cosine with 0.1 warmup ratio

  • Mixed precision: bf16
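
The exact training stack is not spelled out here; the sketch below shows one way these hyperparameters map onto the Hugging Face `peft` and `transformers` APIs. The LoRA alpha, the target modules, and the instruct variant of the base model are assumptions rather than reported values, and the Trainer/data wiring is omitted.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# LoRA adapter with rank 64, as listed above; alpha and target modules are assumptions.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # assumption: instruct variant of Llama-3 8B
    torch_dtype=torch.bfloat16,
)
model = get_peft_model(base, lora_config)

# Hyperparameters from the list above.
training_args = TrainingArguments(
    output_dir="llama3-8b-webnlg-lora",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
)
```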

For prompt optimization, the DSPy library (Khattab et al. 2024) was used to optimize prompts with different numbers of few-shot examples (0 to 24). All experiments used greedy decoding (temperature=0.0) to maintain consistency and reproducibility.
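
A minimal DSPy sketch of this setup is shown below. The signature, the toy training example, and the way the Llama-3 8B endpoint is wired in (`dspy.LM` against an OpenAI-compatible server) are assumptions for illustration; MIPROv2 is used here as the example optimizer, and the BFSRS optimizer in Table 1 corresponds to DSPy's `BootstrapFewShotWithRandomSearch`.

```python
import dspy

class ExtractTriples(dspy.Signature):
    """Extract entity-relation triples from the text."""
    text: str = dspy.InputField()
    triples: str = dspy.OutputField(desc="one 'subject | relation | object' triple per line")

# Assumption: Llama-3 8B is served behind an OpenAI-compatible endpoint.
dspy.configure(lm=dspy.LM("openai/meta-llama/Meta-Llama-3-8B-Instruct",
                          api_base="http://localhost:8000/v1", temperature=0.0))

program = dspy.Predict(ExtractTriples)

def triple_f1(example, pred, trace=None):
    """Exact-match F1 between predicted and gold triple sets."""
    gold = {t.strip().lower() for t in example.triples.splitlines() if t.strip()}
    out = {t.strip().lower() for t in pred.triples.splitlines() if t.strip()}
    if not gold or not out:
        return 0.0
    p, r = len(gold & out) / len(out), len(gold & out) / len(gold)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Toy train set for illustration; in practice, load the preprocessed WebNLG train split.
trainset = [dspy.Example(text="Buzz Aldrin was born in Glen Ridge, New Jersey.",
                         triples="Buzz Aldrin | birth place | Glen Ridge, New Jersey").with_inputs("text")]

optimizer = dspy.MIPROv2(metric=triple_f1, auto="medium")
optimized = optimizer.compile(program, trainset=trainset,
                              max_bootstrapped_demos=4, max_labeled_demos=4)
```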

Performance was evaluated using three key metrics:

  1. Exact Match F1: This measures the model’s ability to extract triples that exactly match the ground truth, considering both precision and recall.

  2. Fuzzy Match F1: This more lenient metric allows for minor variations in entity boundaries and relation phrases while still capturing the core semantic relationships.

  3. Pairwise Comparisons: Qwen-2.5-32B was used to compare outputs from different models using the LLM-as-a-judge technique (Zheng et al. 2023). For each test example, the judge was presented with:

    • The input text
    • Two competing model outputs (triples, with randomized order to eliminate position bias)
    • A rubric emphasizing factual correctness, completeness, and precision

The judge was instructed to evaluate the triples based on key criteria including completeness of relationship extraction, factual accuracy, strict alignment with source text, precision without overgeneration, absence of redundancy, and informative representation. For each comparison, the judge provided a concise explanation of its reasoning before making a final decision of A, B, or DRAW. These decisions were then aggregated into ELO ratings, providing a more nuanced evaluation of output quality beyond string matching. This comparison was applied to 100 samples randomly chosen from the dev split.
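
The post reports the aggregated ELO ratings but not the exact update scheme, so the sketch below shows a standard Elo update with a 1500 starting rating and an assumed K-factor of 16; the model identifiers and decisions are illustrative.

```python
def update_elo(ratings: dict, a: str, b: str, outcome: str, k: float = 16.0) -> None:
    """Apply one Elo update for a judge decision of 'A', 'B', or 'DRAW'."""
    expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
    score_a = {"A": 1.0, "B": 0.0, "DRAW": 0.5}[outcome]
    ratings[a] += k * (score_a - expected_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))

# Illustrative identifiers and decisions; each judged pair on the 100 sampled
# dev examples contributes one decision.
ratings = {"ft-8870": 1500.0, "prompt-miprov2-8": 1500.0}
judge_decisions = [("ft-8870", "prompt-miprov2-8", "B"),
                   ("ft-8870", "prompt-miprov2-8", "DRAW")]
for model_a, model_b, decision in judge_decisions:
    update_elo(ratings, model_a, model_b, decision)
print(ratings)
```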

Results

The evaluation used both traditional metrics (F1 scores) and pairwise comparisons judged by another LLM. Here are the key results:

Table 1: Comparison of fine-tuning and prompting approaches across different metrics.

| Adaptation Method | Num Samples | Optimizer      | Exact F1 | Fuzzy F1 | ELO Rating |
|-------------------|-------------|----------------|----------|----------|------------|
| Fine-tuning       | 8,870       | AdamW          | 0.94     | 0.98     | 1502.4     |
| Fine-tuning       | 1,000       | AdamW          | 0.81     | 0.93     | 1506.5     |
| Fine-tuning       | 100         | AdamW          | 0.62     | 0.84     | 1448.4     |
| Fine-tuning       | 16          | AdamW          | 0.07     | 0.24     | 1430.9     |
| Prompting         | 24          | BFSRS-High     | 0.32     | 0.69     | 1507.8     |
| Prompting         | 8           | MIPROv2-Medium | 0.24     | 0.63     | 1510.6     |
| Prompting         | 0           | None           | 0.02     | 0.34     | 1472.4     |

Figure 1: The impact of the number of few-shot examples on the F1 score.

The plot demonstrates that F1 scores initially improve rapidly with additional few-shot examples but begin to plateau after around 16 examples. This saturation effect suggests there may be limited benefit to including more than 16 examples in the prompt, as the marginal gains diminish.

Analysis of these results reveals several key patterns:

  1. Fine-tuning demonstrates strong but nuanced performance:
    • Performance scales clearly with data, improving from 0.07 to 0.94 Exact F1 score as shown in Table 1
    • The full dataset (8,870 examples) achieves the best F1 results
    • With very limited data (16 examples), fine-tuning performs poorly
    • With a moderate amount of data (100 examples), however, the model already produces solid results
    • Interestingly, using 1,000 examples led to higher ELO ratings than using 8,870 examples, suggesting possible overfitting to the dataset
    • This indicates that beyond a certain point, additional training data may lead to diminishing returns

  2. Prompt optimization demonstrates remarkable efficiency:
    • Using MIPROv2-Medium with just 8 examples achieved the highest ELO rating (1510.6) despite lower F1 scores
    • F1 scores improve rapidly with additional examples but plateau around 16 examples
    • These results indicate that prompt optimization may be more sample-efficient for achieving real-world quality
    • The approach requires significantly fewer computational resources than fine-tuning

  3. Zero-shot prompting underscores the importance of adaptation:
    • Without any examples, performance is poor (0.02 Exact F1, 0.34 Fuzzy F1)
    • The zero-shot approach yields the lowest F1 scores and an ELO rating (1472.4) well below both optimized prompting configurations
    • Adding even a few examples leads to dramatic improvements, as illustrated in Figure 1
    • This clearly demonstrates that task-specific guidance is crucial for effective performance

Understanding Metric Discrepancy

The gap between classical metrics (F1 scores) and ELO ratings reveals important insights about evaluation:

  1. Classical metrics like F1 scores measure exact matching against ground truth, which may be overly strict:
    • Minor variations in phrasing that preserve meaning are penalized
    • Alternative but valid entity boundaries may reduce scores
    • Semantically equivalent relationships expressed differently lower F1 scores

  2. ELO ratings from the judge model capture more nuanced aspects:
    • Factual correctness and alignment with source text
    • Completeness of relationship extraction
    • Precision without redundancy or overgeneration
    • Overall quality and usefulness of the extracted information

This explains why prompt optimization can achieve high ELO ratings despite lower F1 scores: it may produce high-quality outputs that differ slightly from the exact ground-truth format.
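
The exact fuzzy-matching rule behind the Fuzzy F1 metric is not spelled out in this post, so the sketch below is only an illustration of the distinction: it treats two triples as a fuzzy match when a character-level similarity ratio clears a threshold, which tolerates exactly the kinds of variations listed above while exact matching does not.

```python
from difflib import SequenceMatcher

def normalize(triple: str) -> str:
    return " ".join(triple.lower().split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def fuzzy_match(pred: str, gold: str, threshold: float = 0.8) -> bool:
    # Tolerates minor variations in entity boundaries or relation phrasing.
    return SequenceMatcher(None, normalize(pred), normalize(gold)).ratio() >= threshold

gold = "Cornell University | city | Ithaca, New York"
pred = "Cornell University | city | Ithaca"
print(exact_match(pred, gold))  # False: the entity boundary differs
print(fuzzy_match(pred, gold))  # True: the core relationship is preserved
```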

Practical Implementation Considerations

The choice between fine-tuning and prompt optimization involves several trade-offs:

  1. Computational Resources:
    • Fine-tuning requires significant compute for training and access to the model weights
    • Prompt optimization needs only inference resources; hence, it is applicable to proprietary models as well

  2. Deployment Flexibility:
    • Prompts can be modified quickly in production
    • Fine-tuned models require redeployment
    • Iteration is consequently much faster with prompting

  3. Data Requirements:
    • Fine-tuning needs large datasets for best results
    • Prompt optimization works well with limited examples
    • Data collection and preparation costs differ

  4. Maintenance:
    • Prompts are easier to version control and update
    • Fine-tuned models require careful weight management
    • Debugging complexity varies between approaches

Limitations and Future Work

The analysis has several limitations that point to interesting future directions:

  1. LLM-as-a-Judge Limitations:
    • A 32B parameter model was used for pairwise comparisons. Using larger models like Llama-70B or GPT-4o could provide more reliable evaluations.
    • The evaluation prompt could be further optimized. For instance, self-consistency prompting (Wang et al. 2022) could be employed to improve the quality of the judge LLM.

  2. Task and Dataset Coverage:
    • Results are specific to entity-relation extraction. Testing on other NLP tasks (classification, Q&A) would provide broader insights.
    • Additional datasets would help validate findings.

  3. Statistical Significance of Results:
    • The differences in ELO ratings and F1 scores across configurations are relatively small.
    • Future work should include statistical significance testing to determine if these differences are meaningful.

Conclusion

The experiments reveal important nuances in the performance comparison between fine-tuning and prompt optimization approaches. While fine-tuning demonstrated superior performance in string matching metrics, these automated evaluations may not comprehensively reflect real-world utility. The ELO ratings from pairwise comparisons using a 32B parameter judge model suggest that prompt optimization can achieve comparable effectiveness in practice.

Based on this empirical analysis, we can draw several key conclusions about choosing between these approaches:

  • Fine-tuning demonstrates exceptional performance on tasks with clear, quantifiable metrics like exact string matching, as it can directly optimize for these specific objectives during training.
  • For tasks involving complex or subjective evaluation criteria, prompt optimization shows stronger performance, as evidenced by the ELO ratings.
  • When substantial training data (hundreds to thousands of examples), computational resources for training, and access to model weights are available, fine-tuning provides consistent and robust performance improvements.
  • In resource-constrained scenarios with limited data, compute, or model access, well-designed prompt optimization strategies can achieve competitive results while being more practical to implement.
  • Organizations requiring rapid experimentation or flexible deployment will benefit from prompt optimization’s easier versioning and updating capabilities.
  • For applications where automated evaluation metrics closely align with business success criteria, fine-tuning’s ability to directly optimize these metrics makes it particularly attractive.

The code for the experiments is available at this repository.

References

  1. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. Khattab, Omar and Singhvi, Arnav and Maheshwari, Paridhi and Zhang, Zhiyuan and Santhanam, Keshav and Vardhamanan, Sri and Haq, Saiful and Sharma, Ashutosh and Joshi, Thomas T. and Moazam, Hanna and Miller, Heather and Zaharia, Matei and Potts, Christopher. (2024). In The Twelfth International Conference on Learning Representations.
  2. WebNLG
  3. Llama-3 8B
  4. Qwen-2.5-32B
  5. LoRA: Low-Rank Adaptation of Large Language Models. Hu, J.E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., & Chen, W. (2021). ArXiv, abs/2106.09685.
  6. Self-Consistency Improves Chain of Thought Reasoning in Language Models. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E.H., & Zhou, D. (2022). ArXiv, abs/2203.11171.
  7. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Zheng, L., Chiang, W., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.P., Zhang, H., Gonzalez, J.E., & Stoica, I. (2023). ArXiv, abs/2306.05685.

Appendix

Example Comparisons

To further illustrate the differences between fine-tuning and prompt optimization, I present a few selected examples from the model comparison, where A and B denote the best-performing fine-tuned and prompt-optimized models, respectively:

Example 1: Decision was A

**Text:**  
Cornell University is in Ithaca, New York and their nickname is Cornell Big Red. They are the publisher of Administrative Science Quarterly and are affiliated with the Association of American Universities.

**Triples A:**  
- "Administrative Science Quarterly | publisher | Cornell University"
- "Cornell University | nickname | Cornell Big Red" 
- "Cornell University | affiliation | Association of American Universities"
- "Cornell University | city | Ithaca, New York"

**Triples B:**  
- "Cornell University | city | Ithaca, New York"
- "Cornell University | nickname | Cornell Big Red"
- "Cornell University | academic publisher | Administrative Science Quarterly"
- "Cornell University | affiliation | Association of American Universities"


**Explanation:**  
Set A correctly identifies 'Cornell University' as the publisher of 'Administrative Science Quarterly', whereas Set B incorrectly reverses this relationship. Therefore, Set A is more correct and complete.

Example 2: Decision was B

  **Text:**  
  Alpena County Regional Airport runway name is 1/19. Swords is a part of the Dublin European Parliamentary constituency. The United States (with a leader, John Roberts) boasts the Bacon Explosion. American Abraham A. Ribicoff was born in New Britain, Connecticut and is married to Ruth Ribicoff. One of the ethnic groups in the U.S. are Native Americans. Georgia is in the country of United States. Atlanta; has a total area of 347.1 square kilometres, a population density of 1299.0 inhabitants per square kilometre, and the area codes: 404, 678, and 470. The birthplace of Alfred Garth Jones is Manchester, (part of Greater Manchester), England. In Manchester the Labour Party is in the majority.

  **Triples A:**  
  - "Alpena County Regional Airport | runway name | 1/19"
  - "Swords, Dublin | is part of | Dublin (European Parliament constituency)"
  - "Bacon Explosion | country | United States"
  - "United States | leader | John Roberts"
  - "Abraham A. Ribicoff | birth place | New Britain, Connecticut"
  - "Abraham A. Ribicoff | spouse | Ruth Ribicoff"
  - "United States | ethnic group | Native Americans in the United States"
  - "Abraham A. Ribicoff | nationality | American"
  - "Georgia (U.S. state) | country | United States"
  - "Atlanta | area code | 404, 678, 470"
  - "Atlanta | population density | 1299.0 (inhabitants per square kilometre)"
  - "Atlanta | area total | 347.1 (square kilometres)"
  - "Alfred Garth Jones | birth place | England"
  - "Alfred Garth Jones | birth place | Manchester"
  - "Manchester | is part of | Greater Manchester"
  - "Manchester | leader | Labour Party (UK)"

  **Triples B:**  
  - "Alpena County Regional Airport | runway name | 1/19"
  - "Swords | is part of | Dublin (European Parliament constituency)"
  - "United States | leader | John Roberts"
  - "United States | dish variation | Bacon Explosion"
  - "Abraham A. Ribicoff | spouse | Ruth Ribicoff"
  - "Abraham A. Ribicoff | birth place | New Britain, Connecticut"
  - "United States | ethnic group | Native Americans in the United States"
  - "Georgia (U.S. state) | country | United States"
  - "Atlanta | area code | 404, 678, 470"
  - "Atlanta | population density | 1299.0 (inhabitants per square kilometre)"
  - "Atlanta | total area | 347.1 (square kilometres)"
  - "Alfred Garth Jones | birth place | Manchester, England"
  - "Manchester | has to its majority | Labour Party"
  - "Manchester | is part of | Greater Manchester"

  **Explanation:**  
  Model B's triples are complete, correct, and consistent with the text. It correctly states the birth place of Alfred Garth Jones as 'Manchester, England' and avoids the duplication issue. Therefore, Model B's triples are better.

Example 3: Decision was DRAW

  **Text:**  
  The ISBN number of A Long Long Way is 0-670-03380-4.

  **Triples A:**  
  - "A Long Long Way | isbn number | 0-670-03380-4"

  **Triples B:**  
  - "A Long Long Way | isbn number | 0-670-03380-4"

  **Explanation:**  
  Both model A and model B extract the same triple, which is complete, correct, consistent with the text, and does not contain duplicates. Therefore, both models perform equally well.