The tool schema is not boilerplate. It is the interface between human intent and model behavior.
When you define a function for an LLM to call, you write a JSON schema: a name, a description, and a parameters object. That schema is transmitted to the model as part of its input. The model reads it, interprets it, and decides how to call your function. The quality of that decision depends almost entirely on the quality of what you wrote.
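As a minimal sketch of the three parts named above, here is a hypothetical tool definition written as a Python dict and serialized the way it might travel alongside a prompt. The tool name `list_orders`, its parameter, and the wording are illustrative assumptions; the exact wrapper fields around a definition vary by provider.

```python
import json

# Hypothetical tool definition: name, description, and parameters are
# illustrative and not taken from any real API.
tool = {
    "name": "list_orders",
    "description": "List orders placed on or after a given date.",
    "parameters": {
        "type": "object",
        "properties": {
            "start_date": {
                "type": "string",
                "description": "Earliest order date to include, as YYYY-MM-DD.",
            },
        },
        "required": ["start_date"],
    },
}

# The model never sees a Python object. It sees serialized text like this.
serialized = json.dumps(tool, indent=2)
print(serialized)
```

Everything the model knows about the function is in that printed text, which is why each string in it carries weight.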
This is a fact, not a guideline. Azure Databricks documents it plainly: "using a simpler JSON schema for function call definitions results in higher quality function call JSON generation." Heavily nested schemas result in lower quality generation. The platform restricts which JSON Schema features are allowed — because complex composition operators degrade model output.
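One rough way to see how nested a schema is: count the levels of objects and arrays below the root. The sketch below is an assumption for illustration only. The `schema_depth` helper and both example schemas are hypothetical, and the quoted documentation does not define a depth metric or a threshold.

```python
def schema_depth(schema: dict) -> int:
    """Count nesting levels below this node (0 for a scalar leaf)."""
    children = list(schema.get("properties", {}).values())
    items = schema.get("items")
    if isinstance(items, dict):
        children.append(items)
    if not children:
        return 0
    return 1 + max(schema_depth(child) for child in children)


# Hypothetical nested parameters: the dates sit three levels down.
nested = {
    "type": "object",
    "properties": {
        "filter": {
            "type": "object",
            "properties": {
                "date_range": {
                    "type": "object",
                    "properties": {
                        "start": {"type": "string"},
                        "end": {"type": "string"},
                    },
                },
            },
        },
    },
}

# The same information flattened into top-level parameters.
flat = {
    "type": "object",
    "properties": {
        "start_date": {"type": "string"},
        "end_date": {"type": "string"},
    },
}

print(schema_depth(nested))  # 3
print(schema_depth(flat))    # 1
```

Flattening is not always possible, but when two shapes carry the same information, the shallower one is the safer default.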
The failure modes are consistent across models:
- Generic parameter names (data1, data2) instead of intent-revealing names (start_date, order_id) reduce accuracy.
- Free-form fields where an enum is possible give the model unnecessary degrees of freedom, and it fills them badly.
- Missing or vague description fields force the model to guess both what a parameter does and when the tool applies.

The scale of the problem is not marginal. Research on small-to-mid-range models (4B–14B parameters) shows that JSON schema, a format designed for machine validation rather than language model interpretation, is itself a protocol mismatch. At production catalog sizes of 15 or more tools, JSON-baseline accuracy for these models falls to 0–49%. Not because the models are bad at reasoning. Because they were handed a spec written for a parser, not a reader.
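The three failure modes listed above are mechanical enough to check before a schema ever reaches a model. The following is a minimal sketch of such a check, not an established tool. The `lint_tool` function, the `GENERIC_NAME` pattern, and the example definition are all hypothetical, and the free-form-string warning is only a prompt for human review, since code cannot know whether a value set is closed.

```python
import re

# Assumed heuristic for "generic" names such as data1, arg2, value.
GENERIC_NAME = re.compile(r"^(data|arg|param|value|input|field|x)\d*$", re.IGNORECASE)


def lint_tool(tool: dict) -> list[str]:
    """Return human-readable warnings for one tool definition."""
    warnings = []
    if not (tool.get("description") or "").strip():
        warnings.append(f"tool '{tool.get('name')}' has no description")
    properties = tool.get("parameters", {}).get("properties", {})
    for name, spec in properties.items():
        if GENERIC_NAME.match(name):
            warnings.append(f"parameter '{name}' has a generic name")
        if not (spec.get("description") or "").strip():
            warnings.append(f"parameter '{name}' has no description")
        if spec.get("type") == "string" and "enum" not in spec:
            warnings.append(f"review: '{name}' is a free-form string; use an enum if the set is closed")
    return warnings


# Hypothetical definition exhibiting all three failure modes.
vague_tool = {
    "name": "order",
    "description": "",
    "parameters": {
        "type": "object",
        "properties": {
            "data1": {"type": "string"},
            "data2": {"type": "string"},
        },
    },
}

for warning in lint_tool(vague_tool):
    print(warning)
```

A check like this catches only the surface of the problem. It can tell that a description is missing, not that a present one is misleading.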
Reformatting the same schema into structured text — no model retraining, no fine-tuning, just a different representation — recovers accuracy to 65–90% across the same models. Phi-4 14B goes from 0% to 84.4% at 20 tools. The information content is identical. The format was the problem.
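To make "a different representation" concrete, the sketch below renders a JSON tool definition as indented plain text. This is not the format used in the cited research. The layout, the `render_tool` function, and the `refund_order` tool are assumptions chosen only to show that the same information can be re-expressed without touching the model.

```python
def render_tool(tool: dict) -> str:
    """Render a JSON tool definition as a short plain-text description."""
    params = tool.get("parameters", {})
    required = set(params.get("required", []))
    lines = [f"TOOL {tool['name']}: {tool.get('description', '')}".rstrip()]
    for name, spec in params.get("properties", {}).items():
        kind = spec.get("type", "any")
        if "enum" in spec:
            kind += " one of " + " | ".join(map(str, spec["enum"]))
        need = "required" if name in required else "optional"
        lines.append(f"  - {name} ({kind}, {need}): {spec.get('description', '')}".rstrip())
    return "\n".join(lines)


# Hypothetical tool used only to demonstrate the rendering.
refund_tool = {
    "name": "refund_order",
    "description": "Issue a refund for one order.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Identifier of the order."},
            "reason": {
                "type": "string",
                "enum": ["damaged", "late", "other"],
                "description": "Why the refund is issued.",
            },
        },
        "required": ["order_id"],
    },
}

print(render_tool(refund_tool))
```

The rendered text holds exactly the fields of the JSON definition: names, types, allowed values, required flags, and descriptions.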
Write the schema like you are writing documentation for a careful engineer. That is exactly what the model is doing: reading your documentation and trying to call your function correctly.
- Name functions by intent: get_order_status, not order.
- Use enum everywhere a closed set applies.
- Write a complete description, including what the function returns and when not to call it.

The model cannot tell you the schema was ambiguous. It just gets it wrong, silently. You profile the latency, interrogate the prompt, swap in a larger model. The schema was the problem the whole time.
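Applied to a hypothetical order-lookup tool, the three rules above might turn the first definition below into the second. Every name, enum value, and description here is an illustrative assumption, including what the function is said to return.

```python
import json

# Before: terse name, no descriptions, free-form strings throughout.
before = {
    "name": "order",
    "parameters": {
        "type": "object",
        "properties": {
            "data1": {"type": "string"},
            "data2": {"type": "string"},
        },
    },
}

# After: an intent-revealing name, an enum for the closed set, and a
# description that says what comes back and when not to call the tool.
after = {
    "name": "get_order_status",
    "description": (
        "Return the current fulfillment status of one existing order, "
        "with the time it last changed. Do not call this to create, "
        "modify, or cancel an order."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Identifier of the order to look up.",
            },
            "detail_level": {
                "type": "string",
                "enum": ["summary", "full"],
                "description": "How much status detail to return.",
            },
        },
        "required": ["order_id"],
    },
}

print(json.dumps(after, indent=2))
```

The second definition is longer, and every added line removes a guess the model would otherwise have to make.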
1. Sakizli, F. (2026). TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments. arXiv:2605.04107. https://arxiv.org/abs/2605.04107
2. Microsoft / Azure Databricks. Function calling on Azure Databricks. Microsoft Learn. https://learn.microsoft.com/en-us/azure/databricks/machine-learning/model-serving/function-calling
3. Meijer, E. (2025). From Function Frustrations to Framework Flexibility. ACM. https://doi.org/10.1145/3722544
4. Skripko, N. (2025). IFEval-FC: Instruction-Following Evaluation in Function Calling for Large Language Models. arXiv:2509.18420. https://arxiv.org/pdf/2509.18420
5. OpenAI. Function calling. OpenAI API Documentation. https://developers.openai.com/api/docs/guides/function-calling