The Schema Is the Spec

The tool schema is not boilerplate. It is the interface between human intent and model behavior.

When you define a function for an LLM to call, you write a JSON schema: a name, a description, and a parameters object. That schema is transmitted to the model as part of its input. The model reads it, interprets it, and decides how to call your function. The quality of that decision depends almost entirely on the quality of what you wrote.
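As a minimal sketch of the three parts named above, here is a hypothetical tool definition written as a Python dictionary and serialized to JSON. The tool name, the parameter, and the wording are illustrative assumptions, and the exact envelope around these fields differs between providers.

```python
import json

# Hypothetical tool definition: name, description, parameters.
# Every identifier and description here is illustrative.
get_order_status = {
    "name": "get_order_status",
    "description": "Look up the current fulfillment status of one order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order identifier, for example 'ORD-10293'.",
            }
        },
        "required": ["order_id"],
    },
}

# The serialized form is the text that reaches the model in some shape.
serialized = json.dumps(get_order_status, indent=2)
print(serialized)
```

Everything the model knows about the function is in that text, which is why each string in it deserves the same care as a public API doc.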

This is a fact, not a guideline. Azure Databricks documents it plainly: "using a simpler JSON schema for function call definitions results in higher quality function call JSON generation." Heavily nested schemas result in lower-quality generation. The platform supports only a subset of JSON Schema features for the same reason: complex composition operators degrade model output.
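One way to act on this is to measure a schema before shipping it. The sketch below is an assumption-laden helper, not a platform API: it counts levels of nested objects and reports composition keywords. The keyword set is an illustrative choice rather than any platform's exact list, and a property that happens to be named like a keyword would be reported too.

```python
# Illustrative keyword set; not an official list from any platform.
COMPOSITION_KEYWORDS = {"anyOf", "oneOf", "allOf", "$ref"}


def max_object_depth(schema):
    """Return how many levels of nested 'object' schemas the schema contains."""
    if not isinstance(schema, dict):
        return 0
    children = list(schema.get("properties", {}).values())
    items = schema.get("items")
    if isinstance(items, dict):
        children.append(items)
    deepest_child = max((max_object_depth(c) for c in children), default=0)
    own_level = 1 if schema.get("type") == "object" else 0
    return own_level + deepest_child


def find_composition_keywords(schema, path="parameters"):
    """Return dotted paths to every composition keyword found in the schema."""
    found = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            child_path = f"{path}.{key}"
            if key in COMPOSITION_KEYWORDS:
                found.append(child_path)
            found.extend(find_composition_keywords(value, child_path))
    elif isinstance(schema, list):
        for index, value in enumerate(schema):
            found.extend(find_composition_keywords(value, f"{path}[{index}]"))
    return found


# Hypothetical schemas for comparison.
nested = {
    "type": "object",
    "properties": {
        "filter": {
            "type": "object",
            "properties": {
                "range": {
                    "type": "object",
                    "properties": {
                        "start": {"anyOf": [{"type": "string"}, {"type": "null"}]}
                    },
                }
            },
        }
    },
}
flat = {"type": "object", "properties": {"start_date": {"type": "string"}}}

print(max_object_depth(nested), find_composition_keywords(nested))
print(max_object_depth(flat), find_composition_keywords(flat))
```

A check like this fits naturally in a unit test or a pre-commit hook, with the depth threshold chosen by the team.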

The failure modes are consistent across models:

Generic parameter names (data1, data2) instead of intent-revealing names (start_date, order_id) reduce accuracy.
Open string fields where an enum is possible give the model unnecessary degrees of freedom — and it fills them badly.
Missing or vague description fields force the model to guess both what a parameter does and when the tool applies.
Deeply nested objects push relevant information far from the model's attention. Flat schemas outperform nested ones.
Overlapping tool descriptions with identical vocabulary cause the model to pick the wrong tool, consistently.
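Several of these failure modes can be caught mechanically. The following is a rough lint sketch under stated assumptions: the regular expression for "generic" names and the word-overlap measure are heuristics chosen for illustration, and nothing here detects a string field that should have been an enum, which still needs a human reviewer.

```python
import re

# Heuristic pattern for names such as data1, arg2, param, x3 (an assumption).
GENERIC_NAME = re.compile(r"^(data|arg|param|value|input|field|x)_?\d*$", re.IGNORECASE)


def lint_tool(tool):
    """Return a list of warnings for one tool definition."""
    warnings = []
    if not tool.get("description", "").strip():
        warnings.append(f"{tool['name']}: tool has no description")
    properties = tool.get("parameters", {}).get("properties", {})
    for name, spec in properties.items():
        if GENERIC_NAME.match(name):
            warnings.append(f"{tool['name']}.{name}: generic parameter name")
        if not spec.get("description", "").strip():
            warnings.append(f"{tool['name']}.{name}: parameter has no description")
        if spec.get("type") == "object":
            warnings.append(f"{tool['name']}.{name}: nested object, consider flattening")
    return warnings


def description_overlap(tool_a, tool_b):
    """Jaccard similarity of the words used in two tool descriptions (0.0 to 1.0)."""
    words_a = set(re.findall(r"[a-z]+", tool_a.get("description", "").lower()))
    words_b = set(re.findall(r"[a-z]+", tool_b.get("description", "").lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Hypothetical tools that exhibit the problems listed above.
bad = {
    "name": "order",
    "description": "",
    "parameters": {
        "type": "object",
        "properties": {
            "data1": {"type": "string"},
            "data2": {"type": "string", "description": "End of range"},
        },
    },
}
lookup_order = {"name": "lookup_order", "description": "Get information about an order"}
find_order = {"name": "find_order", "description": "Get order information"}

for warning in lint_tool(bad):
    print(warning)
print(description_overlap(lookup_order, find_order))
```

A high overlap score between two tools is a prompt to rewrite one description so that it names what the other tool handles.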

The problem is not marginal. Research on small-to-mid-range models (4B–14B parameters) shows that JSON Schema, a format designed for machine validation rather than language model interpretation, is itself a protocol mismatch. At production catalog sizes of 15 or more tools, JSON-baseline accuracy for these models falls to 0–49%. Not because the models are bad at reasoning. Because they were handed a spec written for a parser, not a reader.

Reformatting the same schema into structured text — no model retraining, no fine-tuning, just a different representation — recovers accuracy to 65–90% across the same models. Phi-4 14B goes from 0% to 84.4% at 20 tools. The information content is identical. The format was the problem.
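To make the idea of "a different representation" concrete, here is a small sketch that renders a tool definition as indented text. The output layout is invented for illustration and is not the format used in the cited research; the tool and its fields are hypothetical.

```python
def render_tool_as_text(tool):
    """Render one tool definition as indented, human-readable text."""
    params = tool.get("parameters", {})
    required = set(params.get("required", []))
    lines = [
        f"TOOL {tool['name']}",
        f"  purpose: {tool.get('description', '')}",
        "  arguments:",
    ]
    for name, spec in params.get("properties", {}).items():
        kind = spec.get("type", "any")
        if "enum" in spec:
            # Spell out the closed set of allowed values inline.
            kind += " one of " + " | ".join(str(v) for v in spec["enum"])
        flag = "required" if name in required else "optional"
        lines.append(f"    {name} ({kind}, {flag}): {spec.get('description', '')}")
    return "\n".join(lines)


# Hypothetical tool definition used only to show the rendering.
tool = {
    "name": "get_order_status",
    "description": "Look up the status of one order.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order identifier."},
            "detail": {
                "type": "string",
                "enum": ["summary", "full"],
                "description": "How much detail to return.",
            },
        },
        "required": ["order_id"],
    },
}

text = render_tool_as_text(tool)
print(text)
```

The rendering carries the same names, types, and constraints as the JSON; only the surface form changes, which is the point the accuracy figures above make.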

Write the schema like you are writing documentation for a careful engineer. That is exactly what the model is doing: reading your documentation and trying to call your function correctly.

Name things with verbs and intent: get_order_status, not order.
Use enum everywhere a closed set applies.
Put the key context in the description, including what the function returns and when not to call it.
Flatten where possible. Nest only when structure is genuinely meaningful.
Disambiguate overlapping tools explicitly — name what the other tool handles.
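The recommendations above might look like this when applied to a pair of related tools. Both tools, their parameters, and the status values are hypothetical. The small checker at the end is also an assumption, included only to show how an enum narrows the set of acceptable calls.

```python
# Hypothetical tools: verb-first names, flat parameters, enums for closed sets,
# and descriptions that state the return value and what the other tool handles.
get_order_status = {
    "name": "get_order_status",
    "description": (
        "Return the fulfillment status of one order as one of: pending, "
        "shipped, delivered, cancelled. Do not call this for carrier "
        "tracking events; use get_shipment_tracking for those."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order identifier, for example 'ORD-10293'.",
            },
            "detail_level": {
                "type": "string",
                "enum": ["summary", "full"],
                "description": "Use 'summary' for status only, 'full' to include line items.",
            },
        },
        "required": ["order_id", "detail_level"],
    },
}

get_shipment_tracking = {
    "name": "get_shipment_tracking",
    "description": (
        "Return the carrier scan events for one shipment, newest first. "
        "Do not call this for order-level status; use get_order_status for that."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "tracking_number": {
                "type": "string",
                "description": "Carrier tracking number printed on the label.",
            }
        },
        "required": ["tracking_number"],
    },
}


def invalid_arguments(tool, arguments):
    """Return problems with a proposed call; an empty list means acceptable."""
    params = tool["parameters"]
    properties = params["properties"]
    problems = [
        f"{name}: missing required argument"
        for name in params.get("required", [])
        if name not in arguments
    ]
    for name, value in arguments.items():
        if name not in properties:
            problems.append(f"{name}: unknown argument")
        elif "enum" in properties[name] and value not in properties[name]["enum"]:
            problems.append(f"{name}: {value!r} is not one of {properties[name]['enum']}")
    return problems


print(invalid_arguments(get_order_status, {"order_id": "ORD-1", "detail_level": "verbose"}))
print(invalid_arguments(get_order_status, {"order_id": "ORD-1", "detail_level": "full"}))
```

With an open string in place of the enum, the first call would have passed unnoticed; the closed set turns a silent misreading into a visible error.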

The model cannot tell you the schema was ambiguous. It just gets it wrong, silently. You profile the latency, interrogate the prompt, swap in a larger model. The schema was the problem the whole time.

References

1. Sakizli, F. (2026). TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments. arXiv:2605.04107. https://arxiv.org/abs/2605.04107

2. Microsoft / Azure Databricks. Function calling on Azure Databricks. Microsoft Learn. https://learn.microsoft.com/en-us/azure/databricks/machine-learning/model-serving/function-calling

3. Meijer, E. (2025). From Function Frustrations to Framework Flexibility. ACM. https://doi.org/10.1145/3722544

4. Skripko, N. (2025). IFEval-FC: Instruction-Following Evaluation in Function Calling for Large Language Models. arXiv:2509.18420. https://arxiv.org/pdf/2509.18420

5. OpenAI. Function calling. OpenAI API Documentation. https://developers.openai.com/api/docs/guides/function-calling