What is FunctionGemma 270M
To quote HuggingFace🤗 and Google Blog, FunctionGemma is a lightweight, open model from Google, built as a foundation for creating your own specialized function calling models. It’s finetuned on top of the Gemma 3 270M model.
As a contributor, specifically on the RL training side, I think I should note down my learnings here for the benefit of myself and hopefully others. For a more complete picture, definitely stay tuned for the official technical report.
Data
Generating high-quality synthetic data
We finetuned the Gemma 270M model on two scenarios: making function calls, and generating natural language responses after receiving function responses. We used an internal synthetic data pipeline to generate examples for both scenarios using Gemini 2.5 Pro. 2.5 Pro is not perfect either, as it regularly makes formatting errors that render the output impossible to parse. We simply over-sample the required number of examples and discard outputs that can’t be parsed into the desired function call and response format.
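To make the filtering step concrete, here’s a rough sketch. For illustration I’m assuming the function calls come back as JSON objects with `name` and `args` fields, and `generate` stands in for the actual Gemini 2.5 Pro call; neither matches the internal pipeline exactly.

```python
import json
from typing import Callable, Optional

def try_parse_call(raw: str) -> Optional[dict]:
    """Return a parsed function call dict, or None if the output is malformed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Require the minimal structure expected downstream; the real checks are stricter.
    if isinstance(call, dict) and "name" in call and "args" in call:
        return call
    return None

def collect_examples(generate: Callable[[], str], target: int, max_attempts: int) -> list:
    """Over-sample generations and keep only outputs that parse into the desired format."""
    kept = []
    attempts = 0
    while len(kept) < target and attempts < max_attempts:
        attempts += 1
        parsed = try_parse_call(generate())  # generate() would wrap a Gemini 2.5 Pro call
        if parsed is not None:
            kept.append(parsed)
    return kept
```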
Data diversity is also important. To accomplish that, we configured the synthetic data pipeline’s prompt with different personalities and randomly sample one personality for each example’s generation. We also collected function declarations from multiple sources:
- We scraped the Web for MCP function prototypes. However, these scraped function prototypes are usually of lower quality.
- We collected function prototypes from public datasets such as ToolACE and xLAM.
- We synthetically generated some function prototypes targeting cases like smart home actions.
We randomly sample a few of these function declarations in each example’s generation.
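Putting the two sampling steps together, the per-example prompt assembly roughly looks like the sketch below. The personality strings, template fields, and helper name are all illustrative, not the pipeline’s actual configuration.

```python
import random

# Illustrative personalities; the real pipeline uses its own set.
PERSONALITIES = ["concise and direct", "chatty and casual", "formal and detailed"]

def build_generation_prompt(template: str, function_pool: list, n_functions: int = 4) -> str:
    """Assemble one data-generation prompt with a random personality and a few
    randomly sampled function declarations."""
    personality = random.choice(PERSONALITIES)
    declarations = random.sample(function_pool, k=min(n_functions, len(function_pool)))
    return template.format(personality=personality, declarations="\n".join(declarations))
```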
Reusing hard SFT examples
Following the idea of curriculum learning, we want the model to progress from simple examples to complex ones to ultimately learn better. While we didn’t have enough time to modify the data pipeline for this, we still wanted to introduce some hard examples into the RL data mixture so that the model can hone its skills on them.
We built a pipeline that takes every example from the SFT data mixture, strips off its final trainable section, and runs inference using the SFT-ed model (which is also the starting checkpoint for RL). We took 4 nucleus-sampled model responses and compared them to the stripped trainable section, which serves as the correctness reference.
- For natural language responses, we simply check whether the model output natural language as opposed to a function call.
- For function calls, we use the function call reward function used in the RL setup to score its correctness ranging from 0 to 1. This is a program-based reward and does not use a critique model.
The correctness scores across the 4 responses were averaged, and we used the average to sort each example into one of four buckets:
- hard NL examples are those where the model fails to generate a natural language response in at least 2 of the 4 samples
- hard FC examples are those with average correctness scores in [0.4, 0.8)
- extra hard FC examples are those with average correctness scores in [0.0, 0.4)
- the rest are considered easy examples
We stored the first 3 buckets of examples in different files and added them to the RL mixture with different weights. Extra hard examples were weighted down to prevent the model from ending up learning nothing.
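For the curious, here’s a small sketch of the bucketing logic described above. `fc_reward` is the program-based reward mentioned earlier, and the `<function_call>` prefix is just a placeholder for FunctionGemma’s actual FC control token.

```python
def looks_like_function_call(response: str) -> bool:
    # Placeholder check; the real pipeline looks for FunctionGemma's FC control token.
    return response.lstrip().startswith("<function_call>")

def bucket_example(example_type: str, responses: list, reference: str, fc_reward) -> str:
    """Sort one SFT example into a difficulty bucket based on 4 sampled responses.

    example_type: "nl" for natural-language targets, "fc" for function-call targets.
    fc_reward(response, reference) returns a correctness score in [0, 1].
    """
    if example_type == "nl":
        # Hard NL: the model produced a function call instead of natural language
        # in at least 2 of the 4 samples.
        failures = sum(1 for r in responses if looks_like_function_call(r))
        return "hard_nl" if failures >= 2 else "easy"

    avg = sum(fc_reward(r, reference) for r in responses) / len(responses)
    if avg >= 0.8:
        return "easy"
    if avg >= 0.4:
        return "hard_fc"
    return "extra_hard_fc"
```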
RL
Multi-step RL didn’t work
I tried using the internal multi-turn gym environment on the FunctionGemma model. I wrote a pipeline to evaluate the number of functions needed for each example in the internal multi-turn dataset using Gemini 2.5 Flash, and only selected ~100 examples that require 5 or fewer unique functions.
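The downsampling itself is just a filter; `count_unique_functions` here is a hypothetical callable wrapping the Gemini 2.5 Flash judgement.

```python
def select_simple_examples(examples: list, count_unique_functions,
                           max_functions: int = 5, limit: int = 100) -> list:
    """Keep examples that need at most `max_functions` unique functions to solve."""
    selected = [ex for ex in examples if count_unique_functions(ex) <= max_functions]
    return selected[:limit]
```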
However, this downsampled dataset proved to still be too hard for FunctionGemma to learn from. The model can’t solve any example that requires more than 1 step of function calling, and it gives up when the gym environment returns an error in the function response. Interestingly, some wins are actually judged incorrectly: the verification code passes even though the underlying environment hasn’t changed at all. For example, the model may call a function with incorrect parameters and then give up (generating a natural language response like “I’m sorry, I couldn’t complete the task”), yet the verifier still returns correct.
I think the reasons for this are:
- FunctionGemma’s SFT data does not include multi-step cases, and the model didn’t generalize its single-step and multi-turn learnings to multi-step scenarios
- FunctionGemma is not trained with thinking mode and couldn’t plan the task execution flow like the Gemini models.
Multi-turn RL somewhat worked
After the failed multi-step RL trial runs, we shifted focus to a multi-turn RL setup. We specifically wanted to focus on a clarification scenario where the user doesn’t give complete instructions for calling a function. For example, some of the required parameters may be missing from the initial user request, and the base FunctionGemma model often hallucinates a value. We want to discourage this behavior; the model should instead ask the user for clarification before making the correct function call.
Synthetic data
We reused the synthetic data pipeline to generate over 10k such examples. Specifically, each example includes:
- a function declaration with at least 2 parameters
- a user goal that contains the complete instruction and scenario to use the aforementioned function
- the reference function call
- an initial user prompt that’s missing some information
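Here’s a hypothetical record just to make the shape concrete. The field names and the smart-home function are mine, not the actual schema.

```python
example = {
    "function_declaration": {
        "name": "set_thermostat",
        "parameters": {
            "room": {"type": "string", "required": True},
            "temperature_c": {"type": "number", "required": True},
        },
    },
    "user_goal": "Set the living room thermostat to 21 degrees Celsius.",
    "reference_call": {
        "name": "set_thermostat",
        "args": {"room": "living room", "temperature_c": 21},
    },
    # Missing both required parameters, so the model should ask for clarification.
    "initial_prompt": "Can you adjust my thermostat?",
}
```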
Multi-turn Experiments
The training setup is based on the internal multi-turn environment. Specifically, I used Gemini 2.5 Pro to pose as the user in my user simulator. For each example, the user simulator is given the example’s user goal and a randomly selected personality trait. I created the 3 personality traits with the help of Gemini 2.5 Pro 😉 as well, and they consist of a reasonable user, a vague user, and an impatient user. This improves user diversity and should help the model better deal with real-world usage. Each conversation is capped at 5 turns, and the same FC reward function used in the regular single-turn RL setup scores the model’s final response.
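The rollout loop roughly looks like the sketch below. `policy`, `user_sim`, and `fc_reward` are hypothetical stand-ins for the FunctionGemma checkpoint, the Gemini 2.5 Pro user simulator, and the program-based FC reward; the control-token check is also a placeholder.

```python
import random

USER_TRAITS = ["a reasonable user", "a vague user", "an impatient user"]
MAX_TURNS = 5

def is_function_call(response: str) -> bool:
    # Placeholder for the real FC control-token check.
    return response.lstrip().startswith("<function_call>")

def run_episode(policy, user_sim, example, fc_reward) -> float:
    """Roll out one clarification conversation and score the model's final response."""
    trait = random.choice(USER_TRAITS)
    history = [("user", example["initial_prompt"])]
    response = ""
    for _ in range(MAX_TURNS):
        response = policy(history)
        history.append(("model", response))
        if is_function_call(response):  # the episode ends once the model commits to a call
            break
        # Otherwise the simulated user (given the full goal and a personality) replies.
        history.append(("user", user_sim(example["user_goal"], trait, history)))
    return fc_reward(response, example["reference_call"])
```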
In the 1st run, the model doesn’t seem to learn at all and the accuracy hovers below ~10%. I read the sampled responses and found that the model learned to hack my user simulator by generating a control token different from the standard FC control token, followed by the actual function call. Since my user simulator only terminates early when the standard FC control token is seen, this behavior does not lead to early termination and the response is fed to Gemini 2.5 Pro. Since Gemini 2.5 Pro is probably trained on neither malformed function calls nor function calls without subsequent function responses, it behaves strangely as well and starts critiquing the response, revealing some information from the user goal. I quickly modified the user simulator to terminate early when any control token is seen.
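The fix amounts to broadening the termination check, something along these lines (the token strings here are placeholders; the real list comes from FunctionGemma’s tokenizer):

```python
# Placeholder control tokens; the actual strings come from FunctionGemma's tokenizer.
CONTROL_TOKENS = ("<function_call>", "<other_control_token_1>", "<other_control_token_2>")

def should_terminate(response: str) -> bool:
    """Terminate the episode as soon as *any* control token appears.

    Checking only the standard FC token let the model smuggle a function call past
    the user simulator behind a different control token.
    """
    return any(token in response for token in CONTROL_TOKENS)
```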
In the 2nd run, the model seems to learn slowly with the more robust user simulator, and I discovered that the model is either hallucinating values or apologizing for not being able to complete the request. Some of the hallucinated values are actually correct. For example, a crop-image function is provided and the user wants to crop an image to a width of 1920 pixels; the model correctly hallucinated that the height should be 1080 pixels. However, we think this behavior should be discouraged since it’s still a hallucination.
So for the 3rd run, we tweaked the reward function slightly to give the model a bad reward if it makes a function call in the first turn; that is, the model is forced to make at least 1 clarification turn before making a function call. In just 20 steps, the model learned to ask the user simulator for the missing information and generate the correct function call. The accuracy jumped from 0 to ~25% during the first 20 steps and gradually improved to ~40% when training stopped at 200 steps. The metrics and sampled conversations all looked great and reasonable to me!
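Conceptually, the tweak looks like this sketch. The zero reward for a first-turn call is illustrative, and the FC check and `fc_reward` are the same hypothetical helpers as in the rollout sketch.

```python
def clarification_reward(conversation: list, reference_call, fc_reward) -> float:
    """Score the final response, but penalize episodes that call a function on turn 1.

    conversation is a list of (role, text) pairs in order; fc_reward is the usual
    program-based function-call reward in [0, 1].
    """
    model_turns = [text for role, text in conversation if role == "model"]
    if not model_turns:
        return 0.0
    # Placeholder FC check, standing in for the real control-token detection.
    if model_turns[0].lstrip().startswith("<function_call>"):
        return 0.0  # bad reward: the model skipped the clarification turn
    return fc_reward(model_turns[-1], reference_call)
```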
However, when I actually served the model and vibe tested it, I realized that it’s a horrible model to use. The model does not produce function calls even when the complete information is present in the user prompt, and no matter what information is given, it always tries to ask for clarification on every parameter in the function declaration. It was already this bad after just 50 steps of training!
Ultimately, we didn’t have enough time to polish this setup before the model launch, so this feature is not included in the launched model. But I think the last issue would still be an easy fix: just combine single-turn RL with multi-turn RL. The multi-turn RL environment can be customized to terminate early when the example only requires a single turn. This should combat the forgetting behavior seen in the 3rd run.
Overall, my learning points in this multi-turn journey are:
- Small models tend to forget more easily
- Reward hacking is hard to avoid and requires a thoughtful reward design.
- Vibe testing tells more about the model than the training metrics.
Hinted search improves sample efficiency
Since we normalize the rewards across each example prompt’s samples, when all samples receive identical rewards (either all correct or all incorrect), the normalized rewards have zero variance and those samples are filtered out. This paper describes this zero-variance filtering technique. However, simply filtering out samples decreases sample efficiency, and for a small model like FunctionGemma, mistakes happen regularly. We noticed that sample efficiency hovers at around 40% at the start of RL training and gradually goes up as training continues.
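In code, the per-prompt normalization and filtering reduces to something like this minimal sketch (not the internal trainer’s implementation):

```python
import numpy as np

def normalize_or_filter(rewards: np.ndarray, eps: float = 1e-6):
    """Normalize one prompt group's rewards; drop the group if variance is zero.

    rewards has shape (num_samples,). If every sample got the same reward (all
    correct or all incorrect), the normalized advantages carry no learning signal.
    """
    std = rewards.std()
    if std < eps:
        return None  # filtered out: zero variance across samples
    return (rewards - rewards.mean()) / std
```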
To improve sample efficiency at the start, my colleagues devised a hinted search technique, which somewhat blends SFT and RL. Hinted search kicks in when all samples are incorrect. There are 2 levels of hints:
- In the prompt, we inject a short hint on what function to call or that the model should make a natural language response. We resample the responses using the updated prompt.
- We directly replace one of the sampled responses with the reference response.
The “hinted” responses are scored with the original prompt to get the per-token scores used for importance sampling and other parts of the core RL algorithm. The effect is immediate: sample efficiency starts at ~80% during the early stages of training, and the model seems to learn more quickly.
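Here’s my loose sketch of the flow. The helper names are mine, and details like exactly when each hint level kicks in and how the importance-sampling correction is applied belong to the internal trainer.

```python
def hinted_search(prompt: str, reference: str, sample_fn, fc_reward, hint: str, n: int = 4):
    """Two-level hinted search for one prompt whose samples may all score zero.

    sample_fn(prompt, n) draws n model responses; fc_reward(response, reference)
    returns a correctness score in [0, 1].
    """
    responses = sample_fn(prompt, n)
    rewards = [fc_reward(r, reference) for r in responses]
    if max(rewards) > 0:
        return responses, rewards  # at least one informative sample; no hint needed

    # Level 1: inject a short hint into the prompt and resample.
    responses = sample_fn(prompt + "\n\nHint: " + hint, n)
    rewards = [fc_reward(r, reference) for r in responses]
    if max(rewards) > 0:
        # The hinted responses are still scored against the ORIGINAL prompt when
        # computing the per-token scores used for importance sampling.
        return responses, rewards

    # Level 2: replace one sampled response with the reference response outright.
    responses[0] = reference
    rewards[0] = fc_reward(reference, reference)
    return responses, rewards
```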