Expected Processor Behavior

#9
by fhenkel - opened

I have a question about the expected output of the processor for ShieldGemma2.
When I run a minimal example, similar to the one on the model card page, I see the following:

import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/shieldgemma-2-4b-it")

# Download the test image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build the prompt for a single policy and decode it back to text for inspection
inputs = processor(images=[image], policies=["dangerous"], return_tensors="pt")
print(processor.batch_decode(inputs['input_ids'])[0])

Output (image tokens truncated for readability):

<bos><bos><start_of_turn>user


<start_of_image><image_soft_token>...<end_of_image>

<start_of_turn>user
You are a policy expert trying to help determine whether a generated or real image is in
                 violation of the safety policy below. Your decision must very clearly follow the specific
                 language used in the safety policy and should not include any interpretation outside of the
                 safety policy.


                <BEGIN_SAFETY_POLICY>

                No Dangerous Content: The image shall not contain content that facilitates or encourages activities that could cause real-world harm (e.g., building firearms and explosive devices, promotion of terrorism, instructions for suicide).

                </END_SAFETY_POLICY>


                Does the image violate the above policy? Your Answer must start with 'Yes' or 'No'.
                <end_of_turn>
<start_of_turn>model

The decoded prompt contains two <bos> tokens. In addition, the first <start_of_turn> is never closed with an <end_of_turn>, yet a second <start_of_turn> is opened after <end_of_image>. Is this behavior expected, and was the model originally trained with this exact template?
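To check this without reading the decoded string by eye, the markers can be counted programmatically. The following is only a minimal sketch: the helper names `count_markers` and `unclosed_turns` are made up for illustration and are not part of transformers, and `sample` is a shortened stand-in for the decoded output above (policy text abbreviated, image tokens reduced to one).

```python
import re

# Special markers that appear in the decoded prompt above.
MARKERS = ["<bos>", "<start_of_turn>", "<end_of_turn>", "<start_of_image>", "<end_of_image>"]


def count_markers(decoded: str) -> dict:
    """Return how often each special marker occurs in the decoded prompt."""
    return {marker: decoded.count(marker) for marker in MARKERS}


def unclosed_turns(decoded: str) -> int:
    """Count turns that are still open when the next <start_of_turn> appears.

    A turn left open at the very end (the generation prompt) is not counted.
    """
    open_turn = False
    unclosed = 0
    for match in re.finditer(r"<start_of_turn>|<end_of_turn>", decoded):
        if match.group() == "<start_of_turn>":
            if open_turn:
                unclosed += 1
            open_turn = True
        else:
            open_turn = False
    return unclosed


# Hypothetical, shortened stand-in for the decoded prompt shown above.
sample = (
    "<bos><bos><start_of_turn>user\n\n\n"
    "<start_of_image><image_soft_token><end_of_image>\n\n"
    "<start_of_turn>user\n"
    "You are a policy expert ...<end_of_turn>\n"
    "<start_of_turn>model\n"
)

print(count_markers(sample))
print(unclosed_turns(sample))
```

With the real processor, `processor.batch_decode(inputs['input_ids'])[0]` from the snippet above can be passed in place of `sample`.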
