Text Overlay Detection

Text overlays are widely used for subtitles, credits, watermarks, promotional messages, and explanatory labels. There are many use cases for detecting or removing text overlays: avoiding burn-in text when training image and video generation models, supplying clean content for ad creatives, removing burn-in text from the inputs to diffing algorithms, and creating paired data for title treatment and other text generation tasks.

This model was trained on 2k pairs of data, sampled using a VLM as a weakly supervised classifier and then manually annotated. The published model uses a DINOv2 with Registers backbone and a modified preprocessor that removes center cropping (text overlays are often in the corners of images!).
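To illustrate why center cropping matters for corner overlays, here is a minimal sketch in plain Python. It assumes a typical preprocessing pipeline that resizes the shortest edge to 256 pixels and then center-crops to 224x224; those values are an assumption, not something this card states. The function names and the example watermark box are hypothetical.

```python
def kept_region(width, height, shortest_edge=256, crop=224):
    """Region of the original image (left, top, right, bottom) that survives
    a shortest-edge resize followed by a square center crop.

    The defaults (256, 224) are assumed, not taken from this model card.
    """
    scale = shortest_edge / min(width, height)  # resize factor
    side = crop / scale                         # crop side in original pixels
    left = (width - side) / 2
    top = (height - side) / 2
    return (left, top, left + side, top + side)


def is_inside(box, region):
    """True if box (left, top, right, bottom) lies fully within region."""
    return (box[0] >= region[0] and box[1] >= region[1]
            and box[2] <= region[2] and box[3] <= region[3])


# Hypothetical 1920x1080 frame with a watermark in the bottom-right corner.
region = kept_region(1920, 1080)
watermark = (1600, 980, 1900, 1060)
print(region)
print(is_inside(watermark, region))  # False: the corner watermark is cropped away
```

Under these assumptions only the central square of a widescreen frame would reach the model, which is consistent with the choice to drop center cropping in the preprocessor.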

How To Use

import torch
from PIL import Image
from transformers import AutoImageProcessor
from transformers import AutoModelForImageClassification

image_processor = AutoImageProcessor.from_pretrained("aslakey/text_overlay_detection")
model = AutoModelForImageClassification.from_pretrained('aslakey/text_overlay_detection')
model.eval()

# Model labels (as reported under Model Performance): no_text_overlay, text_overlay
image = Image.open('overlay.png')
inputs = image_processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predicted_label = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
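If a confidence score is more useful than a bare argmax, the logits can be passed through a softmax and compared against a threshold. The sketch below uses plain Python on a list of logits (for example `outputs.logits[0].tolist()`). The helper names, the label order in `id2label`, the example logits, and the threshold are all illustrative assumptions; in practice the mapping should come from `model.config.id2label`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, id2label, threshold=0.5):
    """Return (label, probability, is_confident) for one image's logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best], probs[best] >= threshold


# Assumed label order; in practice use model.config.id2label.
id2label = {0: "no_text_overlay", 1: "text_overlay"}
# Made-up logits; in practice use outputs.logits[0].tolist().
label, prob, confident = classify([-1.2, 2.3], id2label, threshold=0.9)
print(label, round(prob, 3), confident)
```

A higher threshold trades recall for precision, which may suit pipelines where wrongly discarding clean images is costly.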

Model Performance

Class            Precision  Recall  F1-score
no_text_overlay  0.97       0.99    0.98
text_overlay     0.99       0.97    0.98
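As a quick consistency check, the F1-score is the harmonic mean of precision and recall, so the last column can be recomputed from the first two. This is a small sketch; the function name `f1_score` is chosen here for illustration and is not part of the model's code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
reported = {
    "no_text_overlay": (0.97, 0.99),
    "text_overlay": (0.99, 0.97),
}
for name, (p, r) in reported.items():
    print(name, round(f1_score(p, r), 2))  # 0.98 for both classes
```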
Model size: 0.3B params (Safetensors, tensor type F32)