Text Overlay Detection

Text overlays are widely used for subtitles, credits, watermarks, promotional messages, and explanatory labels. There are many use cases for detecting or removing them: keeping burned-in text out of the training data for image and video generation models, supplying clean content for ad creatives, keeping burned-in text out of the inputs to diffing algorithms, and creating paired data for title treatment and other text generation tasks.

This model was trained on 2k pairs of data sampled using a VLM as a weakly supervised classifier; the 2k samples were then manually annotated. The published model uses a DINOv2 with Registers backbone and a modified preprocessor that removes center cropping (text overlays are often in the corners of images!).
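To see why center cropping matters here, the sketch below estimates how much of a frame survives a resize-then-center-crop step. The function name is hypothetical, and the default sizes (shortest edge resized to 256, then a 224x224 crop) are assumptions about a typical classification preprocessor, not values confirmed for this model.

```python
def center_crop_retained_fraction(width, height, shortest_edge=256, crop=224):
    """Fraction of the image area kept after resize + center crop.

    Assumed pipeline (illustrative only): scale the image so its shortest
    side equals `shortest_edge`, then keep a centered `crop` x `crop` window.
    """
    scale = shortest_edge / min(width, height)
    resized_w = width * scale
    resized_h = height * scale
    # The crop cannot keep more than the resized image actually contains.
    kept_w = min(crop, resized_w)
    kept_h = min(crop, resized_h)
    return (kept_w * kept_h) / (resized_w * resized_h)


# A 16:9 frame loses more than half of its area, and all four corners.
widescreen = center_crop_retained_fraction(1920, 1080)
square = center_crop_retained_fraction(1000, 1000)
print(f"16:9 frame keeps {widescreen:.0%} of its area")
print(f"square frame keeps {square:.0%} of its area")
```

Under these assumed sizes, the discarded region is exactly the border and corners, which is where subtitles, watermarks, and credits tend to appear.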

How To Use

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

image_processor = AutoImageProcessor.from_pretrained("aslakey/text_overlay_detection")
model = AutoModelForImageClassification.from_pretrained("aslakey/text_overlay_detection")
model.eval()

# Model labels: no_text_overlay, text_overlay (see Model Performance)
# Convert to RGB so PNGs with an alpha channel or palette are handled.
image = Image.open("overlay.png").convert("RGB")
inputs = image_processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predicted_label = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
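If a confidence score is more useful than a hard label (for example, when filtering a dataset), the logits can be turned into probabilities with a softmax. The following is a minimal sketch using only the standard library; the helper name, the threshold, the example logits, and the index order in `id2label` are all assumptions for illustration. In practice, take the mapping from `model.config.id2label` and the logits from `outputs.logits[0].tolist()`.

```python
import math


def classify_from_logits(logits, id2label, threshold=0.5,
                         positive_label="text_overlay"):
    """Return (per-label probabilities, whether the image is flagged).

    Applies a numerically stable softmax to the raw logits, then flags the
    image when the probability of `positive_label` reaches `threshold`.
    """
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = {id2label[i]: e / total for i, e in enumerate(exps)}
    return probs, probs[positive_label] >= threshold


# Hypothetical mapping and logits; the real ones come from the model.
example_id2label = {0: "no_text_overlay", 1: "text_overlay"}
example_logits = [2.0, -1.0]

probs, has_overlay = classify_from_logits(example_logits, example_id2label)
print(probs, has_overlay)
```

Raising `threshold` trades recall for precision on the `text_overlay` class, which may suit pipelines where discarding a clean image is costly.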

Model Performance

Class	Precision	Recall	F1-score
no_text_overlay	0.97	0.99	0.98
text_overlay	0.99	0.97	0.98
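As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the reported precision and recall; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
reported = {
    "no_text_overlay": (0.97, 0.99),
    "text_overlay": (0.99, 0.97),
}

for label, (precision, recall) in reported.items():
    print(label, round(f1_score(precision, recall), 2))
# Both classes round to 0.98, matching the F1-score column.
```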

Model size: 0.3B params (Safetensors, tensor type F32)