Text Overlay Detection
Text overlays are widely used for subtitles, credits, watermarks, promotional messages, and explanatory labels. There are many use cases for detecting or removing text overlays: avoiding burn-in text when training image and video generation models, supplying clean content for ad creatives, keeping burn-in text out of diffing algorithms, and creating paired data for title treatment and other text generation tasks.
This model was trained on 2k pairs of data that were sampled using a VLM as a weakly supervised classifier and then manually annotated. The published model uses a DINOv2 with Registers backbone and a modified preprocessor that removes center cropping (text overlays are often in the corners of images!).
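To illustrate why center cropping matters here, the sketch below estimates how much of an image survives a "resize the shortest edge, then center crop" pipeline. The values 256 and 224 are assumptions chosen to resemble common ViT-style defaults, not the settings of this model's preprocessor, and the function name is hypothetical.

```python
def center_crop_kept_fraction(width, height, shortest_edge=256, crop=224):
    """Return the (horizontal, vertical) fraction of the image that survives
    resizing the shortest edge to `shortest_edge` and center cropping to
    `crop` x `crop`. Both defaults are illustrative assumptions."""
    scale = shortest_edge / min(width, height)
    resized_w = width * scale
    resized_h = height * scale
    # The crop cannot keep more than the whole resized image.
    kept_w = min(crop, resized_w) / resized_w
    kept_h = min(crop, resized_h) / resized_h
    return kept_w, kept_h


# A 1920x1080 video frame: less than half of the width is kept,
# so anything in the left or right margins (corner watermarks, logos) is lost.
kept_w, kept_h = center_crop_kept_fraction(1920, 1080)
print(f"width kept: {kept_w:.1%}, height kept: {kept_h:.1%}")
```

Under these assumed settings, a corner watermark on a widescreen frame would never reach the model, which is consistent with the decision to drop center cropping.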
How To Use
import torch
from PIL import Image
from transformers import AutoImageProcessor
from transformers import AutoModelForImageClassification
image_processor = AutoImageProcessor.from_pretrained("aslakey/text_overlay_detection")
model = AutoModelForImageClassification.from_pretrained('aslakey/text_overlay_detection')
model.eval()
# Model labels (class names as listed under Model Performance): no_text_overlay, text_overlay
image = Image.open('overlay.png').convert('RGB')  # force 3 channels; PNGs may carry an alpha channel
inputs = image_processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predicted_label = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
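The snippet above prints only the top label. If a confidence score is also useful, for example to apply a custom threshold, the logits can be passed through a softmax. The following is a minimal, dependency-free sketch; the logit values and the label order in `example_id2label` are hypothetical and stand in for `outputs.logits` and `model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, id2label):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical values for illustration only; the index-to-label order is an assumption.
example_logits = [-1.2, 2.3]
example_id2label = {0: "no_text_overlay", 1: "text_overlay"}

label, confidence = top_prediction(example_logits, example_id2label)
print(label, round(confidence, 3))
```

With the real model, `outputs.logits[0].tolist()` and `model.config.id2label` would take the place of the example values.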
Model Performance
| Class | Precision | Recall | F1-score |
|---|---|---|---|
| no_text_overlay | 0.97 | 0.99 | 0.98 |
| text_overlay | 0.99 | 0.97 | 0.98 |
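As a quick consistency check, the F1-score is the harmonic mean of precision and recall, so the last column can be recomputed from the other two. The sketch below does this for both rows; the function name is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall values taken from the table above.
rows = {
    "no_text_overlay": (0.97, 0.99),
    "text_overlay": (0.99, 0.97),
}

for name, (precision, recall) in rows.items():
    print(name, round(f1_score(precision, recall), 2))
```

Both rows round to 0.98, matching the reported F1-scores.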