VQASynth preference dataset: https://huggingface.co/datasets/remyxai/mhpd-dpo-v0
Salma Mayorquin PRO
salma-remyx
AI & ML interests
None yet
Recent Activity
liked a dataset about 5 hours ago
remyxai/mhpd-dpo-v0
replied to their post about 5 hours ago
updated a dataset about 7 hours ago
remyxai/mhpd-dpo-v0
Organizations
replied to their post about 5 hours ago
posted an update about 9 hours ago
Post
2350
The space of possible improvements for your AI model is large while evaluation is costly.
So I was excited to discover the ICML 2026 paper from Kobalczyk, Lin, Letham, Zhao, Balandat, and Bakshy titled "LILO: Bayesian Optimization with Natural Language Feedback."
The method learns efficiently from expert preferences, balancing exploration and exploitation in a principled way with Bayesian Optimization for expensive-to-evaluate black-box objectives.
Experimenting with the technique, I trained a Gaussian Process proxy model on the implicit preferences in my code repo's commit history at VQASynth.
The result: I used the model's preference scores to re-rank candidate papers recommended based on my interests in spatial reasoning and multimodal data synthesis.
Semantic relevance is a high-recall method for finding arXiv papers personalized to your interests. Adding contributor preferences, extracted from the merge history of your code, offers a high-precision filter.
So what's next? I'm using the model to synthesize a larger volume of preference data to finetune an open-weight coding model with DPO and LoRA.
Tuning Coding Agents via Implicit Preference Distillation
arXiv: https://arxiv.org/pdf/2510.17671
Substack: https://remyxai.substack.com/p/lilo-and-myx
VQASynth: https://github.com/remyxai/VQASynth
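The recall-then-precision idea in this post can be sketched in a few lines. This is a minimal illustration, not the pipeline described above: the `Candidate` class, the `rerank` function, the `top_k_recall` and `weight` parameters, and all scores are hypothetical placeholders, and the preference score is hard-coded rather than produced by a Gaussian Process trained on commit history.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    relevance: float   # assumed: semantic similarity to stated interests, in [0, 1]
    preference: float  # assumed: output of some preference model, in [0, 1]


def rerank(candidates, top_k_recall=3, weight=0.5):
    """Two-stage ranking: shortlist by relevance, then reorder by a blended score."""
    # Stage 1 (high recall): keep the most semantically relevant candidates.
    shortlist = sorted(candidates, key=lambda c: c.relevance, reverse=True)[:top_k_recall]

    # Stage 2 (high precision): reorder the shortlist using the preference score.
    def blended(c):
        return (1 - weight) * c.relevance + weight * c.preference

    return sorted(shortlist, key=blended, reverse=True)


# Hypothetical candidates with made-up scores.
papers = [
    Candidate("Paper A", relevance=0.90, preference=0.20),
    Candidate("Paper B", relevance=0.80, preference=0.90),
    Candidate("Paper C", relevance=0.70, preference=0.60),
    Candidate("Paper D", relevance=0.30, preference=0.95),
]
ranked = rerank(papers)
print([c.title for c in ranked])
```

Paper D has the highest preference score but never reaches the second stage, because the relevance shortlist filters it out first. Whether that is desirable depends on how much you trust each signal.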
posted an update 11 days ago
Post
3533
VQASynth is the open-source implementation of the paper "SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities" (2401.12168), putting together the data synthesis pipeline behind remyxai/SpaceQwen2.5-VL-3B-Instruct, remyxai/SpaceThinker-Qwen2.5VL-3B, and several other spatial reasoning models we've shared here on HF.
From early development through production, different categories of evidence become available to guide what to try next. The strongest decisions combine evidence across categories rather than relying on any one.
Stage 1: Development history
Commit history holds the moments where things changed. For VQASynth, that's how scenes get parsed, how captions get generated, how spatial relations get encoded. Even before a model is in production, those milestones are a strong signal for what methods are semantically relevant to where the system is now.
Stage 2: Observational outcomes
Once a model is serving, the same commit history delineates changes against real-world results. That opens up quasi-experiments. You get causal evidence about which changes drove which outcomes, and inference on questions you haven't directly tested.
Stage 3: Controlled experiments
When teams start running interventions, those outcomes tighten the estimates further. This is the regime most people associate with rigor, but it's expensive and gated by traffic.
Stage 4: Counterfactual perturbations
When A/B testing becomes the operational bottleneck, instrumenting decision points in the production system lets you probe what would have happened under alternative choices. Shadow mode first, live traffic once audits pass.
Experimentation maturity is a journey, and every stage offers something to learn from.
More on these ideas: https://docs.remyx.ai/concepts/maturity-progression
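As a rough illustration of Stage 2, commit timestamps can split a production metric into before and after windows. The sketch below is a deliberately naive before/after comparison, not a full quasi-experimental design: it ignores confounders, seasonality, and noise. The function name, the daily metric values, and the change days are all hypothetical.

```python
from statistics import mean


def before_after_effects(metric_by_day, change_days):
    """Naive shift in a daily metric at each change point.

    Each change is compared against the window since the previous change
    and up to the next one. Every window must contain at least one day.
    """
    days = sorted(change_days)
    boundaries = [0] + days + [len(metric_by_day)]
    effects = {}
    for i, day in enumerate(days, start=1):
        before = metric_by_day[boundaries[i - 1]:day]
        after = metric_by_day[day:boundaries[i + 1]]
        effects[day] = mean(after) - mean(before)
    return effects


# Hypothetical daily accuracy, with commits deployed on day 3 and day 6.
daily_accuracy = [0.70, 0.71, 0.69, 0.75, 0.76, 0.74, 0.73, 0.72]
effects = before_after_effects(daily_accuracy, change_days=[3, 6])
for day, delta in effects.items():
    print(f"change on day {day}: {delta:+.3f}")
```

A real analysis would add a comparison series or a regression adjustment before reading these shifts as causal.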
replied to their post 12 days ago
Read more about this on Substack: https://remyxai.substack.com/p/the-bottleneck-upstream
posted an update 15 days ago
Post
5097
SciCrafter measured something AI practitioners have intuited: frontier agents are improving at executing inside well-framed problems, but lag at framing the problem in the first place.
GPT-5.2, Gemini-3-Pro, and Claude Opus 4.5 all plateaued near 26% on a new Minecraft benchmark for probing AI capabilities in the discovery-to-application loop.
So the authors ran targeted interventions:
* Hints about what to investigate doubled performance.
* A structured experimentation template added 7-14 more points.
* Structured consolidation beat free-form summaries by 6 points.
* Curriculum context beat independent task-solving.
These interventions helped the agent frame what's worth investigating, and structure what gets learned so it compounds. The bottleneck for AI in scientific workflows is upstream of execution.
Their findings are congruent with the design patterns we've adopted at Remyx AI to help AI teams close the development loop scientifically.
Agents work well inside structured loops, but they perform poorly when tasked with creating the structure. Instrumenting your scientific workflows offers greater leverage than scaling compute with a less informed search.
In the work of building production AI systems, teams are flying through execution. The bigger challenge is identifying which experiments moved which production outcome, or what to try next.
One of the more interesting results I found this week by tracking work in AI for scientific workflows using Remyx: https://engine.remyx.ai/papers/d8f23b9b-b14b-4ada-b44e-ccfc221c06b4
posted an update 30 days ago
Post
2184
Some ask how we can recommend recent advances for improving your AI system.
We tell them "The code is the context."
Here's a demo showing how to get started with paper recommendations by connecting to a repo URL.
Your repo history describes what you've tried and what you're working on, so we ground suggested ideas in your actual development trajectory from day one.
What the loop looks like end to end:
* New ideas are sourced from arXiv papers and GitHub repos, ranked by relevance to your codebase.
* With a click, you get implementations scaffolded as feature branches.
* Validation jobs provision compute on Modal so you can measure the change against your baseline.
* Results are synced across the tools your team already uses.
Try it on your repo: https://engine.remyx.ai
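To make "ranked by relevance to your codebase" concrete, here is a toy sketch that scores paper abstracts against text drawn from a repo using bag-of-words cosine similarity. It is an assumption-laden stand-in: the post does not say how Remyx computes relevance, and the function names, repo text, and abstracts below are invented for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_relevance(repo_text, abstracts):
    """Return paper titles ordered from most to least similar to the repo text."""
    repo_vec = tokens(repo_text)
    scored = [(cosine(repo_vec, tokens(text)), title) for title, text in abstracts.items()]
    return [title for _, title in sorted(scored, reverse=True)]


# Hypothetical inputs: a README snippet and two made-up abstracts.
repo_text = "spatial reasoning dataset synthesis for vision language models"
abstracts = {
    "Spatial VQA": "synthesizing spatial reasoning data for vision language models",
    "Protein folding": "predicting protein structure from amino acid sequences",
}
ranked_titles = rank_by_relevance(repo_text, abstracts)
print(ranked_titles)
```

A production system would more plausibly use learned embeddings and commit history rather than raw word overlap, but the ranking step has the same shape.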
posted an update about 1 month ago
Post
840
AI is a scientific discipline, so it doesn't help that you're context-switching between tools and wrangling scattered data just to run a single experiment. Tickets in one place, branches in another, evals on whatever infra you stood up last time, metrics somewhere else.
Remyx offers one unified experiment view, capturing everything from hypothesis to decision, with every step instrumented and every decision preserved. Every experiment becomes context for the next one.
replied to their post about 1 month ago
Docs are live! Try your own experiment: https://docs.remyx.ai
posted an update about 1 month ago
Post
161
We've been building Remyx to help AI teams track what's actually working across their AI development efforts.
Every experiment you and your team run, from where the approach came from, through implementation and testing, to whether it moved the metric you care about, is tracked in one place. Over time, Remyx spots patterns across your experiments and recommends new approaches worth testing based on what's proven to work.
It connects with the tools you already use (GitHub, Linear, Claude Code, HuggingFace) so experiment context doesn't get lost across six different places.
Full demo vid here: https://youtu.be/XscVmkxTACA
The free dev version is live at https://remyx.ai!
We're looking for feedback from teams actively developing AI applications. If you give it a look, would love to hear what's missing or what would make it more useful for your workflow.
posted an update about 2 months ago
Post
2274
Every change you test in your AI system creates evidence about the space of possible improvements.
With these insights, we can match new methods to the problems you're facing in your application.
Check it out at Remyx: https://remyx.ai
posted an update about 2 months ago
Post
1456
How do you find ideas to try next?
I'm tracking multiple topics tied to the projects we're building at Remyx. Every morning I get a feed of papers ranked by relevance to those topics.
No more good ideas lost because they didn't trend on X.
Build your own feed for free: https://engine.remyx.ai
Read more: https://docs.remyx.ai/resources/explore
posted an update about 2 months ago
Post
3515
We built an OpenClaw skill that sends daily ranked research recommendations to Slack using the Remyx AI CLI.
You define Research Interests (topics, HF models, GitHub repos, blogs, etc.), Remyx ranks new arXiv papers and repos to find the most relevant resources, and an OpenClaw cron job posts a formatted digest to your team's #research channel every weekday morning.
The tutorial covers the full setup end-to-end: installing the CLI, creating interests, connecting OpenClaw to Slack, installing the Remyx skill, and scheduling the cron. About 15 minutes start to finish.
Tutorial: https://docs.remyx.ai/tutorials/daily-research-digest-slack
Would love to hear how folks are tracking research for their projects. If you give this a try, let us know what you think!
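For readers who want to see the shape of the final step, here is a minimal sketch that formats a ranked list into a digest and posts it to Slack. It swaps in a plain Slack incoming webhook in place of the OpenClaw skill and the Remyx CLI covered in the tutorial, so treat it as an illustration only. The `format_digest` and `post_to_slack` names, the paper records, and the webhook URL are all hypothetical.

```python
import json
import urllib.request


def format_digest(papers, limit=3):
    """Build a plain-text digest from an already ranked list of papers."""
    lines = ["*Daily research digest*"]
    for rank, paper in enumerate(papers[:limit], start=1):
        lines.append(f"{rank}. {paper['title']} ({paper['url']})")
    return "\n".join(lines)


def post_to_slack(webhook_url, text):
    """POST a message to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical ranked results; a real run would take these from your ranking tool.
ranked_papers = [
    {"title": "Paper A", "url": "https://example.com/a"},
    {"title": "Paper B", "url": "https://example.com/b"},
]
digest = format_digest(ranked_papers)
print(digest)

# Uncomment with your own webhook URL to send the message:
# post_to_slack("https://hooks.slack.com/services/YOUR/WEBHOOK/URL", digest)
```

Scheduled with a standard cron expression such as `0 9 * * 1-5`, a script like this would run at 09:00 on weekdays. The exact time is an assumption, since the post only says "every weekday morning."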
replied to their post about 2 months ago
posted an update about 2 months ago
Post
3828
Looking to execute on your next great idea?
Search for relevant papers and find pre-built Docker images to interactively explore the code with Remyx!
Check out the new space:
remyxai/remyx-explorer
posted an update 7 months ago
Post
3354
We've built over 10K containerized reproductions of papers from arXiv!
Instead of spending all day trying to build an environment to test that new idea, just pull the Docker container from the Remyx registry.
And with Remyx, you can start experimenting faster by generating a test PR in your codebase based on the ideas found in your paper of choice.
Hub: https://hub.docker.com/u/remyxai
Remyx docs: https://docs.remyx.ai/resources/ideate
Coming soon, explore reproduced papers with AG2 + Remyx: https://github.com/ag2ai/ag2/pull/2141
posted an update 7 months ago
Post
1065
The future is arriving too fast not to use programmatic discovery and replication.
Search arXiv → Execute in 30 seconds with pre-built Docker environments
Check out our latest integration with AG2 to accelerate your discovery loop.
As easy as:
```python
from remyxai.client.search import SearchClient
from autogen.coding import RemyxCodeExecutor

# Search by topic
papers = SearchClient().search(
    "data synthesis strategies",
    has_docker=True,  # Only papers with pre-built environments
    limit=10,
)

# Create an executor for the top result's pre-built environment
executor = RemyxCodeExecutor(arxiv_id=papers[0].arxiv_id)

# Explore the paper's code against a stated goal
executor.explore(
    goal="Run a test with my model remyxai/SpaceThinker-Qwen2.5VL-3B",
    interactive=False,  # Automated exploration
)
```
Tutorial: https://github.com/ag2ai/ag2/blob/4c6954e3959fe672980191f264e30d451bc23554/notebook/agentchat_remyx_executor.ipynb
PR: https://github.com/ag2ai/ag2/pull/2141
posted an update 8 months ago
Post
3724
Thanks again to @ag2 for hosting us at their Community Talks!
@terry-remyx walked us through a technical deep dive into GitRank, our automated pipeline that converts research papers with code into containerized, executable environments and generates specialized tests tailored to users' specific codebases.
In case you missed it...
Full recording: https://www.youtube.com/watch?v=N_FNfZ71s2I
Deck: https://docs.google.com/presentation/d/1S0q-wGCu2dliVWb9ykGKFz61jZKZI4ipxWBv73HOFBo/edit?usp=sharing
posted an update 8 months ago
Post
2930
We're joining the @ag2 team on Discord for their Community Talks to present a deep dive into how we've used the framework to build GitRank.
The GitRank pipeline is used to:
* power personalized paper recommendations
* build environments as Docker images
* implement core methods as PRs for your target repo
Don't miss it! Tomorrow, Sept 25 at 9:00 am PST: https://calendar.app.google/3soCpuHupRr96UaF8
posted an update 8 months ago
Post
1512
We've added intelligent full-text search across our pre-built Docker images for arXiv papers with ready-to-run code and papers straight from arXiv.
Natural language queries.
Semantic understanding.
One search to find both the paper AND the runnable code.
Try it today: https://engine.remyx.ai/resources/
Join us at Experiment 2025: https://experiment.remyx.ai
