Summarizing and Reviewing my earliest ML research paper, 7 years later
LawrenceC · 2026-04-19 · via LessWrong
Written very quickly for the Inkhaven Residency.

Yesterday, I started a post-mortem on my earliest published machine learning research paper, “The Assistive Multi-Armed Bandit”, by providing the context for my paper as well as the timeline of the project. Today, I’ll summarize and review the actual paper.

Paper summary

As noted in the abstract and introduction, a commonly studied problem in many fields is preference learning: inferring what people want by observing their behavior. For both philosophical and practical reasons, it’s pretty common to infer preferences by assuming that people are (close to) rational. The paper tries to formalize a particular way in which this rationality assumption fails: people often do not know what their preferences are ahead of time, and observations of behavior often occur when people are still learning what they want.

The formalism studied in the paper is probably the simplest version of an assistance game where a human is learning about their rewards. In this setup, there are N possible actions, each with its own independent distribution of rewards. At each step, the human suggests an action, the robot takes a (possibly different) action, and then the human gets to observe the results of each action. (In section V.E we consider alternative setups – one where the human and robot alternate turns, and the other in which the robot can choose between acting itself and allowing the human to act.) As the value of the actions stays constant over time, this is a multi-armed bandit setting.

Caption: Figure 1 from the assistive bandits paper, which introduces the formalism. As an aside, “A” was used for the robot (“agent”, “actor”, or “AI”) instead of “R”, to avoid confusing the robot and its actions with the reward r observed by the human.

We then prove some theoretical results about this setting.
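To make the interaction protocol concrete, here is a minimal Python sketch of one episode: the human suggests an arm, the robot picks a (possibly different) arm, and the human updates on the outcome. It rests on assumptions that are not from the paper: rewards are Bernoulli, the human observes the reward of the arm the robot actually pulled, and `GreedyHuman`, `run_episode`, and `pass_through` are hypothetical names.

```python
import random


class GreedyHuman:
    """Hypothetical human model: suggests the arm with the best observed mean."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.totals = [0.0] * n_arms

    def suggest(self):
        # Untried arms default to a mean of 0.0; ties go to the lowest index.
        means = [t / c if c else 0.0 for t, c in zip(self.totals, self.counts)]
        return max(range(len(means)), key=means.__getitem__)

    def observe(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


def pass_through(t, suggestion):
    """Baseline robot that always does what the human suggests."""
    return suggestion


def run_episode(arm_means, human, robot_policy, horizon, rng):
    """Run one assistive-bandit episode and return the arms the robot pulled."""
    pulled = []
    for t in range(horizon):
        suggestion = human.suggest()       # human suggests an arm
        arm = robot_policy(t, suggestion)  # robot may override the suggestion
        # Bernoulli reward (an assumption made for this sketch).
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        human.observe(arm, reward)         # human learns from the outcome
        pulled.append(arm)
    return pulled


rng = random.Random(0)
arms = run_episode([0.2, 0.8], GreedyHuman(2), pass_through, horizon=50, rng=rng)
print(sorted(set(arms)))
```

With this purely greedy human and a robot that never overrides, the pair only ever pulls the first arm: an untried arm's default mean of 0 can never beat an arm that has already been tried, and ties resolve to arm 0. That is an artifact of this toy human model, but it hints at why a robot might need to explore on the human's behalf.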
First, proposition 1 is a rather trivial proposition that basically says that if the human knows the expected value of each action and almost always suggests the best action, then the robot can do quite well (achieve finite regret) by just taking the action the human suggests the most often. I see this as an interesting exercise mainly to check that we understand the setting: inferring the optimal action is easy when the person almost always suggests the optimal action. The difficulty of the assistive bandits framing comes entirely from the fact that the human is still learning about their preferences.

Proposition 2 shows that if the human implements a policy where they take the action with the highest observed average value most often, then with robot assistance, the pair will eventually converge to taking the best action (that is, the policy is consistent). This is done by making the robot explore for the human at a carefully chosen rate, such that each action is taken infinitely often, but the probability of exploring at any given time step eventually decays to 0. This is a nice implementation of ideas from multi-armed bandit theory.

Proposition 3 follows somewhat non-trivially from proposition 1 (consistency is weaker than finite regret). But it makes sense that I’d state it separately, since the proof is much easier.

Proposition 4 is the main result of the paper: it is indeed important to model learning, in that there are situations where not doing so leads to bad outcomes even in the limit, because a robot that doesn’t respect the human’s learning process may prevent the human from learning about their preferences.

Proposition 5 shows that there are simple, informative human policies that do better than “optimal” policies. That is, you can do better in collaborative games by informing others of your preferences, rather than by doing the best thing in isolation.
The paper claims that this is “surprising”, but I felt this was pretty obvious.

Proposition 6 and corollary 7 feel the most proof-of-worky to me. They basically expand on proposition 5, by arguing that in this setting (via some somewhat intricate math), if the robot is to assist the human, the human must communicate sufficient information about the optimal arm.

The paper then uses deep reinforcement learning + recurrent neural networks to solve for robot policies in specific versions of the assistive bandit problem. Using this technique, the paper basically confirms all of the theoretical results in their experimental setting.

Finally, in section V.E, the paper considers other variants of the assistive bandit problem. The paper concludes with obvious limitations and future work: the human policies are unrealistic, as is the simplicity of the environmental setup (that is, there is no environment).

Feedback on the paper

It’s hard not to be too critical toward my past self, but I’ll make an honest attempt to evaluate the paper. I think, as a “first paper”, it was pretty okay but not brilliant.

I still think that the core idea behind the paper is good: it is indeed quite important to model human suboptimality when trying to infer preferences from observed behavior. I also think the specific form of suboptimality studied in the paper – that people often can only access their preferences by experiencing them, rather than having full introspective access to their preferences at the start – is obviously real. The paper also shows that, at the time, I had a decent level of familiarity with both multi-armed bandit theory and deep reinforcement learning.

In this sense, the paper does indeed study the simplest setting with this form of irrationality, where all of the complexity of the setting comes from the fact that the human is still learning about their preferences.
I worry that, by studying such a restrictive setting, the only conclusions

I also think, with the exception of some nitpicks, the writing of the paper is quite good: every proposition is set up and explained. I also like the parallel structure between the theory and the experiments. The graphic design was also quite consistently clean.

The main issue I have with the paper is that the setting may be so simplified as to be uninteresting. Specifically, in my opinion, the main contributions of the paper (if anything) are proposition 4 – when the human is learning about their preferences, it’s important to model the learning – and arguably the use of deep RL to solve the complex POMDP. I feel like most of the other parts of the paper weren’t great, and insofar as they provided value they were proof of competency/proof of work so that these contributions would be taken seriously.

A related problem is that, even if we were to accept that people are learning about their preferences, it’s not at all obvious how to incorporate it into actual algorithms. Notably, the paper completely sidesteps this by using black-box optimization to solve the setting.

The paper would be a lot stronger if the other theorems and a bunch of the space dedicated to experiments were replaced with a simple application of these ideas with real human data (even if in a toy setting). This would also help address the question of how to actually use this insight in practice.

As is, my evaluation of the paper would depend a lot on the novelty of the main contribution (“it can be important to model human learning”) at the time it was written. Based on my recollection, at least in the circles I was writing for, this was a novel insight.
But I also think it was not that useful (modelling human learning explicitly is hard, so much so that

Nits/Small Improvements

From reading the paper, here are some more minor issues that nonetheless stood out to me:

A serious issue I have with the way the paper is framed is that the paper makes no reference to CIRL, which it was clearly (and actually) inspired by. I think this was an overreaction to the reviewers from ICLR 2018, and I’d probably suggest rewriting the paper to at least mention CIRL.

I also don’t like that the paper doesn’t have a related work section, even though it does cite most of the related work I’m aware of. I think this makes the paper’s contributions less clear. (My guess is this was cut for space?)

I really dislike the garden path sentence that starts the abstract: “Learning preferences implicit in the choices humans make is a well studied problem in both economics and computer science.” Today, I’d probably write this exact sentence as either “Inferring preferences from observed human behavior is a well-studied problem in economics and computer science.” or “A well-studied problem in both economics and computer science is learning preferences by observing human behavior.” The phrase “Learning preferences implicit in the choices humans make” is quite hard to parse on a first read.

If I were back in 2018 – before the recent rise in LLM writing made en- and em-dashes a sign of slop, and back when I was a huge fan of using both dashes in my own writing – I’d probably suggest “Preference learning — inferring what people want from the choices they make — is a well-studied problem in both economics and computer science.” This avoids the garden path-y structure of the current sentence while still placing the topic of the paper (“preference learning”) at the front of the abstract.

I really dislike the overuse of italics (\emph{}) for emphasis, especially in the introduction.
Using a few italics for emphasis is good, and it’s fine to use them for introducing new terms (if done so consistently), but using them several times per paragraph lessens the impact of the italics. I’d probably

A nitpick: I mistakenly used hyphens in place of en- or em-dashes several times in the paper. (Adam Gleave told me about this mistake in 2019.)

It’s not obvious from the paper that the policies learned by the RNNs are actually optimal – I wish I had included more of the results where I verified the optimality of the policies with classic POMDP solvers.
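As a closing illustration of the exploration idea behind proposition 2, the sketch below has the robot override a greedy human with a probability that decays over time. The schedule 1/(t+1) is an assumption chosen for simplicity (it tends to 0 while its sum diverges, so exploration never stops entirely); the rate actually used in the paper is not stated here. The Bernoulli rewards, the always-greedy human, and the names `exploration_rate`, `make_exploring_robot`, and `simulate` are likewise hypothetical.

```python
import random


def exploration_rate(t):
    """Assumed schedule: tends to 0, but its sum over t diverges."""
    return 1.0 / (t + 1)


def make_exploring_robot(n_arms, rng):
    """Robot that sometimes overrides the human's suggestion with a random arm."""
    def policy(t, suggestion):
        if rng.random() < exploration_rate(t):
            return rng.randrange(n_arms)  # explore on the human's behalf
        return suggestion                 # otherwise defer to the human
    return policy


def simulate(arm_means, horizon, seed=0):
    """Greedy human plus exploring robot on Bernoulli arms; returns pulled arms."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    robot = make_exploring_robot(n_arms, rng)
    pulled = []
    for t in range(horizon):
        # Greedy human: suggest the arm with the best observed mean (0.0 if untried).
        means = [totals[i] / counts[i] if counts[i] else 0.0 for i in range(n_arms)]
        suggestion = max(range(n_arms), key=means.__getitem__)
        arm = robot(t, suggestion)
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        pulled.append(arm)
    return pulled


pulls = simulate([0.2, 0.8], horizon=2000)
print("fraction of pulls on the better arm:", pulls.count(1) / len(pulls))
```

Swapping in a faster-decaying schedule (for example 1/(t+1)^2, whose sum converges) is one way to see how the choice of rate matters: exploration would then be expected to stop after finitely many steps.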