





















AI still can’t act in the physical world — but it can see, reason, and guide the human who does. Imagine wearing a small camera while an AI sees what you see and talks you through the job in real time: “turn off that valve,” “use the ⅜-inch wrench,” “that part looks worn — replace it.” Instead of months or years of training, workers can become effective immediately with AI coaching and on-demand access to new skills.
Think of coding agents, but for guiding physical work. Experts write skill folders (for something like HVAC repair), and users draw on these skills to do tasks.
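To make the skill-folder idea concrete, here is a minimal sketch of how one expert-written skill might be represented once loaded. Everything in it is an assumption for illustration: the `Skill` and `Step` names, the fields, and the example HVAC steps are hypothetical, not the format the app actually uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    instruction: str  # what the AI says to the worker
    done_when: str    # hypothetical: what the camera should show when the step is finished


@dataclass
class Skill:
    name: str
    tools: List[str]
    steps: List[Step] = field(default_factory=list)

    def next_instruction(self, completed: int) -> Optional[str]:
        """Return the instruction that follows `completed` finished steps, or None when done."""
        if completed < len(self.steps):
            return self.steps[completed].instruction
        return None


# Hypothetical example skill; the steps are illustrative, not real HVAC guidance.
hvac = Skill(
    name="replace_air_filter",
    tools=["screwdriver"],
    steps=[
        Step("Turn off the unit at the thermostat.", "thermostat display is off"),
        Step("Open the filter panel.", "panel is open"),
        Step("Swap the old filter for the new one.", "new filter is seated"),
    ],
)
```

Keeping a `done_when` description next to each instruction is one possible way to give the agent something to verify against before it moves on.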
I have already built this and am waiting for App Store approval right now. I have been able to do origami, cooking, and mic assembly, sort of. It's not smooth enough yet, and it also doesn't describe things well enough: saying "3/8 wrench" is easy, but how do you say "fold the paper along the x = y line" without sounding confusing and mathematical? Models just aren't built to describe such things, and it shows.
I am also facing the following problems:
- Agents are slow. The typical plan -> do -> correct -> replan cycle is bottlenecked at the 'do' stage because the human has to do the task instead of the agent. But also in general, AGENTS ARE SLOW. Thinking is faster in humans: it's one-shot, and when needed it's also well thought out in parallel. I want to figure out a way for both kinds of thinking to connect from time to time, otherwise they just seem out of sync. I also want to build a gate for when longer thinking should correct the answer given by shorter thinking.
- Verification is harder than in coding agents. Code pretty much spits out true or false, but here it's more of a 0-to-1 score, and physical tasks are a lot less deterministic. I have been thinking about how much value adding more sensors (heat, etc.) would bring to the product. I am guessing it's very task dependent.
- Static vs POV camera (phone vs Meta glasses): surprisingly, the phone has its own advantages. The FOV stays the same, so it's easier to distinguish between frames, and you can capture more of the workspace since it's not super up close like your eyes would be.
- Humans seem to have a 3D representation of the workspace in their mind. Even though we can't see what's behind us, something in our heads (not sure how it's stored) tells us that the hammer is behind us and there is a nail box roughly behind our foot. Additionally, we can feel things outside our view before identifying them. None of this is modelled in the agent.
- We think while we talk, we think while the other person talks, and we also think when no one talks. Agents only seem to do the third. Maybe I can do the second as well, but the first is tricky.
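Two of the ideas above, the gate that lets longer thinking override shorter thinking and verification as a 0-to-1 score instead of true/false, could be sketched roughly as below. This is only one possible shape, not how the app works: the `Answer` type, the confidence values, and the `margin`, `accept`, and `reject` thresholds are all made-up assumptions that would need tuning per task.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to be in [0, 1]; where it comes from is left open


def should_correct(fast: Answer, slow: Answer, margin: float = 0.2) -> bool:
    """Interrupt the user only if the slow answer disagrees and is clearly more confident."""
    if fast.text.strip().lower() == slow.text.strip().lower():
        return False  # same advice, nothing to correct
    return slow.confidence - fast.confidence >= margin


def judge_step(score: float, accept: float = 0.8, reject: float = 0.4) -> str:
    """Turn a 0-to-1 verification score into a three-way decision instead of true/false."""
    if score >= accept:
        return "done"
    if score < reject:
        return "redo"
    return "ask_user"  # uncertain band: let the human confirm


fast = Answer("Use the 3/8 wrench.", 0.5)
slow = Answer("Use the 1/2 wrench.", 0.9)
print(should_correct(fast, slow))  # slow answer disagrees and is far more confident
print(judge_step(0.6))             # falls in the uncertain band
```

The middle "ask_user" band is a design choice that leans on the human already being in the loop: when the camera alone can't settle whether a step is done, asking is cheaper than guessing.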
Either way... how does this idea sound?
1. Is it better than watching a YouTube tutorial, and is it a significant value add over sending a pic to ChatGPT and asking what to do? For coding it seems obvious because there used to be so much back and forth when building something, but to all my hardware folks: how often do you check your version of Stack Overflow when building something?
2. Probably a better question: what is the bottleneck when you are building something (say a robot, a car, etc.)? I plan on participating in more robotics hackathons just to get a better idea of this. In coding, the bottleneck has been getting x lines of code right so that the thing works. Claude Code basically does that for you, but what's the equivalent in physical tasks?
I have also been thinking that if robotics works out, this becomes moot, but:
1. It's going to take 15+ years for robotics to get good enough, IMO, although I do doubt this a lot sometimes.
2. I think people will still do some physical tasks themselves regardless of robotics working.