CS182/282A Spring 2023 1/23/23
Homing So · 2023-08-05 · via 叹世界


Standard Optimization-based paradigm for supervised learning

Table of Contents
  • Ingredients
  • Training via Empirical Risk Minimization
  • True Goal is Real World Performance on unseen $X$
  • Complication
  • Further Complication

Ingredients

  • Training Data $\left( i = 1, \ldots, n \right)$
  • $X_i$: inputs, covariates;
  • $Y_i$: outputs, labels
  • Model $f_\theta\left( \cdot \right)$, $\theta \Leftarrow$ parameters

Training via Empirical Risk Minimization

$$\hat{\theta} = \mathop{\arg\min}\limits_{\theta} \frac{1}{n} \sum\limits_{i=1}^{n} l_{train}\left( y_i, f_\theta\left( x_i \right) \right)$$

  • choose a $\hat{\theta}$, learned from the data, that minimizes the average training loss
  • $l\left( y, \hat{y} \right)$ returns a real number: $l$ is a loss that compares $y$ to a prediction $\hat{y}$. The labels and predictions may be vectors or numbers, but the loss always returns a single real number, which is something we can minimize.
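A minimal sketch of empirical risk minimization, assuming a hypothetical one-parameter model $f_\theta(x) = \theta x$ and squared error as $l_{train}$ (neither is specified in the notes). For this special case the arg min has a closed form, so no iterative optimizer is needed. The data values are made up for illustration.

```python
def empirical_risk(theta, xs, ys):
    """Average squared-error loss of the assumed model f_theta(x) = theta * x."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_erm(xs, ys):
    """Closed-form minimizer of the empirical risk for f_theta(x) = theta * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Illustrative (made-up) training data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

theta_hat = fit_erm(xs, ys)
print(theta_hat, empirical_risk(theta_hat, xs, ys))
```

Nudging `theta_hat` in either direction should not lower `empirical_risk`, which is a quick way to sanity-check the arg min.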

True Goal is Real World Performance on unseen $X$

mathematical proxy:

  • $\exists P\left( X, Y \right)$: assume a probability distribution
  • want a low $E_{X, Y}$ (expectation over $X$ and $Y$) of the loss $l\left( Y, f_\theta\left( X \right) \right)$

Complication

  1. We have no access to $P\left( X, Y \right)$

We want to do well on average on data we haven't seen. We assume that an average makes sense, that is, that there is some underlying distribution, but we don't know what it is.

Solution:

  • $\left( X_{test, i}, Y_{test, i} \right)_{i = 1}^{n_{test}}$: collect a test set of held-back data
  • $\text{Test Error} = \frac{1}{n_{test}} \sum\limits_{i=1}^{n_{test}} l\left( y_{test, i}, f_{\theta}\left( x_{test, i} \right) \right)$

We collect this test set as a somewhat faithful representation of what we expect to see in the real world. We hope that the real world behaves the way probability distributions do, so that averaging and sampling give us some predictive power about what will actually happen.
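The held-out evaluation can be sketched as follows, assuming a hypothetical one-parameter model $f_\theta(x) = \theta x$, squared error as the loss, and synthetic data drawn from a made-up distribution. The only point is that the test set is split off before fitting and is used for measurement, never for choosing $\theta$.

```python
import random


def mean_loss(theta, data):
    """Average squared error of f_theta(x) = theta * x over (x, y) pairs."""
    return sum((y - theta * x) ** 2 for x, y in data) / len(data)


def fit(data):
    """Closed-form least-squares fit of theta on the given (x, y) pairs."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)


rng = random.Random(0)

# Made-up samples from an assumed distribution: y = 3x + Gaussian noise.
data = []
for _ in range(200):
    x = rng.uniform(-2.0, 2.0)
    data.append((x, 3.0 * x + rng.gauss(0.0, 0.5)))

# Hold back the last 50 samples; the fitting step never sees them.
train, test = data[:150], data[150:]

theta_hat = fit(train)
print("train error:", mean_loss(theta_hat, train))
print("test error: ", mean_loss(theta_hat, test))
```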

In effect: "I don't know how to evaluate the true expectation, but I do know how to compute an average over held-back data, so I'll just do that."

  2. The loss $l_{true}\left( \cdot, \cdot \right)$ that we care about is incompatible with our optimizer

Computing the arg min requires some algorithm to go and search for the minimizing argument, and that algorithm only works under certain conditions. The loss you care about might not let it do what it needs to do.

For example, you may care about a loss that is not differentiable because that is what is practically relevant for your problem, but your minimizer uses derivatives, so it cannot work with that loss directly.

Solution:

  • $l_{train}\left( \cdot, \cdot \right)$: use a surrogate loss that we can work with.

Classic Example:

  1. $y \in \lbrace \text{cat}, \text{dog} \rbrace$, $l_{true}$: Hamming Loss

  2. $y \rightarrow \mathbf{R}$, where the training labels are mapped to:

    $$\left\{ \begin{array}{l} \text{cat} \rightarrow -1 \\ \text{dog} \rightarrow +1 \end{array} \right.$$
  3. $l_{train}$: squared error

When we're evaluating test error, we use $l_{true}$.
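The cat/dog recipe can be sketched as below. The one-dimensional feature, the linear model $f_\theta(x) = \theta_0 + \theta_1 x$, thresholding the prediction at 0, and the data are all assumptions made for illustration. Training minimizes squared error on the $\pm 1$ targets, while evaluation converts the real-valued prediction back to a label and reports the Hamming (0-1) loss.

```python
LABEL_TO_TARGET = {"cat": -1.0, "dog": +1.0}


def fit_least_squares(xs, ts):
    """Minimize mean squared error of theta0 + theta1 * x against targets ts."""
    n = len(xs)
    mx, mt = sum(xs) / n, sum(ts) / n
    cov = sum((x - mx) * (t - mt) for x, t in zip(xs, ts))
    var = sum((x - mx) ** 2 for x in xs)
    theta1 = cov / var
    return mt - theta1 * mx, theta1


def predict_label(theta, x):
    """Threshold the real-valued prediction at 0 to get back a class label."""
    theta0, theta1 = theta
    return "dog" if theta0 + theta1 * x >= 0 else "cat"


def hamming_loss(theta, xs, labels):
    """l_true: fraction of examples whose predicted label is wrong."""
    return sum(predict_label(theta, x) != y for x, y in zip(xs, labels)) / len(xs)


# Made-up feature values (say, body weight in kg) and labels.
xs = [3.0, 4.0, 4.5, 5.0, 9.0, 12.0, 20.0, 30.0]
labels = ["cat", "cat", "cat", "cat", "dog", "dog", "dog", "dog"]

theta_hat = fit_least_squares(xs, [LABEL_TO_TARGET[y] for y in labels])
print("Hamming loss on the training set:", hamming_loss(theta_hat, xs, labels))
```

The surrogate and the true loss need not agree: a squared-error fit can misclassify a point near the boundary even when some threshold separates the classes perfectly, which is why the evaluation uses $l_{true}$.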

The purpose of evaluating test error is to get a sense of how well you might do on real world data. It is an evaluation of a specific model that has already been optimized; no optimization happens on the test error.

Examples You should know:

  • binary classification: logistic loss, hinge loss
  • multi-class classification: cross-entropy loss
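For reference, the three surrogate losses named above can be written out as follows. The conventions are assumptions, since the notes do not fix them: binary labels are $y \in \lbrace -1, +1 \rbrace$ with a real-valued score, and the multi-class loss takes raw scores (logits) plus the index of the true class.

```python
import math


def logistic_loss(y, score):
    """log(1 + exp(-y * score)) for y in {-1, +1}."""
    margin = y * score
    # Numerically stable rewrite of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def hinge_loss(y, score):
    """max(0, 1 - y * score) for y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)


def cross_entropy_loss(true_class, logits):
    """Negative log of the softmax probability assigned to the true class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_class]


print(logistic_loss(+1, 2.0))
print(hinge_loss(+1, 2.0))
print(cross_entropy_loss(0, [2.0, 0.0, -1.0]))
```

Unlike the Hamming loss, the logistic and cross-entropy losses are differentiable in the score, and the hinge loss is differentiable everywhere except at margin 1, which is what makes them usable with derivative-based optimizers.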

Aside:

$\frac{1}{n} \sum\limits_{i=1}^{n} l_{true}\left( y_{i}, f_{\hat{\theta}}\left( x_{i} \right) \right)$, evaluated on the training set, is different from $\frac{1}{n} \sum\limits_{i=1}^{n} l_{train}\left( y_i, f_{\theta}\left( x_{i} \right) \right)$

Practically speaking, this quantity is a debugging tool for anyone training models.

We use it to check whether optimizing the training loss is doing anything reasonable with respect to the loss we actually care about, and to see how well we are actually doing. If there is a grotesque mismatch between what you told the optimizer to minimize and how you are doing on the thing you actually want, then something may be wrong.

$$\frac{1}{n_{test}} \sum\limits_{i=1}^{n_{test}} l\left( y_{test, i}, f_{\theta}\left( x_{test, i} \right) \right)$$

You want this to be a faithful measurement of how things might work in practice. But suppose you look at it and say "oh wait, I should have changed this", then go back, change the model, and look at it again. You are then running an optimization loop, with you as the optimizer, that looks at the held-back data, so this data isn't being held back anymore. And because it isn't being held back, you might not trust it as a measure of how well things will work in practice. (One perspective on the phenomenon of overfitting)

No such care is needed for $\frac{1}{n} \sum\limits_{i=1}^{n} l_{train}\left( y_{i}, f_{\theta}\left( x_{i} \right) \right)$: your optimization algorithm is already looking at this data all the time, so taking other views of it is cost free.

  3. You run your optimizer with your surrogate loss and get "crazy" values for $\hat{\theta}$, and/or you get really bad test performance. (Another perspective on the phenomenon of overfitting)

Solution:

  • Add an explicit regularizer during training: $\hat{\theta} = \mathop{\arg\min}\limits_{\theta} \left( \frac{1}{n} \sum\limits_{i=1}^{n} l_{train}\left( y_{i}, f_{\theta}\left( x_{i} \right) \right) \right) + R_{\lambda}\left( \theta \right)$, e.g. Ridge Regression: $R\left( \theta \right) = \lambda \| \theta \|^2$
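As a sketch of what the extra term does, assume a hypothetical one-parameter model $f_\theta(x) = \theta x$ with squared error and the ridge penalty $\lambda \theta^2$. Setting the derivative of $\frac{1}{n} \sum_i \left( y_i - \theta x_i \right)^2 + \lambda \theta^2$ to zero gives $\hat{\theta} = \sum_i x_i y_i / \left( \sum_i x_i^2 + n \lambda \right)$, so a larger $\lambda$ pulls $\hat{\theta}$ toward 0. The data are made up.

```python
def fit_ridge(xs, ys, lam):
    """Minimizer of mean((y - theta * x)^2) + lam * theta^2 over theta."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * lam)


# Illustrative (made-up) training data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 0.1, 1.0, 10.0):
    print(f"lambda={lam}: theta_hat={fit_ridge(xs, ys, lam):.4f}")
```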

Notice: we added another parameter $\lambda$. How do we choose it?

Naive hyperparameters: $\hat{\theta} = \mathop{\arg\min}\limits_{\theta, \lambda \geq 0} \left( \cdots \right)$

  • Split parameters into "normal" parameters $\theta$ and hyperparameters $\lambda$

"A hyperparameter is a parameter that, if you let the optimizer just work with it, would go crazy, so you have to segregate it out"

Hold out additional data (a validation set), and use that to optimize the hyperparameters.

When you optimize hyperparameters using the validation set, you might use a different kind of optimizer than the one used to find the normal parameters. In deep learning, the normal parameters are almost always fit with some variation of gradient descent. For hyperparameters, you might instead use a brute-force grid search, searches based on ideas such as multi-armed bandits, or other zeroth-order optimization algorithms. For some hyperparameter searches you can also use gradient-based approaches.
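A brute-force grid search over $\lambda$ might look like the sketch below. The model (a one-parameter ridge fit with a closed form), the synthetic data, the split sizes, and the grid values are all assumptions for illustration. The inner fit plays the role of the normal optimizer over $\theta$; the outer search is zeroth-order and only looks at the validation error.

```python
import random


def fit_ridge(data, lam):
    """Closed-form minimizer of mean((y - theta * x)^2) + lam * theta^2."""
    return sum(x * y for x, y in data) / (sum(x * x for x, _ in data) + len(data) * lam)


def mean_sq_error(theta, data):
    """Unregularized average squared error on (x, y) pairs."""
    return sum((y - theta * x) ** 2 for x, y in data) / len(data)


rng = random.Random(0)

# Made-up samples from an assumed distribution: y = 2x + Gaussian noise.
data = []
for _ in range(60):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))

# Three disjoint splits: fit theta on train, pick lambda on validation,
# and touch the test set only once at the very end.
train, val, test = data[:20], data[20:40], data[40:]

grid = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lam = min(grid, key=lambda lam: mean_sq_error(fit_ridge(train, lam), val))
theta_hat = fit_ridge(train, best_lam)

print("chosen lambda:", best_lam)
print("test error:", mean_sq_error(theta_hat, test))
```

Note that $\lambda$ is scored with the unregularized error on the validation set. Scoring it with the regularized training objective would always favor $\lambda = 0$, since the penalty is non-negative.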

Alternative Solution:

  • Simplify model: "Reduce model order"

Further Complication

  • The Optimizer might have its own parameters. e.g. learning rate

Generally, optimizers have their own tunable knobs. And in practice, as someone trying to do deep learning, you also have a discrete choice of which optimizer to use.

You will see two subtly different perspectives.

  • Most basic/root optimizer approach: Gradient Descent

Gradient Descent is an iterative optimization approach where you make improvements and you make them locally.

  • Idea: Change parameter a little bit at a time.

All you care about is how your loss behaves in the neighborhood of the parameters you're currently at.

So look at the local neighborhood of the loss around the current parameters $\theta_t$.

$$\theta_{t+1} = \theta_{t} + \eta \left( - \nabla_{\theta} L_{train, \theta} \right), \quad L_{train, \theta} = \frac{1}{n} \sum\limits_{i=1}^{n} l_{train}\left( y_{i}, f_{\theta}\left( x_{i} \right) \right) + R\left( \cdot \right)$$

This is a Discrete-time Dynamic System

$\eta \leftarrow$ "step size"/"learning rate"; this $\eta$ controls the stability of this system

$\eta$ too large: the dynamics go unstable (they oscillate)

$\eta$ too small: it takes too long to converge
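The stability claim can be checked on a toy problem. The sketch below assumes a hypothetical one-dimensional quadratic loss $L(\theta) = \frac{a}{2} \theta^2$, whose gradient is $a \theta$, so each update multiplies $\theta$ by $\left( 1 - \eta a \right)$. The iteration converges when $\left| 1 - \eta a \right| < 1$, that is $0 < \eta < 2 / a$, and oscillates with growing amplitude beyond that.

```python
def gradient_descent(grad, theta0, eta, steps):
    """Iterate theta_{t+1} = theta_t + eta * (-grad(theta_t)); return all iterates."""
    thetas = [theta0]
    for _ in range(steps):
        thetas.append(thetas[-1] - eta * grad(thetas[-1]))
    return thetas


a = 4.0  # curvature of the assumed loss L(theta) = 0.5 * a * theta**2


def grad(theta):
    """Gradient of the assumed quadratic loss."""
    return a * theta


# With a = 4 the stability limit is eta < 2 / a = 0.5.
for eta in (0.001, 0.25, 0.45, 0.6):
    final = gradient_descent(grad, theta0=1.0, eta=eta, steps=50)[-1]
    print(f"eta={eta}: theta after 50 steps = {final:.3e}")
```

The smallest step size is stable but has barely moved after 50 steps, while the largest one flips sign every step and blows up.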

$$L_{train}\left( \theta_t + \Delta\theta \right) \approx L_{train}\left( \theta_t \right) + \underbrace{\left. \frac{\partial}{\partial\theta} L_{train} \right|_{\theta_t}}_{\text{"row"}} \Delta\theta$$

The transpose of this "row" is called the gradient
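One way to connect this expansion to the update rule (a standard first-order argument, not spelled out in the notes): substitute the gradient descent step $\Delta\theta = -\eta \nabla_{\theta} L_{train}$ into the linearization, with all gradients evaluated at the current parameters $\theta_t$.

$$L_{train}\left( \theta_t + \Delta\theta \right) \approx L_{train}\left( \theta_t \right) + \left( \nabla_{\theta} L_{train} \right)^{\top} \left( -\eta \nabla_{\theta} L_{train} \right) = L_{train}\left( \theta_t \right) - \eta \left\| \nabla_{\theta} L_{train} \right\|^2$$

So to first order, a step with $\eta > 0$ lowers the loss by $\eta \left\| \nabla_{\theta} L_{train} \right\|^2$. The approximation is only trustworthy for small $\Delta\theta$, which is one way to see why $\eta$ cannot be made arbitrarily large.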