
Sunday, August 22, 2010

Stochastic Shortest Path Reduction to Cost-Sensitive Multiclass: Towards Better Regret Bounds

The regret bound for the path grid reduction of stochastic shortest path (SSP) without recourse to cost-sensitive multiclass classification (CSMC) was useful for showing consistency of the reduction, i.e., that a minimum regret path choosing policy results from a minimum regret cost-sensitive multiclass classifier. However, the bound felt too loose: in particular, the $O(n^2)$ scaling factor on the regret did not seem consistent with the direct reduction to regression, which had an $n^{3/2}$ scaling factor. Intuitively, it feels like I should be able to reduce to regression via CSMC and end up with the same bound, which suggests an $O(n)$ scaling factor for the CSMC reduction.
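To make that intuition concrete, here's the back-of-the-envelope composition I have in mind, taking as given the $2\sqrt{n}\,\sqrt{\epsilon}$ form of the per-subproblem CSMC-to-regression bound used further below: if the SSP-to-CSMC reduction contributed a scaling factor $c(n)$ on the worst per-subproblem CSMC regret $q$, composing the two would give $$r_{spath}(\Psi) \leq c(n)\, q \leq c(n)\, 2\sqrt{n}\, \sqrt{\epsilon},$$ and matching the $O(n^{3/2})\,\sqrt{\epsilon}$ scaling of the direct reduction to regression forces $c(n) = O(n)$ rather than the $O(n^2)$ of the existing bound.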

So I've been trying to come up with something along those lines. I don't have anything I'm super happy with, but here are some thoughts. First, here's a theorem that says that the conditional SSP regret is bounded by the conditional cost-sensitive regret summed across subproblems encountered on the SSP regret minimizing path.
Theorem: Optimal Path Regret Bound
For fixed $x$, let $p(x)$ denote the set of $(k, v)$ pairs encountered on the SSP regret minimizing path of length $n$ from source vertex $s$ to target vertex $t$. Then for all SSP distributions and all cost-sensitive classifiers $\Psi$, $$r_{spath}(\Psi) \leq E_{x \sim D_x} \left[ \sum_{(k, v) \in p(x)} q_{kv}(\Psi|x) \right],$$ with the definition $\forall v,\; q_{n-1,v}(\Psi|x) = q_{nv}(\Psi|x) = 0$.

Proof: Let $\Upsilon_{kv}(x)$ be the set of $(k', v')$ pairs encountered on the SSP regret minimizing path of length $n - k$ from source vertex $v$ to target vertex $t$. The proof is inductive on the property $r_{kv}(\Psi|x) \leq \sum_{(k', v') \in \Upsilon_{kv}(x)} q_{k'v'}(\Psi|x)$.

When $k = n - 2$, it is easily verified directly that $r_{n-2,v}(\Psi|x) = q_{n-2,v}(\Psi|x)$.
For $k \in [1, n - 2)$, $$r_{kv}(\Psi|x) \leq q_{kv}(\Psi|x) + r_{k+1,w}(\Psi|x) \leq q_{kv}(\Psi|x) + \sum_{(k', v') \in \Upsilon_{k+1,w}(x)} q_{k'v'}(\Psi|x) = \sum_{(k', v') \in \Upsilon_{kv}(x)} q_{k'v'}(\Psi|x),$$ where $w$ is the next vertex after $v$ on the SSP regret minimizing path, the first step is from the next step bound lemma, the second step leverages the induction hypothesis, and the third step is from the definition of $\Upsilon_{kv}(x)$. Noting that $\Upsilon_{1s}(x) = p(x)$ yields
$$r_{spath}(\Psi|x) = r_{1s}(\Psi|x) \leq \sum_{(k,v) \in p(x)} q_{kv}(\Psi|x).$$
Taking expectation with respect to $D_x$ completes the proof.
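As a quick sanity check of the telescoping, take $n = 4$ and suppose that for this $x$ the SSP regret minimizing path visits $s$, $v$, $w$, and $t$ at positions $1$ through $4$ (the vertex names are just for illustration), so $p(x) = \{(1, s), (2, v), (3, w), (4, t)\}$. The base case gives $r_{2,v}(\Psi|x) = q_{2,v}(\Psi|x)$, and one application of the next step bound lemma gives $$r_{spath}(\Psi|x) = r_{1s}(\Psi|x) \leq q_{1s}(\Psi|x) + r_{2,v}(\Psi|x) = q_{1s}(\Psi|x) + q_{2,v}(\Psi|x) = \sum_{(k,u) \in p(x)} q_{ku}(\Psi|x),$$ with the last equality using the boundary definition $q_{3,w}(\Psi|x) = q_{4,t}(\Psi|x) = 0$.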
This theorem is interesting to me for two reasons. First, it involves a sum over $n$ regrets, so it seems closer to a better bound; more about that further below. Second, it indicates that really only the classifiers along the optimal path matter; the important ones change as $x$ varies, and I don't know a priori which ones they are (that would be the solution to SSP I seek!), but nonetheless many of the subproblems being solved are irrelevant. Maybe there is a way to iteratively adjust the cost vectors to take this into account.
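As a side note on that second point, here's a minimal sketch of how the per-node classifiers get composed into a path at test time; the psi[(k, v)](x) interface returning the next vertex is just illustrative, not the actual reduction machinery. The point it makes is that for any particular $x$, only a single path's worth of classifiers is ever consulted, even though one is trained for every node of the path grid.

# Minimal sketch: composing per-position cost-sensitive classifiers into a
# path-choosing policy on the length-n path grid. The classifier interface
# (psi[(k, v)](x) -> next vertex) is hypothetical.
def choose_path(psi, x, source, target, n):
    path = [source]
    v = source
    for k in range(1, n - 1):      # one CSMC decision at positions 1, ..., n-2
        v = psi[(k, v)](x)         # consult only the classifier at (k, v)
        path.append(v)
    path.append(target)            # the final hop to the target is forced
    return path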

Returning to the question of a better bound, it's a short hop from the above theorem to the statement $$r_{spath}(\Psi) \leq (n - 2)\, E_{x \sim D_x}\left[\max_{k,v}\; q_{kv}(\Psi|x)\right],$$ which has the right flavor. For instance, if I reduce the individual CSMC subproblems to regression, I'd pick up an extra $2\sqrt{n}$ leading factor and a $\sqrt{\epsilon_{kv}}$ dependence on the regression regret for the subproblem. That would give $$r_{spath}(\Psi) \leq (n - 2)\, E_{x \sim D_x}\left[\max_{k,v}\; q_{kv}(\Psi|x)\right] \leq (n - 2)\, 2\sqrt{n}\, E_{x \sim D_x}\left[\max_{k,v}\; \sqrt{\epsilon_{kv}(\Psi|x)}\right] \leq (n - 2)\, 2\sqrt{n}\, \sqrt{E_{x \sim D_x}\left[\max_{k,v}\; \epsilon_{kv}(\Psi|x)\right]},$$ where $\max_a \sqrt{f_a} = \sqrt{\max_a f_a}$ and Jensen's inequality combine in the last step. This looks somewhat like the $O(n^{3/2})\,\sqrt{\epsilon}$ I get from a direct reduction to regression. However I'm not satisfied, because I don't have a good intuition about how the expectation of a single regression subproblem relates to an ``$L^{\infty}$ style'' expectation over the maximum of a set of regression subproblems.
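Spelling out that last combination of steps: monotonicity of the square root gives $\max_a \sqrt{f_a} = \sqrt{\max_a f_a}$, and concavity plus Jensen's inequality then move the expectation inside the square root, $$E_{x \sim D_x}\left[\max_{k,v}\; \sqrt{\epsilon_{kv}(\Psi|x)}\right] = E_{x \sim D_x}\left[\sqrt{\max_{k,v}\; \epsilon_{kv}(\Psi|x)}\right] \leq \sqrt{E_{x \sim D_x}\left[\max_{k,v}\; \epsilon_{kv}(\Psi|x)\right]}.$$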
