There is a problem I've worked on before which appears similar to SSP, namely, the problem of picking $m$ advertisements from a candidate set of $n$ advertisements. Ignoring positional effects and inter-ad externalities, this is characterized by a regret function $$ r_{ad}(h) = E_{(x, f) \sim D}\left[ \sum_{k=1}^m f(h_k(x)) \right] - \min_{h \in X \to S_m} E_{(x, f) \sim D}\left[ \sum_{k=1}^m f(h_k(x)) \right], $$ where $S_m = \{ a \in A^m \,|\, i \neq j \Rightarrow a_i \neq a_j \}$ is the set of feasible result vectors, $A$ is the set of advertisements, $h: X \to S_m$ is a result vector choosing function, $x$ is the feature vector associated with a problem instance, $f$ gives the (negative) cost of each ad for that instance, and $D = D_x \times D_{f|x}$ is the distribution of problem instances. Call this the Ad Selection Problem (ASP).
Reduction to Regression
Back when I first encountered this problem (over 10 years ago), reduction to regression seemed the natural (only?) way to solve it. A regression reduction for ASP is a function $g: X \times A \to \mathbb{R}$ which defines an associated result vector choosing function $h_g: X \to S_m$ via $$ h_g(x) = \operatorname*{arg\,min}_{h \in S_m} \sum_{k=1}^m g(x, h_k). $$ In other words, estimate the costs and then choose the $m$ ads with the lowest estimated costs (breaking ties arbitrarily). I would like to bound $r_{ad}$ in terms of the regret on the underlying regression problem, $$ \epsilon_L(g) = \nu_L(g) - \min_{g'} \nu_L(g'), $$ defined in terms of the error on the underlying regression problem, $$ \nu_L(g) = E_{(x, f) \sim D}\left[ \frac{1}{|A|} \sum_{a \in A} L(g(x, a), f(a)) \right]. $$ Using techniques similar to those used to analyze the reduction of SSP without recourse to regression, I can show the following.
Theorem: ASP Regression Regret Bound
For all ASP distributions $D$ and regressors $g$, $$ r_{ad}(h_g) \leq \sqrt{2 \min\{m, |A| - m\}} \sqrt{|A|} \sqrt{\epsilon_{L^2}(g)}. $$ Proof: See appendix.
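Concretely, the induced chooser $h_g$ just ranks ads by estimated cost and keeps the $m$ cheapest. Here is a minimal sketch in Python, assuming a trained regressor `g(x, a)` that returns an estimated cost (the names and interface are illustrative):

```python
import heapq

def h_g(x, candidates, g, m):
    """Result vector chooser induced by a regressor g: X x A -> R.

    Selects the m ads with the smallest estimated costs g(x, a);
    ties are broken arbitrarily (here, by iteration order).
    """
    return heapq.nsmallest(m, candidates, key=lambda a: g(x, a))
```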
Reduction to CSMC
Although the regression reduction analysis for ASP was very similar to SSP without recourse, the following consistent reduction of ASP to CSMC does not bear much resemblance to the reduction of SSP without recourse to CSMC.
It is the uniqueness constraint that frustrates the construction of a sequence of CSMC subproblems, so instead of operating in $S_m$, I'll operate in $A^m$, but I'll dictate that duplicate occurrences of the same ad have zero value. For a result vector selector $\tilde h: X \to A^m$ we have the associated duplicate-sensitive regret $$ \tilde r_{ad}(\tilde h) = E_{(x, f) \sim D}\left[ \sum_{k=1}^m f(\tilde h_k(x)) \prod_{j<k} 1_{\tilde h_k(x) \neq \tilde h_j(x)} \right] - \min_{h \in X \to A^m} E_{(x, f) \sim D}\left[ \sum_{k=1}^m f(h_k(x)) \prod_{j<k} 1_{h_k(x) \neq h_j(x)} \right]. $$ Assuming that all costs are strictly nonpositive (i.e., every ad has a nonnegative value for being shown), the minimizer will not contain duplicates, so $r_{ad}(h) = \tilde r_{ad}(h)$ for any $h$ whose range is $S_m$. So the strategy will be to bound $\tilde r_{ad}$ by a CSMC reduction, and then force the learned $h$ to take values only in $S_m$.
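To make the duplicate-sensitive objective concrete, here is a minimal sketch (assuming `f` maps an ad to its realized cost) of the cost a possibly duplicate-containing result vector actually accrues:

```python
def duplicate_sensitive_cost(result, f):
    """Realized cost of a result vector in A^m, where repeated
    occurrences of an ad contribute zero, mirroring the indicator
    product prod_{j<k} 1_{h_k != h_j} in the regret definition."""
    seen = set()
    total = 0.0
    for ad in result:
        if ad not in seen:   # first occurrence incurs its cost
            total += f(ad)
            seen.add(ad)
        # later duplicates contribute nothing
    return total
```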
Training proceeds in rounds from 1 to $m$; on each round a cost-sensitive classifier learns to pick the best ad still available. Picking an ad that has already been chosen is allowed but provides zero incremental value. At test time, the classifier from each round is asked to choose an ad; any duplicates that are encountered are thrown away and replaced with random unique ads. This de-duplication procedure ensures that $r_{ad}(h_\Psi) \leq \tilde r_{ad}(\tilde h_\Psi)$, where $\tilde h_\Psi: X \to A^m$ is the potentially duplicate-generating learned result generator, and $h_\Psi: X \to S_m$ is the de-duplicated version of the learned result generator. In general, this trick will work anytime the costs have an upper bound.
Algorithm: Set Select Train
Data: Candidate set $A$, number of positions to populate $m$.
Data: Training data set $S$.
Input: Cost-sensitive classification routine Learn.
Result: Trained classifiers $\{\Psi_k \,|\, k \in [1, m]\}$.
- Define $\gamma_0(\cdot) = \emptyset$.
- For each position $k$ from 1 to $m$:
  - $S_k = \emptyset$.
  - For each example $(x, f) \in S$:
    - Let $\gamma_{k-1}(x)$ be the predicted best set from the previous iteration.
    - For each ad $a$, let $c_a(x, k) = 1_{a \notin \gamma_{k-1}(x)} f(a)$.
    - $S_k \leftarrow S_k \cup \{ (x, c(x, k)) \}$.
  - Let $\Psi_k = \mbox{Learn}(S_k)$.
  - Let $\gamma_k(x) = \Psi_k(x) \cup \gamma_{k-1}(x)$.
- Return $\{\Psi_k \,|\, k \in [1, m]\}$.
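Here is a minimal sketch of Set Select Train, with `learn_csmc` standing in for whatever cost-sensitive multiclass learner is plugged into the reduction (the name and interface are hypothetical):

```python
def set_select_train(candidates, S, learn_csmc, m):
    """Sketch of Set Select Train.

    candidates: the ad set A
    S: list of (x, f) pairs, where f maps each ad to its (nonpositive) cost
    learn_csmc: cost-sensitive learner, list of (x, costs) -> classifier
    m: number of positions to populate
    """
    classifiers = []
    gamma = [set() for _ in S]  # gamma[i]: ads already chosen for example i
    for k in range(m):
        Sk = []
        for i, (x, f) in enumerate(S):
            # an ad already in gamma_{k-1}(x) provides zero incremental value
            costs = {a: (0.0 if a in gamma[i] else f(a)) for a in candidates}
            Sk.append((x, costs))
        psi_k = learn_csmc(Sk)          # cost-sensitive classifier for round k
        classifiers.append(psi_k)
        for i, (x, _) in enumerate(S):  # update the predicted best sets
            gamma[i].add(psi_k(x))
    return classifiers
```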
Algorithm: Set Select Test
Data: Candidate set $A$, number of positions to populate $m$.
Data: Instance feature realization $x$.
Data: Trained classifiers $\{\Psi_k \,|\, k \in [1, m]\}$.
Result: Set $h_\Psi: X \to S_m$.
- For each $k \in [1, m]$: $\tilde h_{\Psi, k}(x) = \Psi_k(x)$.
- $h_\Psi(x) = \mbox{Dedup}(\tilde h_\Psi(x))$:
  - Arbitrarily replace duplicate ads with unique ads, e.g., at random.
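And a minimal sketch of Set Select Test, including the de-duplication step (random replacement is just one arbitrary choice):

```python
import random

def set_select_test(candidates, classifiers, x, rng=random):
    """Query each round's classifier, then replace any duplicate
    choices with randomly chosen unused ads, yielding an element of S_m."""
    result = [psi_k(x) for psi_k in classifiers]   # tilde-h_Psi(x), may repeat ads
    unused = [a for a in candidates if a not in set(result)]
    rng.shuffle(unused)
    seen, deduped = set(), []
    for ad in result:
        if ad in seen:        # duplicate: swap in a random unique ad
            ad = unused.pop()
        seen.add(ad)
        deduped.append(ad)
    return deduped            # h_Psi(x)
```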
The cost-sensitive regret $q(\Psi)$ appearing in the theorem below is measured on the induced distribution defined by the following procedure, where $\gamma_{k-1}(x) = \{ \Psi_j(x) \,|\, j < k \}$ is the set of ads chosen by the earlier classifiers:
- Draw $(x, f)$ from $D$.
- Draw $k$ uniform on $[1, m]$.
- Let $x' = (x, k)$.
- Let $c(a) = 1_{a \notin \gamma_{k-1}(x)} f(a)$ for $a \in A$.
- Create cost-sensitive example $(x', c)$.
Theorem: ASP CSMC Regret Bound
For all ASP distributions $D$ where $f \leq 0$ a.s., and for all cost-sensitive classifiers $\Psi$, $$ r_{ad}(h_\Psi) \leq m \, q(\Psi). $$ Proof: Consider a fixed $x$. It will be useful to talk about $\tilde r_k(\tilde h_\Psi | x)$, the duplicate-sensitive regret for a result of size $k$, $$ \tilde r_k(\tilde h_\Psi | x) = E_{f \sim D_{f|x}}\left[ \sum_{n=1}^k f(\Psi_n(x)) \prod_{j<n} 1_{\Psi_n(x) \neq \Psi_j(x)} \right] - \min_{h \in A^k} E_{f \sim D_{f|x}}\left[ \sum_{n=1}^k f(h_n) \prod_{j<n} 1_{h_n \neq h_j} \right]. $$ Let $$ h^*(k, x) = \operatorname*{arg\,min}_{h \in A^k} E_{f \sim D_{f|x}}\left[ \sum_{n=1}^k f(h_n) \prod_{j<n} 1_{h_n \neq h_j} \right] $$ be a minimizer of the second term (which is unique up to permutations and ties); note any $h^*(k, x)$ selects the $k$ ads with the smallest conditional expected cost. Then $$ \begin{aligned} \tilde r_k(\tilde h_\Psi | x) - \tilde r_{k-1}(\tilde h_\Psi | x) &= E_{f \sim D_{f|x}}\left[ f(\Psi_k(x)) \prod_{j<k} 1_{\Psi_k(x) \neq \Psi_j(x)} \right] - \min_{a \in A} E_{f \sim D_{f|x}}\left[ f(a) \prod_{j<k} 1_{a \neq h^*_j(k-1, x)} \right] \\ &\leq E_{f \sim D_{f|x}}\left[ f(\Psi_k(x)) \prod_{j<k} 1_{\Psi_k(x) \neq \Psi_j(x)} \right] - \min_{a \in A} E_{f \sim D_{f|x}}\left[ f(a) \prod_{j<k} 1_{a \neq \Psi_j(x)} \right] \\ &= q_k(\Psi | x), \end{aligned} $$ where the inequality follows from the optimality of $h^*(k-1, x)$. Since $\tilde r_1(\tilde h_\Psi | x) = q_1(\Psi | x)$, we have $$ \tilde r_{ad}(\tilde h_\Psi | x) = \tilde r_m(\tilde h_\Psi | x) \leq \sum_{k=1}^m q_k(\Psi | x) = m \, q(\Psi | x). $$ Taking expectations with respect to $D_x$ and noting that $f \leq 0$ a.s. implies $r_{ad}(h_\Psi) \leq \tilde r_{ad}(\tilde h_\Psi)$ completes the proof.
The reduction is consistent but seems inefficient. This is because reducing to regression via CSMC results in a $m \sqrt{|A|} \sqrt{\epsilon_{L^2}}$ bound, whereas reducing to regression directly results in a $\sqrt{m} \sqrt{|A|} \sqrt{\epsilon_{L^2}}$ bound, which is $\sqrt{m}$ better.
This technique will also work whenever $f$ is a.s. bounded above by a constant, since the constant can be subtracted from all costs to create a derived problem where $f \leq 0$ a.s. Effectively what's needed is a mechanism for penalizing the constraint violation (creating a duplicate) with a cost that is guaranteed to be the worst possible cost; that way, any correction of the constraint violation is guaranteed to improve the solution with respect to the original regret. (Update: having re-read this, I feel like the weaker condition $E_{f \sim D_{f|x}}[f(a)] \leq 0$ for all $a$ and a.s. $x$ is sufficient.)
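As a minimal sketch of that observation, assuming a known a.s. upper bound `B` on the costs:

```python
def shift_costs(f, B):
    """Derive nonpositive costs from costs bounded above by B a.s.

    For feasible (duplicate-free) result vectors the total cost shifts
    by the same constant -m*B, so the ASP regret is unchanged; and since
    the shifted costs are nonpositive, a duplicate (zero incremental
    value) is never preferable to showing an unused ad."""
    return lambda a: f(a) - B
```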
The above reduction assumes historical data sampled i.i.d. from $D$, which is unrealistic because in practice we only see the values associated with advertisements that are shown. So what would be really cool would be to compose this reduction with the offset tree to get something that solves the warm start problem.
Another note of impracticality: in realistic ad serving settings the set of candidate ads is highly volatile, e.g., due to budget exhaustion or per-user frequency capping. Reduction to regression deals with this very naturally (by restricting the scope of the arg min). Dealing with that in the CSMC reduction is more awkward, but if filtration rates are modest perhaps learning $2m$ classifiers would be sufficient (this could also allow a better dedup strategy). Reduction to regression also deals somewhat well with the introduction of new advertisements into the system (assuming a feature space such that the regressor generalizes reasonably to novel ads), and I have no good ideas on how to make the CSMC reduction capable of handling that.
Appendix: Here's the regression reduction analysis.
Consider an instance $x$ with associated value distribution $D_{f|x}$, and suppose our regressor differs from a minimum error regressor's estimate $f^*(a)$ by $\delta(a)$ for all $a \in A$, $$ g(a) = f^*(a) + \delta(a). $$ Imagine an adversary which attempts to create a certain amount of ASP regret on this instance while minimizing the amount of regression regret on this instance. For $L^2$ loss specifically, the adversary is faced with the following family of choices indexed by $h^{**}$, $$ \begin{aligned} &\min_\delta \sum_{a \in A} \delta(a)^2 \\ &\mathrm{s.t.}\ \forall h \neq h^{**}, \quad \sum_{a \in h^{**}} \left( f^*(a) + \delta(a) \right) \leq \sum_{a \in h} \left( f^*(a) + \delta(a) \right), \end{aligned} $$ where $h^{**}$ corresponds to the decision made by $h_g$. This simplification leverages the special property of $L^2$ loss that the regret-minimizing estimator is the conditional mean, $f^*(a) = E_{f \sim D_{f|x}}[f(a)]$.
The KKT stationarity condition implies $$ 2 \delta(a) + \sum_{h \neq h^{**}} \left( 1_{a \in h^{**}} - 1_{a \in h} \right) \mu_h = 0. $$ Complementary slackness indicates $\mu_h = 0$ unless the constraint for $h$ is active. Let $$ h^* = \operatorname*{arg\,min}_{h \in S_m} E_{f \sim D_{f|x}}\left[ \sum_{a \in h} f(a) \right] $$ be the true optimal result vector, in which case only $\mu_{h^*}$ need be nonzero. The stationarity condition simplifies to $$ \delta(a) = \begin{cases} \mu_{h^*} / 2 & \mbox{if } a \in h^* \mbox{ and } a \notin h^{**}; \\ -\mu_{h^*} / 2 & \mbox{if } a \notin h^* \mbox{ and } a \in h^{**}; \\ 0 & \mbox{otherwise.} \end{cases} $$ Intuitively, since errors are penalized convexly, the adversary does best by concentrating equal and opposite error on those advertisements which are exclusively in either $h^{**}$ or $h^*$. Let $\kappa(h^*, h^{**}) = |h^*| + |h^{**}| - 2 |h^* \cap h^{**}|$. Substituting into the inequality constraint yields $$ \begin{aligned} \sum_{a \in h^{**}} f^*(a) - \sum_{a \in h^*} f^*(a) &\leq \sum_{a \in h^*} \delta(a) - \sum_{a \in h^{**}} \delta(a) \\ &= \kappa(h^*, h^{**}) \sqrt{\frac{|A|}{\kappa(h^*, h^{**})}} \sqrt{\frac{1}{|A|} \sum_{a \in A} \delta(a)^2} \\ &\leq \sqrt{2 \min\{m, |A| - m\}} \sqrt{|A|} \sqrt{\frac{1}{|A|} \sum_{a \in A} \delta(a)^2}, \end{aligned} $$ where the last step leverages $\kappa(h^*, h^{**}) \leq 2 \min\{m, |A| - m\}$. Taking the expectation with respect to $D_x$ yields the result $$ \begin{aligned} r_{ad}(h_g) &\leq \sqrt{2 \min\{m, |A| - m\}} \sqrt{|A|}\ E_{x \sim D_x}\left[ \sqrt{\frac{1}{|A|} \sum_{a \in A} \delta(a)^2} \right] \\ &\leq \sqrt{2 \min\{m, |A| - m\}} \sqrt{|A|} \sqrt{\epsilon_{L^2}(g)}, \end{aligned} $$ where the last step is due to Jensen's inequality.
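As a quick numerical sanity check of the bound (not part of the derivation above), here is a sketch using an empirical distribution with deterministic costs, so that $f^* = f$ and $\delta = g - f$; all sizes and noise levels are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances, n_ads, m = 1000, 10, 3

f = -rng.random((n_instances, n_ads))          # nonpositive costs per (instance, ad)
g = f + 0.1 * rng.standard_normal(f.shape)     # noisy regressor estimates

chosen = np.argsort(g, axis=1)[:, :m]          # h_g: m lowest estimated costs
optimal = np.argsort(f, axis=1)[:, :m]         # h*: m lowest true costs

r_ad = np.mean(np.take_along_axis(f, chosen, axis=1).sum(axis=1)
               - np.take_along_axis(f, optimal, axis=1).sum(axis=1))
eps_L2 = np.mean((g - f) ** 2)                 # average per-(x, a) squared error
bound = np.sqrt(2 * min(m, n_ads - m)) * np.sqrt(n_ads) * np.sqrt(eps_L2)

print(f"empirical regret {r_ad:.4f} <= bound {bound:.4f}")
```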