Paper-reading log

The concept of "type" has been used without a consistent, precise definition in discussions about programming languages for 60 years. In this essay I explore various concepts lurking behind distinct uses of this word, highlighting two traditions in which the word came into use largely independently: engineering traditions on the one hand, and those of symbolic logic on the other. These traditions are founded on differing attitudes to the nature and purpose of abstraction, but their distinct uses of "type" have never been explicitly unified. One result is that discourse across these traditions often finds itself at cross purposes, such as overapplying one sense of "type" where another is appropriate, and occasionally proceeding to draw wrong conclusions. I illustrate this with examples from well-known and justly well-regarded literature, and argue that ongoing developments in both the theory and practice of programming make now a good time to resolve these problems.

Kell gets at something that bugs me periodically, but that I hadn't seen laid out explicitly: "type" in programming languages clearly has at least two quite different traditions. One comes out of programming language implementation and talks about types in the sense that a language like Java has typed variables (a variable can have type "int" or "Object" or "String" or whatever). The other comes out of symbolic logic and talks about typing disciplines, type judgments, etc. (more in the sense of "the typed Lambda Calculus" or "Hindley-Milner" or "Martin-Löf type theory"). As the paper notes, these two meanings of type haven't really been unified, though there is some slippage between the two. The author's sense (and mine) is that in the programming languages and especially compilers literature, the data-type meaning used to be predominant, but the version from logic has become predominant in certain parts of programming languages research more recently. Further reading found from reverse-citations to Kell's paper: S. Martini "Several Types of Types in Programming Languages" (HaPoC 2015).

2024-09-22

Repurposing diffusion-based image generators for monocular depth estimation

B. Ke, A. Obukhov, S. Huang, N. Metzger, R.C. Daudt, K. Schindler. Conf. Computer Vision and Pattern Recognition, 2024.

Tags: stable diffusion, computer vision

[doi] [pdf]

Abstract:

Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases.

A hot thing in computer vision at the moment (not my research area, but I’m a downstream user) is Stable Diffusion as a foundation model for monocular depth estimation (i.e. image → depth map, from a single image). This paper, on a system called Marigold, appears to have either originated that, or at least kicked off the current interest. An intriguing thing about this line of work is that it seems to have implications for the question of what SD "knows" about the images it generates. It must have at least some level of 3d scene understanding, which poses difficulty for the view that SD is just surface-level pastiche/averaging. Although Robert Zubeck pointed out to me: "On the other hand, recovering the value of a (hidden) depth buffer channel seems very on-track with recovering the values of visible channels?" True, it's not a big leap from what you can already infer from the visible channels (e.g. SD can render object overlaps reasonably coherently). Nonetheless, interesting and at least useful. I ran across this paper via a citation from "Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think" (arXiv:2409.11355, 2024) which claims to improve the paradigm by simplifying and speeding up training & inference considerably.

2024-07-30

Complex behavior from intrinsic motivation to occupy future action-state path space

J. Ramírez-Ruiz D. Grytskyy, C. Mastrogiuseppe, Y. Habib, R. Moreno-Bote. Nature Communications, 2024.

Tags: intrinsic motivation, reinforcement learning

[doi]

Abstract:

Most theories of behavior posit that agents tend to maximize some form of reward or utility. However, animals very often move with curiosity and seem to be motivated in a reward-free manner. Here we abandon the idea of reward maximization and propose that the goal of behavior is maximizing occupancy of future paths of actions and states. According to this maximum occupancy principle, rewards are the means to occupy path space, not the goal per se; goal-directedness simply emerges as rational ways of searching for resources so that movement, understood amply, never ends. We find that action-state path entropy is the only measure consistent with additivity and other intuitive properties of expected future action-state path occupancy. We provide analytical expressions that relate the optimal policy and state-value function and prove convergence of our value iteration algorithm. Using discrete and continuous state tasks, including a high-dimensional controller, we show that complex behaviors such as “dancing”, hide-and-seek, and a basic form of altruistic behavior naturally result from the intrinsic motivation to occupy path space. All in all, we present a theory of behavior that generates both variability and goal-directedness in the absence of reward maximization.

Intriguing addition to the literature on intrinsic agent motivation. Proposes that agents act to "maximiz[e] occupancy of future paths of actions and states", which they call the maximum occupancy principle (MOP). They connect it to two bodies of related work: 1) entropy-regularized reinforcement learning, which they distinguish on the basis that in that work, "extrinsic rewards still serve as the major drive of behavior, and arbitrary mixtures of action-state entropy are rarely considered", 2) "reward-free theories of behavior", such as novelty and compression, which they distinguish on the basis that those depend on agent history, while MOP doesn't. They do note that in the 2nd category, empowerment maximization and free-energy-principle based agents are closer to their setup, but argue that those "tend to collapse to deterministic policies with little behavioral variability", while "MOP results in lively and seemingly goal-directed behavior by taking behavioral variability and the constraints of embodied agents as principles".

2024-01-06

ReLOAD: Reinforcement learning with optimistic ascent-descent for last-iterate convergence in constrained MDPs

T. Moskovitz, B. O'Donoghue, V. Veeriah, S. Flennerhag, S. Singh, T. Zahavy. Intl. Conf. Machine Learning, 2023.

Tags: reinforcement learning

[pdf]

Abstract:

In recent years, reinforcement learning (RL) has been applied to real-world problems with increasing success. Such applications often require to put constraints on the agent's behavior. Existing algorithms for constrained RL (CRL) rely on gradient descent-ascent, but this approach comes with a caveat. While these algorithms are guaranteed to converge on average, they do not guarantee last-iterate convergence, i.e., the current policy of the agent may never converge to the optimal solution. In practice, it is often observed that the policy alternates between satisfying the constraints and maximizing the reward, rarely accomplishing both objectives simultaneously. Here, we address this problem by introducing Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD), a principled CRL method with guaranteed last-iterate convergence. We demonstrate its empirical effectiveness on a wide variety of CRL problems including discrete MDPs and continuous control. In the process we establish a benchmark of challenging CRL problems.

Figure 1 starts out with a nice (and funny) illustrative example of a failure case in constrained reinforcement learning. They train a bipedal walker that is supposed to maximize distance traveled (reward) while keeping the head below a specified threshold (constraint). The training converges on average (Fig. 1(a)), but in individual runs, the steady-state behavior is oscillation between maximizing reward and satisfying constraints, never both simultaneously (Fig. 1(b)). They formalize this issue in section 2.2 with two definitions: Average-Iterate Convergence and Last-Iterate Convergence. They argue the latter is really what you want, since you would like to be able to stop at some point, and recover a policy from the final iteration. They propose an algorithm, ReLOAD, with such a property.

2024-01-05

Pictorial syntax

K. J. Lande. Mind & Language, 2024.

Tags: cognitive science

[doi]

Abstract:

It is commonly assumed that images, whether in the world or in the head, do not have a privileged analysis into constituent parts. They are thought to lack the sort of syntactic structure necessary for representing complex contents and entering into sophisticated patterns of inference. I reject this assumption. "Image grammars" are models in computer vision that articulate systematic principles governing the form and content of images. These models are empirically credible and can be construed as literal grammars for images. Images can have rich syntactic structure, though of a markedly different form than sentences in language.

This paper aims to give a cognitive science interpretation of image grammars, arguing they establish the existence of syntactic structure for images. As I read the argument: image grammars exist and empirically work in various applications; they are in fact properly called grammars, as grammars for language are; the foregoing facts combined should be interpreted to show that images can have syntactic structure. This may seem obvious, but the article quotes widespread views that seem to say the opposite. "My conclusion here is modest: Pictures can have syntactic structure. Image grammars provide instructive models of what that syntax could look like."

2023-12-20

LoRA: Low-rank adaptation of large language models

E. Hu et al.. Intl. Conf. Learning Representations, 2022.

Tags: machine learning

[pdf]

Abstract:

An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by a factor of 10,000 and the GPU memory requirement by a factor of 3. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.

LoRA is now a widely used method for fine-tuning generative models, and this is the paper that proposed it (originally posted on arXiv in 2021). Fine-tuning uses relatively little data compared to the amount of data used in the original training (which they call pre-training) of large models like GPT-3. So it seems wasteful to have to modify the entire dense weight matrix, and in fact it's somewhat surprising that doing so would even work. A previous paper gives some insight into why fine-tuning on relatively little data works: Aghajanyan et al., "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning" (arXiv, 2020) demonstrate that fine-tuning effectively takes place in a weight-space of lower intrinsic dimensionality than the full weight space. This paper takes that insight and gives an efficiently implementable variation by replacing the concept of intrinsic dimensionality with matrix rank – one doesn't necessarily imply the other, but the authors of this paper hypothesize that in addition to low intrinsic dimensionality, "the change in weights during model adaptation also has a low 'intrinsic rank'". Instead of updates to the full matrix, they freeze the original weights, and update a rank-factorized representation of the difference ∆W = BA, where BA is a very low-rank factorization (as low as 1). It is initialized by setting A to Gaussian noise and B to all zeros, so the initial fine-tuning delta is zero. This low-rank representation of the fine-tuning delta can be stored separately from the original weights if desired, and added back to them when desired for fine-tuned inference. That is the other claimed advantage of this method: a different set of approaches to make fine-tuning more efficient than full weight updates instead adds new adaptation layers, which are much smaller than the original network, and trains just those. But that increases latency by adding new layers, while LoRA doesn't.

2023-07-04

Five stages of accepting constructive mathematics

A. Bauer. Bulletin of the AMS, 2017.

Tags: constructive mathematics

[doi]

Abstract:

On the odd day, a mathematician might wonder what constructive mathematics is all about. They may have heard arguments in favor of constructivism but are not at all convinced by them, and in any case they may care little about philosophy. A typical introductory text about constructivism spends a great deal of time explaining the principles and contains only trivial mathematics, while advanced constructive texts are impenetrable, like all unfamiliar mathematics. How then can a mathematician find out what constructive mathematics feels like? What new and relevant ideas does constructive mathematics have to offer, if any? I shall attempt to answer these questions.

A mostly lighthearted, readable introduction to constructive mathematics (although there are also proofs!), using the literary device of the five stages of processing trauma – denial, anger, bargaining, depression, acceptance. The person doing the processing is taken to be the classical mathematician who initially rejects constructive mathematics. Covers a grab-bag of things, such as the distinction between proof-by-contradiction and proof-of-negation; how modern constructive mathematics differs from historical forms like intuitionism and Russian constructivism; what a topos is; and the relation between constructive mathematics and computability. The first four sections are fairly easy to follow (excluding a few parts of section 3), while the fifth contains examples of constructive proofs that are a bit more dense.

2023-03-26

Understanding deep learning requires rethinking generalization

C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals. Intl. Conf. Learning Representations, 2017.

Tags: machine learning, generalization

[doi]

Abstract:

Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.

An influential paper showing, through a few simple experiments, that there must be something seriously missing in statistical learning theory as of 2017. The standard story about generalization in ML was that it's driven by avoiding overfitting through two methods: either restricting model capacity (number of parameters, VC dimension, Rademacher complexity, etc.), or regularization, or both. The paper shows both empirically and theoretically that neural networks don't have meaningfully restricted model capacity. They can achieve zero test error even when labels are totally randomized and/or image data is replaced with noise (i.e. they are forced to just memorize the training data); an accompanying theorem shows that neural networks of a certain size (just about always achieved in practice) can represent any function of a finite sample. The paper then goes on to show that regularization doesn't seem to be the story either: fairly good generalization happens even when various implicit regularization tricks (dropout, data augmentation, weight decay, etc.) are removed. A version of this paper with a short addendum was republished in CACM vol. 64, no. 3 (2021).

2022-06-12

The unreliability of naive introspection

E. Schwitzgebel. The Philosophical Review, 2008.

Tags: philosophy

[doi] [pdf]

Abstract:

We are prone to gross error, even in favorable circumstances of extended reflection, about our own ongoing conscious experience, our current phenomenology. Even in this apparently privileged domain, our self-knowledge is faulty and untrustworthy. We are not simply fallible at the margins but broadly inept. Examples highlighted in this essay include: emotional experience (for example, is it entirely bodily; does joy have a common, distinctive phenomenological core?), peripheral vision (how broad and stable is the region of visual clarity?), and the phenomenology of thought (does it have a distinctive phenomenology, beyond just imagery and feelings?). Cartesian skeptical scenarios undermine knowledge of ongoing conscious experience as well as knowledge of the outside world. Infallible judgments about ongoing mental states are simply banal cases of self-fulfillment. Philosophical foundationalism supposing that we infer an external world from secure knowledge of our own consciousness is almost exactly backward.

Some philosophers (and other thinkers) argue that if there's one thing we have reliable direct knowledge of, it's our own subjective experience of the world. We may not be able to reliably access an objective external world, but at least we know how we subjectively perceive it. Schwitzgebel doesn't buy that. He argues that if anything our introspective knowledge of our own experience is *less* reliable than our perception of the external world. He notes that this kind of skepticism about introspection is not controversial in all intellectual traditions; he sees his position as unusual within analytic philosophy (the tradition he works in), but points to "Eastern meditative traditions" and "the first era of scientific psychology and 'phenomenology' (circa 1900)" as two traditions in which such a view is/was common. I'm undecided whether I find his argument convincing, but it's intriguing.

2022-01-12

The evolution of operating systems

P. Brinch Hansen. Classic Operating Systems (Springer), 2001.

Tags: operating systems, history

[doi] [pdf]

Abstract:

The author looks back on the first half century of operating systems and selects his favorite papers on classic operating systems. These papers span the entire history of the field from the batch processing systems of the 1950s to the distributed systems of the 1990s. Each paper describes an operating system that combines significant ideas in an elegant way. Most of them were written by the pioneers who had the visions and the drive to make them work. The author summarizes each paper and concludes that operating systems are based on a surprisingly small number of ideas of permanent interest.

This is the introduction to an anthology of classic operating systems papers edited by Brinch Hansen (entitled Classic Operating Systems), and serves as a kind of survey of the history of the field through milestones he considers important. He summarizes each paper, listed in chronological order and roughly grouped, forming a nice tour through what he identifies as seven eras of operating system development: I. Open Shop, II. Batch Processing, III. Multiprogramming, IV. Timesharing, V. Concurrent Programming, VI. Personal Computing, and VII. Distributed Systems. Lots of good material packed into this 34-page book chapter.

2021-07-12

Is long horizon RL more difficult than short horizon RL?

R. Wang, S.S. Du, L.F. Yang, S.M. Kakade. Conf. Neural Information Processing Systems, 2020.

Tags: reinforcement learning

[pdf]

Abstract:

Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the number of episodes it takes to provably discover a policy whose value is ε near to that of the optimal value, where the value is measured by the normalized cumulative reward in each episode. In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there exists a sample complexity lower bound which exhibits a polynomial dependence on the horizon – a conjecture which is consistent with all known sample complexity upper bounds. This work refutes this conjecture, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon. In other words, when the values are appropriately normalized (to lie in the unit interval), this results shows that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense. Our analysis introduces two ideas: (i) the construction of an ε-net for optimal policies whose log-covering number scales only logarithmically with the planning horizon, and (ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given policy class using sample complexity that scales with the log-covering number of the given policy class. Both may be of independent interest.

This is a pretty theoretical paper, but it's interesting that their result is a negative answer to the question in the title. They show that learning a policy over longer horizons is not in principle much harder than over shorter horizons, at least not solely due to the horizon itself (vs. say the state space being larger). They consider the following setting: We have a finite-horizon, episodic MDP with horizon H. How many samples are needed to provably find an ε-optimal policy? A conjecture from COLT 2018 was that the number of samples needed should scale polynomially with H. This paper shows that in fact it is possible to scale only logarithmically. They do so by constructing an algorithm that is not very practical to actually use (it needs to instantiate every possible policy in a policy class), but the result is nonetheless surprising to me; and apparently to others, given that the COLT 2018 conjecture was the opposite. They end with some discussion of how this relates to other bounds. They note that their algorithm has worse dependence on the size of the action set and state space than some other work, which might mean there's in practice a tradeoff between scaling with horizon and other measures of problem size. But they conjecture this will not turn out to be the case, and that sample complexity will scale as Õ(|S| |A| poly(log H) / ε^2).

2021-06-21

Generative text using classical nondeterminism

I. Horswill. Workshop on Experimental AI in Games, 2020.

Tags: natural language generation, Prolog, grammars

[pdf]

Abstract:

Many recent generative text systems combine a context-free grammar base with some set of extensions such as tagging or inline JavaScript. We argue that the restriction to CFGs is unnecessary and that the standard pattern-directed, nondeterministic control structures common to Prolog, definite-clause grammars, and many HTNs, support a superset of these capabilities while still being simple to implement. We describe Step, a generative text system for Unity games based on higher-order HTNs that so far as we can determine from published descriptions, generalizes this previous work. We then describe syntactic extensions to make Step more natural as a text-authoring language. Finally, we discuss how Step and similar systems can be implemented very compactly in modern mainstream programming languages.

Horswill argues "that nondeterministic programming is the appropriate *internal* representation for CFG-like text generators, regardless of what designer-facing GUI or syntactic sugar is used to aid the authoring process", and demonstrates that with Step, a text-generation language that "manages to hide much of the Prolog-ishness of the underlying representation from the author". He presents a hierarchy of grammars from an NLG perspective: template substitution, context-free grammars, attribute grammars, definite-clause grammars, HTNs, and higher-order HTNs. He argues that context-free grammars are really too low on the hierarchy to be a good choice for many things, but were likely chosen because they are easy to implement, while, alas, the literature on implementing things like definite clause grammars is either on the one hand very low-level 1980s stuff, or on the other, very high-level modern stuff aimed at implementations in functional programming languages. Horswill aims to show that more powerful grammars can in fact be implemented straightforwardly in a mainstream language like C#. Very thought-provoking paper that I think does a good job opening up the question of the relationship between various grammar choices and generative systems.

2020-10-25

What you always wanted to know about Datalog (and never dared to ask)

S. Ceri, G. Gottlob, L. Tanca. IEEE Trans. Knowledge and Data Engineering, 1989.

Tags: logic programming, databases

[doi]

Abstract:

Datalog is a database query language based on the logic programming paradigm; it has been designed and intensively studied over the last five years. We present the syntax and semantics of Datalog and its use for querying a relational database. Then, we classify optimization methods for achieving efficient evaluations of Datalog queries, and present the most relevant methods. Finally, we discuss various enhancements of Datalog, currently under study, and indicate what is still needed in order to extend Datalog's applicability to the solution of real-life problems. The aim of this paper is to provide a survey of research performed on Datalog, also addressed to those members of the database community who are not too familiar with logic programming concepts.

A nice overview of the design and goals of Datalog. Syntactically, Datalog is a subset of Prolog, but its semantics differ considerably. Its principal design goal was to allow database-like usage. Prolog's semantics are closely tied to its evaluation strategy (SLDNF resolution, a type of backwards chaining), which in particular means answers to queries are computed one at a time. This is inefficient when the goal is to retrieve a potentially large number of results from a database, so Datalog is designed to allow set-oriented methods. In addition, Datalog is designed so that queries will always have a finite number of results, and can be computed via either forwards- or backwards-chaining (also called top-down and bottom-up). Set-oriented evaluation is discussed at the end of section II and in section III; section III also discusses how Datalog relates to relational algebra. One quirk of this paper due to its age is that at the time, "pure Datalog" was defined as not permitting negation at all. Most current Datalog systems allow some form of negation (e.g. stratified negation); the paper discusses that (section VI) as an extension termed "Datalog¬".

2020-09-23

The gambler's problem and beyond

B. Wang, S. Li, J. Li, S.O. Chan. Intl. Conf. Learning Representations, 2020.

Tags: reinforcement learning

[pdf]

Abstract:

We analyze the Gambler's problem, a simple reinforcement learning problem where the gambler has the chance to double or lose the bets until the target is reached. This is an early example introduced in the reinforcement learning textbook by Sutton and Barto (2018), where they mention an interesting pattern of the optimal value function with high-frequency components and repeating non-smooth points. It is however without further investigation. We provide the exact formula for the optimal value function for both the discrete and the continuous cases. Though simple as it might seem, the value function is pathological: fractal, self-similar, derivative taking either zero or infinity, and not written as elementary functions. It is in fact one of the generalized Cantor functions, where it holds a complexity that has been uncharted thus far. Our analyses could provide insights into improving value function approximation, gradient-based algorithms, and Q-learning, in real applications and implementations.

This paper exhaustively analyzes a simple-to-define MDP that Sutton and Barto introduced in their 1998 reinforcement learning textbook. The MDP, Example 4.3 (on p. 84 in the 2nd edition) encodes a simple gambling game, as follows: You start with some amount of money, and can bet it double-or-nothing, with a house edge (say 40% double, 60% nothing). You start with between $1 and $99, lose if you get to $0, and win if you get to $100. What should you bet at each amount? The optimal value function looks vaguely like you'd expect: more money is more valuable than less (although it isn't as smooth as one might expect). The optimal policy looks pretty bizarre though, e.g. bet your whole bankroll at $50, but bet only $1 at $51. I believe this MDP is introduced in the textbook as an example of a simple problem whose optimal policy is computationally easy to find with value iteration, but where the optimal policy is nonetheless so counterintuitive that almost nobody would have guessed it without actually computing. A following exercise (4.7 in the 1st ed, 4.8 in the 2nd ed) asks readers to explain why the optimal policy takes on such a strange form. Well, a complete answer to that textbook exercise resulted in this 23-page paper published at ICLR 2020.

2020-06-17

Answer set programming for non-stationary Markov decision processes

L.A. Ferreira, R.A.C. Bianchi, P.E. Santos, R. Lopez de Mantaras. Applied Intelligence, 2017.

Tags: logic programming, machine learning, reinforcement learning

[doi]

Abstract:

Non-stationary domains, where unforeseen changes happen, present a challenge for agents to find an optimal policy for a sequential decision making problem. This work investigates a solution to this problem that combines Markov Decision Processes (MDP) and Reinforcement Learning (RL) with Answer Set Programming (ASP) in a method we call ASP(RL). In this method, Answer Set Programming is used to find the possible trajectories of an MDP, from where Reinforcement Learning is applied to learn the optimal policy of the problem. Results show that ASP(RL) is capable of efficiently finding the optimal solution of an MDP representing non-stationary domains.

This paper uses ASP, with the action language BC+ as a high-level layer on top of it, as an elaboration-tolerant formalism for representing a nonstationary MDP, then extracts the reachable parts at any given instant as an MDP that can be solved using conventional RL techniques. The discussion of their translation from an MDP to BC+ to ASP is a useful reference. They also have a brief survey of other systems that combine ASP and RL in some form.

2020-06-07

Addressing the fundamental tension of PCGML with discriminative learning

I. Karth, A.M. Smith. Procedural Content Generation Workshop, 2019.

Tags: procedural content generation, machine learning, games

[doi] [pdf]

Abstract:

Procedural content generation via machine learning (PCGML) is typically framed as the task of fitting a generative model to full-scale examples of a desired content distribution. This approach presents a fundamental tension: the more design effort expended to produce detailed training examples for shaping a generator, the lower the return on investment from applying PCGML in the first place. In response, we propose the use of discriminative models, which capture the validity of a design rather the distribution of the content, trained on positive and negative example design fragments. Through a modest modification of WaveFunctionCollapse, a commercially-adopted PCG approach that we characterize as using elementary machine learning, we demonstrate a new mode of control for learning-based generators. We demonstrate how an artist might craft a focused set of additional positive and negative design fragments by critique of the generator's previous outputs. This interaction mode bridges PCGML with mixed-initiative design assistance tools by working with a machine to define a space of valid designs rather than just one new design.

Addresses the controllability issue in machine-learning based procedural content generation systems. Such systems have the problem that the only real way to control them in the basic formulation is via the dataset: you feed one in, and the system generates stuff similar to the dataset, using whatever data-driven generative method (Markov models, GANs, etc.) you've chosen. If you want to steer the results beyond that, such methods usually aren't great. You could try to steer indirectly by adding and removing data points from the original dataset, in hopes that the generative method that takes it as input will produce less stuff similar to the removed points and more stuff similar to the added points. This paper proposes doing that kind of intervention more directly with discriminative learning: you give it positive and negative examples of outputs, and the generative space is morphed so it avoids the negative examples and includes the positive examples. The way they do this is fairly specific to the WaveFunctionCollapse generative algorithm they use, though.

2020-06-05

From logicism to proceduralism (an autobiographical account)

G.S. Tseytin. Proc. Algorithms in Modern Mathematics and Computer Science, 1979.

Tags: philosophy, mathematics, artificial intelligence, logic

[doi]

Abstract:

This is a story of how I changed my views from the belief that good knowledge must always be represented as a set of logical statements, within a suitable mathematical model of reality, to my present opinion that knowledge is basically algorithmic.

An account from a mathematical logician — incidentally, the inventor of the "Tseytin transformation" for converting boolean formulas to CNF without exponential blowup — of his loss of faith in logicism and increasing interest in procedural representations of knowledge coming out of computer science and artificial intelligence. He discusses his 1960s work that was largely against that direction: attempts to avoid "algorithmization" through higher-level, logical representations that could be treated in a more standard mathematical way, and then automatically "algorithmized" when needed. He gives a few reasons for his gradual change of views: weakness of the concept of "mathematical world", slow progress in automated theorem proving and declarative programming, and his practical experience in developing applied programming languages. It's interesting to read this with the benefit of knowing, in hindsight, that those procedural knowledge representations arguably didn't fare too well either.

2020-06-04

Languages for computer music

R.B. Dannenberg. Frontiers in Digital Humanities, 2018.

Tags: computer music, programming languages

[doi]

Abstract:

Specialized languages for computer music have long been an important area of research in this community. Computer music languages have enabled composers who are not software engineers to nevertheless use computers effectively. While powerful general-purpose programming languages can be used for music tasks, experience has shown that time plays a special role in music computation, and languages that embrace musical time are especially expressive for many musical tasks. Time is expressed in procedural languages through schedulers and abstractions of beats, duration and tempo. Functional languages have been extended with temporal semantics, and object-oriented languages are often used to model stream-based computation of audio. This article considers models of computation that are especially important for music programming, how these models are supported in programming languages, and how this leads to expressive and efficient programs. Concrete examples are drawn from some of the most widely used music programming languages.

A survey of domain-specific languages for computer music, from someone who's designed a few himself. Includes two useful sections up front summarizing "why music is different" (i.e. why are DSLs needed instead of just using a general-purpose language?), and "a good music programming language should be a good language" (i.e. why are DSLs needed instead of a more restricted music notation that isn't a full programming language?). Then getting into specifics, one of the main focuses is how different languages model time and scheduling, along with their allegiences to various programming paradigms (e.g. functional, object-oriented, or dataflow). The paper then ends with overviews of a few major languages: the Music N family, Max/MSP and Pd, SuperCollider, ChucK, and Faust. Dannenberg's own language Nyquist is also discussed as an example in various places. Concludes with a brief discussion of how different language paradigms fit well or less well with challenges in computer music.

2020-06-02

Grounding and solving in answer set programming

B. Kaufmann, N. Leone, S. Perri, T. Schaub. AI Magazine, 2016.

Tags: logic programming

[doi]

Abstract:

Answer set programming is a declarative problem solving paradigm that rests upon a workflow involving modeling, grounding, and solving. While the former is described by Gebser and Schaub (2016), we focus here on key issues in grounding, or how to systematically replace object variables by ground terms in a effective way, and solving, or how to compute the answer sets of a propositional logic program obtained by grounding.

An overview of the grounding and solving processes employed by the Potassco and DLV systems, as of 2016. More detailed and theoretical than the systems' manuals in some respects, but more readable by a nonspecialist than the authors' technical papers on the subject. The grounding part gives an idea of how these systems' grounders optimize compared to a naive instantiation of all ground terms, and how they deal with function symbols, which could make a program ungroundable/infinite. The short of it on the latter point is: there are classes of programs that are provably groundable to finite representations, and previous versions of the Potassco team's grounder (gringo) handled only those, but now both the Potassco and DLV folks use a semi-decidable procedure that can handle any finitely ground program (for more on this, see M. Alviano et al., Function Symbols in ASP, in the Gelfond Festschrift, 2011). The solving part gives a readable and concise overview of current state-of-the-art architectures, which use conflict-directed search adapted in part from the SAT-solving community, but with ASP-specific methods for identifying nogoods.

2020-05-30

Answer set; programming?

P. Cabalar. Logic Programming, Knowledge Representation, and Nonmonotonic Reasoning (Gelfond Festschrift), 2011.

Tags: programming languages, logic programming

[doi] [pdf]

Abstract:

Motivated by a discussion maintained by Michael Gelfond and other researchers, this short essay contains some thoughts and refections about the following question: should ASP be considered a programming language?

A set of reflections spurred by a discussion that took place at Commonsense 2007, where Gelfond "defend[ed] the idea that ASP was exclusively a logical knowledge representation language, *not* a programming language". One objection: ASP cannot really implement algorithms in a general way, but only solve specifications – and finite ones at that, not generalized to arbitrary n like many algorithms can be. Another: time is in a sense "virtual" in ASP, as everything happens simultaneously in the solving step. There is an example of a Turing machine encoding in Prolog, another traditional way of showing that something is a programming language in the sense of a Turing-complete programming language. Can the same be done in ASP? Cabalar argues no, traditionally, although first-order ASP extended with function symbols might change the picture – albeit not the "virtual time" part of it, where Prolog really *executes* the Turing machine in a way even this extended ASP arguably doesn't. Cabalar concludes: "ASP is not a programming paradigm in a strict sense, but a formal specification language instead. Still, we find that the term Answer Set *programming* can be more vaguely used to refer to the craft (and perhaps in the future, to the methodologies) for developing an efficient and elaboration tolerant formal representation of a given problem".

2020-02-19

A novelty detection approach to classification

N. Japkowicz, C. Myers, M. Gluck. Intl. Joint Conf. Artificial Intelligence, 1995.

Tags: novelty detection, autoencoder, machine learning

[pdf]

Abstract:

Novelty Detection techniques are concept-learning methods that proceed by recognizing positive instances of a concept rather than differentiating between its positive and negative instances. Novelty Detection approaches consequently require very few, if any, negative training instances. This paper presents a particular Novelty Detection approach to classification that uses a Redundancy Compression and Non-Redundancy Differentiation technique based on the [Gluck & Myers, 1993] model of the hippocampus, a part of the brain critically involved in learning and memory. In particular, this approach consists of training an autoencoder to reconstruct positive input instances at the output layer and then using this autoencoder to recognize novel instances. Classification is possible, after training, because positive instances are expected to be reconstructed accurately while negative instances are not. The purpose of this paper is to compare HIPPO, the system that implements this technique, to C4.5 and feedforward neural network classification on several applications.

A fairly simple but clever way of using a non-linear (neural network) autoencoder as a one-class classifier for the purpose of novelty detection. The one-class classification framing of novelty detection assumes that you have a bunch of examples of normal, non-novel behavior, and you want to learn a classifier from these examples of just the one class. What this paper does is start by training an autoencoder to reconstruct examples of the class. The basic idea is that the autoencoder should learn common structural regularities in this class, since it's effectively forced to learn a compressed representation, in order to reconstruct examples through a bottleneck. But what's key is that it should *not* be very good at reconstructing things that don't share the same structural regularities as the examples it was trained on. So reconstruction error is then used as the measure of novelty: If the autoencoder reconstructs an example badly, it's novel.

2020-01-06

On links: Exercises in style

S. Mason, M. Bernstein. ACM Hypertext Conf., 2019.

Tags: hypertext, poetics

[doi]

Abstract:

Links are the most important new punctuation mark since the invention of the comma, but it has been years since the last in-depth discussions of link poetics. Taking inspiration Raymond Queneau's Exercices De Style, we explore the poetics of contemporary link usage by offering exercises in which the same piece of text is divided and linked in different ways. We present three different exercises—varying the division of a text into lexia, varying links among lexia, and varying links within lexia—while pointing toward potential aesthetic considerations of each variation. Our exercises are intended descriptively, not prescriptively, as a conversational starting point for analysis and as a compendium of useful techniques upon which artists might build.

I like the bold claim in the first sentence! The paper starts by noting that there was more discussion of the poetics of hypertext links prior to the web era, with quite little recently. They suggest a few possible reasons: the richer link models of pre-web hypertext, the rise and fall of poststructuralist literary criticism in the meantime, and a feeling for a while that hypertext might be a mere stepping stone on the way to graphical videogames and VR as the next big new-media literary forms. Returning to the task, they explore three types of variation in links that authors might consider. For the first, how to divide lexia, they give examples of dividing: by sentences/paragraphs, by dramatic beats, along a protagonist's thoughtline, along an antagonist's thoughtline, as mid-sentence transitions, along a path of intensification, along a path of accelerating time, into a "pointilism" of small lexia, and following locative forces. For the second, how to link among lexia, they give four roles for links: timeshift, recursus, renewal, and annotation. And they further analyze these by the links' position within sentences, their position within lexia, the parts of speech that are linked, the presence of an implied dialog between link and destination, whether they are explicitly diegetic, whether they are integral or supplemental to the text, their role as annotations, and their form as calligraphic, sculptural, or automatic. Finally, they look at some ways that links are marked/communicated within lexia: by cycling options, as click-to-advance, as disappearing links, as timed links, as appearing links, as shimmer text, and as stretchtext.

2019-10-30

State aggregation in Monte Carlo tree search

J. Hostetler, A. Fern, T. Dietterich. AAAI Conf. on Artificial Intelligence, 2014.

Tags: artificial intelligence

[pdf]

Abstract:

Monte Carlo tree search (MCTS) algorithms are a popular approach to online decision-making in Markov decision processes (MDPs). These algorithms can, however, perform poorly in MDPs with high stochastic branching factors. In this paper, we study state aggregation as a way of reducing stochastic branching in tree search. Prior work has studied formal properties of MDP state aggregation in the context of dynamic programming and reinforcement learning, but little attention has been paid to state aggregation in MCTS. Our main result is a performance loss bound for a class of value function-based state aggregation criteria in expectimax search trees. We also consider how to construct MCTS algorithms that operate in the abstract state space but require a simulator of the ground dynamics only. We find that trajectory sampling algorithms like UCT can be adapted easily, but that sparse sampling algorithms present difficulties. As a proof of concept, we experimentally confirm that state aggregation can improve the finite-sample performance of UCT.

A fairly formal paper introducing notation for thinking about how to aggregate states in expectimax search in general, and MCTS in particular. They introduce a concept of consistent abstraction, and prove a theorem showing that certain kinds of consistent abstraction preserve optimality when searched over by MCTS, in the sense that searching in the abstracted tree produces the optimal result for the ground tree. The overall formalism is built directly on one introduced for general MDPs by Li, Walsh, and Littman ("Towards a unified theory of state abstraction for MDPs", 2006). They also outline an algorithm for performing MCTS with UCT in the abstracted space while having only a simulator of the ground (un-abstracted) MDP. In their notation, an abstraction is a pair <χ, μ>, where χ is a state-abstraction/state-partitioning function and μ is a weighting function for each partition. The algorithm they derive for performing MCTS/UCT with only a simulator of the ground MDP works without needing μ to be constructed explicitly. However, it still requires the state-abstraction function χ as explicit input. Their theorem specifies conditions under which a particular choice of χ retains optimal MCTS peformance, but doesn't provide a method for finding such a function unless you have the optimal policy. They do have a small empirical experiment with blackjack, where they compute the optimal policy and find a suitable χ from that. They additionally show that, at least in this case, a synthetically "noisy" χ can still provide significant speedup, which maybe suggests ways around the circular dependence on solving the MDP first.

2019-09-10

Objections to Bayesian statistics

A. Gelman. Bayesian Analysis, 2008.

Tags: statistics, Bayesian statistics

[doi]

Abstract:

Bayesian inference is one of the more controversial approaches to statistics. The fundamental objections to Bayesian methods are twofold: on one hand, Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience. The second objection to Bayes comes from the opposite direction and addresses the subjective strand of Bayesian inference. This article presents a series of objections to Bayesian inference, written in the voice of a hypothetical anti-Bayesian statistician. The article is intended to elicit elaborations and extensions of these and other arguments from non-Bayesians and responses from Bayesians who might have different perspectives on these issues.

A short summary of some of the objections to Bayesian statistics that Gelman, himself a Bayesian, thinks to be strongest. They are a mix of things: worries that Bayesianism tends to disregard careful empirical design and problem-specific analysis in favor of general, automatic methods; dislike of computational methods like MCMC that may or may not converge; and the debate around subjective priors. The paper isn't a survey (it cites only seven papers, while the literature on Bayesianism and its critics is huge), but a useful concise summary of some common points. This serves as the lead discussion-provoking article in a special journal section, with replies by J.M Bernardo, J.B. Kadane, S. Senn, and L. Wasserman, followed by a rejoinder from Gelman, in which he explains why he doesn't himself believe any of his objections are fatal.

2019-08-28

A collection of definitions of intelligence

S. Legg, M. Hutter. Frontiers in Artificial Intelligence and Applications, 2007.

Tags: artificial intelligence

[pdf]

Abstract:

This paper is a survey of a large number of informal definitions of "intelligence" that the authors have collected over the years. Naturally, compiling a complete list would be impossible as many definitions of intelligence are buried deep inside articles and books. Nevertheless, the 70-odd definitions presented here are, to the authors' knowledge, the largest and most well referenced collection there is.

It is what it says: a collection of definitions of intelligence. Mostly just direct quotes, sorted into three categories: collective definitions (18 of them, mostly from dictionaries and encyclopedias), psychologist definitions (35), and AI researcher definitions (18). From these definitions, the authors extract three properties they judge to hold in most of them: intelligence is a property of an individual agent interacting with environments; intelligence is related to an agent's ability to succeed relative to some goal; and intelligence requires the agent being able to adapt to environments and goals. This leads to their own proposed definition, centered around generality: "Intelligence measures an agent's ability to achieve goals in a wide range of environments". The authors elaborate their own definition into a mathematical theory elsewhere, in the paper "A formal measure of machine intelligence". I'm intrigued by their proposal, but somehow I find these 71 collected definitions not as interesting as I thought I would; hard to put my finger on what's missing.

2019-04-16

Integer multiplication in time O(n log n)

D. Harvey, J. van der Hoeven. HAL preprint, 2019.

Tags: algorithms, computational complexity

[pdf]

Abstract:

We present an algorithm that computes the product of two n-bit integers in O(n log n) bit operations.

A new asymptotically fastest algorithm for integer multiplication, hitting a bound that's also conjectured to be a lower bound (but that isn't proven). Following the proof is rather above my level of mathematical knowledge. But section 1.1 has a nice moderately technical overview of what they see as the four main ways to get an asymptotically fast integer multiplication algorithm using an FFT, pulling out what the key ideas are in each and how they influence the present algorithm. In an interview with Quanta Magazine (which is where I ran across this paper), van der Hoeven puts it more colorfully: "We use [the fast Fourier transform] in a much more violent way, use it several times instead of a single time, and replace even more multiplications with additions and subtractions".

2019-04-15

MinCaml: A simple and efficient compiler for a minimal functional language

E. Sumii. Workshop on Functional and Declarative Programming in Education, 2005.

Tags: programming languages, functional programming, compilers, CS education

[doi] [pdf]

Abstract:

We present a simple compiler, consisting of only 2000 lines of ML, for a strict, impure, monomorphic, and higher-order functional language. Although this language is minimal, our compiler generates as fast code as standard compilers like Objective Caml and GCC for several applications including ray tracing, written in the optimal style of each language implementation. Our primary purpose is education at undergraduate level to convince students—as well as average programmers—that functional languages are simple and efficient.

I ran across this while looking for a small example compiler to show my programming languages class to make some of the concepts we've covered at a high level more concrete. In particular, a compiler that somewhat cleanly implements all the main phases, from lexing and parsing through to an IR, register allocation, and code generation. This one does do that. It is not *quite* minimalist enough for what I was hoping to use it for, but it's the closest I've seen, and I think if I modify it to output some nicer visualizations of the passes it might work well for that next year. I particularly like that the top-level Main module has a part where it runs the textbook compiler passes as just literally a cascade of function calls. Most of the passes themselves are fairly clean too, tree-to-tree transforms in the usual ML pattern-matching-on-ADTs style. The paper has a concise pass-by-pass description of the strategy taken for each and why.

2019-02-04

Entropic GANs meet VAEs: A statistical approach to compute sample likelihoods in GANs

Y. Balaji, H. Hassani, R. Chellappa, S. Feizi. arXiv preprint, 2018.

Tags: machine learning

[pdf]

Abstract:

Building on the success of deep learning, two modern approaches to learn a probability model of the observed data are Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs). VAEs consider an explicit probability model for the data and compute a generative distribution by maximizing a variational lower-bound on the log-likelihood function. GANs, however, compute a generative model by minimizing a distance between observed and generated probability distributions without considering an explicit model for the observed data. The lack of having explicit probability models in GANs prohibits computation of sample likelihoods in their frameworks and limits their use in statistical inference problems. In this work, we show that an optimal transport GAN with the entropy regularization can be viewed as a generative model that maximizes a lower-bound on average sample likelihoods, an approach that VAEs are based on. In particular, our proof constructs an explicit probability model for GANs that can be used to compute likelihood statistics within GAN's framework. Our numerical results on several datasets demonstrate consistent trends with the proposed theory.

Soheil Feizi presented this work today at the American University CS colloquium, so this writeup is based more on the talk than the paper (but the paper content seems similar). GANs and VAEs are both generative models, but VAEs build an explicit probability model, while GANs don't. This means GANs can't give you things like the likelihood that a given query artifact came from the space the GAN is (implicitly) modeling. On the other hand, GANs allow for higher-quality sampling from the model (in the sense that sampling produces high-quality individual artifacts), which is why all the high-profile demos of things like "deep fakes", plus various kinds of generative art and music in recent years, use GANs. This paper proposes that the best of both worlds is possible! The one-sentence summary of why: If you take the Wasserstein GAN reformulation of GANs and add a specific entropic regularization term, it can be made essentially a kind of VAE (they show), which then gives you likelihoods and other nice things without losing GAN-level output quality.

2018-11-19

Learning and using models

T. Hester, P. Stone. Reinforcement Learning: State of the Art, 2012.

Tags: machine learning, reinforcement learning

[doi] [pdf]

Abstract:

As opposed to model-free RL methods, which learn directly from experience in the domain, model-based methods learn a model of the transition and reward functions of the domain on-line and plan a policy using this model. Once the method has learned an accurate model, it can plan an optimal policy on this model without any further experience in the world. Therefore, when model-based methods are able to learn a good model quickly, they frequently have improved sample efficiency over model-free methods, which must continue taking actions in the world for values to propagate back to previous states. Another advantage of model-based methods is that they can use their models to plan multi-step exploration trajectories. In particular, many methods drive the agent to explore where there is uncertainty in the model, so as to learn the model as fast as possible. In this chapter, we survey some of the types of models used in model-based methods and ways of learning them, as well as methods for planning on these models. [...]

A survey of methods for acquiring models from experience in RL environments where the model isn't given up front. This boils down to learning an approximation of an MDP, which then allows you to do the kind of planning that you can do in domains where you were given an MDP, such as computing value functions and policies, or doing MCTS search. Even for just doing RL, these methods generally promise lower sample complexity. Besides learning a model which you can do normal model-based things with, some of the algorithms interleave learning the models and using them in more specific ways suitable to the model/approximation. The are brief summaries of the main algorithms in this area, such as DYNA, DYNA-2, R-MAX, TEXPLORE, etc., with a big list of them in Table 2. (The authors of this survey are the authors of TEXPLORE.) A particularly interesting section not always covered elsewhere is Section 6 on algorithms that take advantage of factored domains, which can make model learning scale up more tractably, at least on domains that factor nicely. Section 7 has a discussion of exploration strategies, which are pretty key as well, and which shade into things discussed in other bodies of literature, like curiosity and intrinsically motivated RL.

2018-10-17

Kolmogorov: Life and creative activities

A.N. Shiryaev. Annals of Probability, 1989.

Tags: history, mathematics, information theory

[doi]

Excerpt:

Topics in the theory of trigonometric series, theory of measure and sets, studies in the theory of integration, approximation theory, constructive logic, topology, theory of superposition of functions and Hilbert's 13th problem, topics in classical mechanics, ergodic theory, theory of turbulence, diffusion and patterns (models) in the dynamics of populations, papers on the foundations of probability theory, limit theorems, theory of stochastic (Markov, stationary, branching, ...) processes, mathematical statistics, theory of algorithms, information theory... —even this is hardly a complete list of all the branches of science in which Andrei Nikolaevich obtained fundamentally important results, which determined the state of many fields of 20th century mathematics and possible directions for their development.

Very informative 79-page biopic article on Andrei Kolmogorov, a Soviet mathematician probably best known in computer science for his work on information theory, especially in pioneering algorithmic information theory. The article intermixes biographical materials with a summary of Kolmogorov's major mathematical results. The author (Shiryaev) was one of Kolmogorov's students, and quite a degree of admiration is noticeable throughout, on a personal and not only intellectual level. I won't attempt to summarize the contents here; there are many, many interesting pieces to pick out. Maybe I'll write a blog post with some of the highlights. The article is followed by a comprehensive bibliography listing Kolmogorov's 518 publications.

2018-09-25

Can citation indexing be automated?

E. Garfield. Symp. Statistical Association Methods for Mechanized Documentation, 1964.

Tags: artificial intelligence, library science, natural language processing

[pdf]

Abstract:

The main characteristics of conventional language-oriented indexing systems are itemized and compared to the characteristics of citation indexes. The advantages and disadvantages arc discussed in relation to the capability of the computer automatically to simulate human critical proccaws reflected in the act of citation. It is shown that a considerable standardization of document presentations will be necessary and probably not achievable for many years if we are to achieve automatic referencing. On the other hand, many citations, now fortuitously or otherwise omitted, might be supplied by computer analyses of text.

Garfield started the now-very-influential Science Citation Index (which invented the "impact factor") earlier in 1964. A few months later, he presented this paper asking how far automated indexing could go. Could it go beyond mechanical counting of citations, and intelligently analyze the relationships between papers? Could it even make explicit manual citations unnecessary? His answer to the titular question is: fully automatic referencing would be difficult, but "many citations, now fortuitously or otherwise omitted, might be supplied by computer analyses of text". I can't help but lament that tools for *even that* don't really exist now, 54 years later. A side bit of this paper is that he lists 15 different reasons a paper might include a citation, transcribed here for reference: 1) Paying homage to pioneers, 2) Giving credit for related work (homage to peers), 3) Identifying methodology, equipment, etc., 4) Providing background reading, 5) Correcting one's own work, 6) Correcting the work of others, 7) Criticizing previous work, 8) Substantiating claims, 9) Alerting to forthcoming work, 10) Providing leads to poorly disseminated, poorly indexed, or uncited work, 11) Authenticating data and classes of fact—physical constants, etc., 12) Identifying original publications in which an idea or concept was discussed, 13) Identifying original publication or other work describing an eponymic concept or term as, e.g., Hodgkin's disease, Pareto's Law, Friedel-Crafts Reaction, etc., 14) Disclaiming work or ideas of others (negative claims), 15) Disputing priority of others (negative homage).

2018-09-24

Automatic differentiation in machine learning: A survey

A.G. Baydin, B.A. Pearlmutter, A.A. Radul, J.M. Siskind. J. Machine Learning Research, 2018.

Tags: machine learning

[pdf]

Abstract:

Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD), also called algorithmic differentiation or simply "autodiff", is a family of techniques similar to but more general than backpropagation for efficiently and accurately evaluating derivatives of numeric functions expressed as computer programs. AD is a small but established field with applications in areas including computational fluid dynamics, atmospheric sciences, and engineering design optimization. Until very recently, the fields of machine learning and AD have largely been unaware of each other and, in some cases, have independently discovered each other's results. Despite its relevance, general-purpose AD has been missing from the machine learning toolbox, a situation slowly changing with its ongoing adoption under the names "dynamic computational graphs" and "differentiable programming". We survey the intersection of AD and machine learning, cover applications where AD has direct relevance, and address the main implementation techniques. By precisely defining the main differentiation techniques and their interrelationships, we aim to bring clarity to the usage of the terms "autodiff", "automatic differentiation", and "symbolic differentiation" as these are encountered more and more in machine learning settings.

A good overview of what automatic differentiation is, and how it is and might be used in machine learning. Sections 1-3 are essentially an AD tutorial. Section 2 covers how it is equivalent to neither numerical approximation nor to symbolic computation of derivatives, and explains the differences. Figure 2 is a nice summary of these relationships. Section 3 gives a more detailed dive into AD's two "main modes": forward and reverse mode. This part also explains why backpropagation is more or less reverse-mode AD. Section 4 goes into the relationship with gradient-based ML methods in much more detail, and gives AD interpretations of recent ML terms like "computational graph" and "differentiable programming". Section 5 then covers the main ways of implementing it: elemental libraries, preprocessors, and operator overloading. Table 5 provides a big list of such impementations.

2018-09-09

Mission control: A history of the urban dashboard

S. Mattern. Places J., 2015.

Tags: human-computer interaction, smart cities

[doi]

Excerpt:

We know what rocket science looks like in the movies: a windowless bunker filled with blinking consoles, swivel chairs, and shirt-sleeved men in headsets nonchalantly relaying updates from "Houston" to outer space. Lately, that vision of Mission Control has taken over City Hall. NASA meets Copacabana, proclaimed the New York Times, hailing Rio de Janeiro's Operations Center as a "potentially lucrative experiment that could shape the future of cities around the world." The Times photographed an IBM executive in front of a seemingly endless wall of screens integrating data from 30 city agencies, including transit video, rainfall patterns, crime statistics, car accidents, power failures, and more. Futuristic control rooms have proliferated in dozens of global cities.

A look at the urban dashboards becoming popular as city-planning / city-monitoring tools, such as London's "City Dashboard", Rio de Janeiro"s "Operations Center", and Baltimore's "CitiStat", through a historical look at where they came from. Mattern discusses their relationship to the history of automobile dashboards and airplane cockpits, which developed the initial concept of a centralized electronic control center. The concept as adopted by urban dashboards she anchors in NASA's Mission Control Center, whose images filtered through pop culture. Specific points of reference in the story from there include Chile's Project Cybersyn (an attempt at computerized socialist planning which featured a deliberately futuristic-looking central control room), the ubiquitous Bloomberg Terminals in finance, and internal dashboards at large corporations that became popular with the rise of data science. She also traces a separate connection to an analog attempt to give macroscale overview of a city in one place, Patrick Geddes's 1890s use of a tower that could give a bird's eye view of Edinburgh, juxtaposed with a camera obscura on the same site that could illustrate more specific details.

2018-09-07

Multi-agent diverse generative adversarial networks

A. Ghosh, V. Kulharia, V. Namboodiri, P.H.S. Torr, P.K. Dokania. Computer Vision and Pattern Recognition Conf., 2018.

Tags: machine learning, multiagent systems

[pdf]

Abstract:

We propose MAD-GAN, an intuitive generalization to the Generative Adversarial Networks (GANs) and its conditional variants to address the well known problem of mode collapse. First, MAD-GAN is a multi-agent GAN architecture incorporating multiple generators and one discriminator. Second, to enforce that different generators capture diverse high probability modes, the discriminator of MAD-GAN is designed such that along with finding the real and fake samples, it is also required to identify the generator that generated the given fake sample. Intuitively, to succeed in this task, the discriminator must learn to push different generators towards different identifiable modes. We perform extensive experiments on synthetic and real datasets and compare MAD-GAN with different variants of GAN. We show high quality diverse sample generations for challenging tasks such as image-to-image translation and face generation. In addition, we also show that MAD-GAN is able to disentangle different modalities when trained using highly challenging diverse-class dataset (e.g. dataset with images of forests, icebergs, and bedrooms). In the end, we show its efficacy on the unsupervised feature representation task.

A population-based approach to forcing GANs to explore more of a generative space, by having multiple generators instead of one (though still the one discriminator). The discriminator is responsible for pushing these generators into different parts of the space, lest they just end up as redundant copies of the exact same thing. This is done by requiring the discriminator to discriminate which generator produced a given output, in addition to its usual real/fake discrimination duties, giving it incentive to push the different generators into different modes. (There's an experiment in the paper with an unmodified discriminator, MA-GAN, that shows redundant generators do appear otherwise.) They have some experiments on synthetic datasets under the assumption that the GAN's goal should be to match a true underlying distribution; these use the two metrics of KL-divergence from the synthetic distribution, and the number of its modes discovered. One question, posed by our visiting PhD student Simo Linkola, and perhaps a bit philosophical but I think not irrelevant: Is this really a multiagent system? The multiple generators are really not doing much; almost all the work here is being done by the modified discriminator that orchestrates them.

2018-09-05

Finding meaning in abstract games: A deep reading of Sage Solitaire

M. Treanor. Joint Conf. DiGRA/FDG, 2016.

Tags: games

[pdf]

Abstract:

This paper presents a methodology for discovering and explaining how games with very few thematic assets (or abstract games) are meaningful to players through rules and dynamics. Through the process of implementing play strategies as computer code, and then running simulations of the game being played, insights about how a player might think about and experience playing the game are revealed. These insights are compiled into interpretations of the themes and meanings that can be found in the abstract game. The paper then applies the methodology to perform a deep reading of the single player digital card game Sage Solitaire.

How might someone find meaning in abstract games with few representational elements, such as the mobile game Sage Solitaire? Treanor argues such games are deeply meaningful to many people, and not simply because they're trifling pasttimes or traditions. He attempts to answer that question by looking in detail at how players might encounter the things that such representation-lite games do have, their rules and dynamics. Rules can often simply be read in a manual or tutorial, so Treanor argues the real key here is how players interact with and make meaning out of abstract games' dynamics. He approaches that investigation using a "code as theory" method that builds and inspects little AI bots playing the game, whose goal (unlike most game-playing bots) is to "use computational techniques to inform a humanistic understanding of how players experience games", rather than necessarily to maximize winning. These models are aimed at understanding how a player would interact with the dynamics of the game by simulating certain strategies and noting the resulting dynamics. Therefore they tend to be whitebox models, "what if the player always did X?", rather than something like a machine-learned game-playing bot.

2018-07-02

On embeddings as alternative paradigm for relational learning

S. Dumančić, A. García-Durán, M. Niepert. Statistical Relational AI Workshop, 2018.

Tags: machine learning, inductive logic programming

[pdf]

Abstract:

Many real-world domains can be expressed as graphs and, more generally, as multi-relational knowledge graphs. Though reasoning and learning with knowledge graphs has traditionally been addressed by symbolic approaches, recent methods in (deep) representation learning has shown promising results for specialized tasks such as knowledge base completion. These approaches abandon the traditional symbolic paradigm by replacing symbols with vectors in Euclidean space. With few exceptions, symbolic and distributional approaches are explored in different communities and little is known about their respective strengths and weaknesses. In this work, we compare representation learning and relational learning on various relational classification and clustering tasks and analyse the complexity of the rules used implicitly by these approaches. Preliminary results reveal possible indicators that could help in choosing one approach over the other for particular knowledge graphs.

A paper from the statistical relational learning (SRL) community taking stock of what the recent success of representation-learning methods means for them, i.e. the current pros and cons of the two approaches. Specifically, SRL here means inductive logic programming (ILP), while representation-learning means approaches (mainly neural-network ones) that learn to embed structured data into a vector space, along the lines of the various [x]2vec techniques. This paper focuses on benchmark inference tasks on large structured knowledge graphs such as WebKB. It finds preliminarily that the recent embedding methods are competitive or state-of-the-art in graph completion, but lag behind ILP methods in other inference tasks. Due to the difficulty of running comprehensive benchmarks that measure the right thing, the specific results here are probably less important (as the paper admits) than the work towards bringing together these two approaches so they can be viewed as two different tools in a toolbox and compared on particular problems, while so far they've been developed and used mainly by different communities.

2018-06-07

The abstract organism: Towards a prehistory for a-life art

M. Whitelaw. Leonardo, 2001.

Tags: alife, electronic art, art history

[doi]

Abstract:

The author examines historical precedents for contemporary art practice using artificial life, in particular in the work of Paul Klee and Kasimir Malevich. Similarities are identified between artificial life and the philosophical tradition of organicism; specific examples from Klee and Malevich indicate that those artists were engaged in a form of creative organicist thought that imagined the realization of living structures in artificial media.

An interesting "prehistory". One can always find precursors to artistic movements, but this one seems like it might have some depth (only touched on in the brief article). By way of motivation, the paper argues that although a-life historically grew out of computing (in large part artificial intelligence) and views itself as future-oriented, this gives it a kind of ahistoricism that disconnects it more than warranted from ideas in the history of art. The specific point of connection is built around Langton's view that a-life is about the organization of matter in an artificial medium, which is then related to Klee and Malevich's interest in formal internal organization and dynamics. Klee is read as "exploring a kind of artificial physics for the picture plane — or more, an artificial cosmogeny", due to building a formal pictoral language abstracted from dynamics in nature. Evocative, but maybe a bit of a stretch? The connection to Malevich seems much more solid, reading Suprematism's interest in building a new "non-objective" reality from the internal dynamics of form as related to a-life's goals. Malevich's rhetoric around "technical organisms" definitely sounds a bit a-lifey, although it comes in his writings that are pretty sci-fi (the "Suprematist machine" will leave earth, "not by means of engines... but through the smooth harnessing of form").

2018-05-30

Measuring the intrinsic dimension of objective landscapes

C. Li, H. Farkhoor, R. Liu, J. Yosinski. Intl. Conf. Learning Representations, 2018.

Tags: machine learning

[pdf]

Abstract:

Many recently trained neural networks employ large numbers of parameters to achieve good performance. One may intuitively use the number of parameters required as a rough gauge of the difficulty of a problem. But how accurate are such notions? How many parameters are really needed? In this paper we attempt to answer this question by training networks not in their native parameter space, but instead in a smaller, randomly oriented subspace. We slowly increase the dimension of this subspace, note at which dimension solutions first appear, and define this to be the intrinsic dimension of the objective landscape. The approach is simple to implement, computationally tractable, and produces several suggestive conclusions. Many problems have smaller intrinsic dimensions than one might suspect, and the intrinsic dimension for a given dataset varies little across a family of models with vastly different sizes. This latter result has the profound implication that once a parameter space is large enough to solve a problem, extra parameters serve directly to increase the dimensionality of the solution manifold. Intrinsic dimension allows some quantitative comparison of problem difficulty across supervised, reinforcement, and other types of learning where we conclude, for example, that solving the inverted pendulum problem is 100 times easier than classifying digits from MNIST, and playing Atari Pong from pixels is about as hard as classifying CIFAR-10. In addition to providing new cartography of the objective landscapes wandered by parameterized models, the method is a simple technique for constructively obtaining an upper bound on the minimum description length of a solution. A byproduct of this construction is a simple approach for compressing networks, in some cases by more than 100 times.

An intriguing way to quantify complexity of solutions that's designed to avoid some of the problems of raw parameter counting, which has longstanding usage for regularization but seems to not necessarily work well as a metric when applied to neural networks with tons of parameters. Although they refer to this as "difficulty" of problems sometimes, it's very tied up with a specific learning algorithm, so is really difficulty of a particular problem *for* a particular neural-network architecture and training algorithm. They acknowledge this in the paper, though, and that learner-specificity has pros and cons. The method is pretty simple, if a bit computationally intensive: restrict neural nets to learning in random subspaces of their usual parameter spaces, but otherwise use the same learning algorithm they would normally use. For example, starting with a 1-dimensional subspace is equivalent to doing line search in random directions. Increase the dimension of this subspace until the network can successfully learn a solution; call that dimension at which a solution appears the intrinsic dimension of the problem (for this learner). As they note, there are many possible applications of such a measure.

2018-05-24

A trans-Canada computer communications network

O.M. Solandt et al. Science Council of Canada Report, 1971.

Tags: networking, politics

[pdf]

Excerpt:

In the sector of service to users, there is one Major Project which would dramatically improve the quality and quantity of computer services available to all Canadians and would greatly assist Canadian industry. It is the organization of a nationwide system of computer communications networks. To achieve this goal a positive, vigorous Canadian policy is urgently needed. A "laissez-faire" attitude will eventually result in the supply of most computing and information services via spur lines from U.S. computer commu nications networks. Such an outcome is completely unacceptable on economic and social grounds. [...] The system of networks should not be allowed to practise "cream-skimming" by concentrating exclusively on the densely populated, highly profitable regions of Canada. It must link all important centres in Canada in order to bring computing and information services to the greatest possible number of Canadians.

An early report (via Daniel Joseph) worried about the impact on Canada of the then rapidly developing proto-internet networks, recommending a strong Canadian investment in a national network lest major Canadian cities simply become spokes on a US-controlled network with little scientific, political, or economic role in the network's development. Mostly interesting to me because 1971 seems like a fairly early date for a number of the topics discussed; ArpaNet had just been formed in 1969, but its implications were apparently already clear to Canadian scientists. Among other things: they worry that given Canada's geography, economic relationship to the U.S., and the status of ArpaNet as a military/private/university partnership of exclusively American institutions, Canadian universities and companies will be cut out of a significant role in modern networking; that "the Canadian computer service industry is vulnerable to the dumping of surplus computer capacity from abroad"; that all data between Canadian cities would end up routed through the U.S.; and that future uses of the network for things such as hospital records and telemedicine (!) would be hampered without nationwide Canadian reach. Also some specific numbers and diagrams in support of these claims/recommendations.

2018-05-08

Vagueness, rationality and undecidability: A theory of why there is vagueness

M.A. Changizi. Synthese, 1999.

Tags: natural language processing, formal languages, logic

[doi] [pdf]

Excerpt:

Vagueness is not undecidability, but undecidability does enter into an explanation of why there is vagueness. My theory, called the Undecidability Theory of Vagueness, explains vagueness largely as a result of the fact that we are computationally bound. Vagueness is not due to any particularly human weakness, but due to a weakness that any computationally bound agent possesses; even HAL from 2001: A Space Odyssey will probably experience vagueness. Furthermore, I will argue that vagueness is good for you. I will do so by showing that if you were a computationally bound, rational alien agent given the task of figuring out what our natural language predicates mean, you would very probably end up with vagueness. That is, unless two highly plausible hypotheses are false, it would be rational for you—i.e., in your best interest—to choose concepts in such a way that they are vague. Given that you are computationally bound, avoiding vagueness brings in greater costs than accepting it.

This paper argues that vagueness in natural language can be explained purely as a property of formal systems. That's intriguing because vagueness is often taken as evidence against formal models being a good representation of language, given the seeming mismatch between the precise semantics of logic and the vagueness of real human languages. Thus, alternatives are proposed, such as: humans must use some kind of more "subsymbolic" or "fuzzy" modes of computation, or perhaps the computational model of language/cognition is fundamentally wrong. Natural language and human cognition aren't my area, but the rest of this argument is interesting regardless of whether it empirically explains human language or not. It argues that vagueness is a natural property of formal systems that ought to arise given three conditions: 1) the system is no more powerful than the Church–Turing thesis allows; 2) predicate meaning is extensionally determined by execution of an algorithm; and 3) such algorithms can be chosen from the space of all possible algorithms. The paper further argues that this situation is in a sense in an agent's "best interest", specifically that an agent meeting condition (1) can only avoid vagueness by giving up either (2) or (3), which would be a worse cure than the disease of vagueness. There are some intermediate concepts and technical results developed to get to this conclusion that I haven't yet read closely enough to follow.

2018-05-02

Practical reinforcement learning in continuous spaces

W.D. Smart, L.P. Kaelbling. Intl. Conf. Machine Learning, 2000.

Tags: machine learning, reinforcement learning

[pdf]

Abstract:

Dynamic control tasks are good candidates for the application of reinforcement learning techniques. However, many of these tasks inherently have continuous state or action variables. This can cause problems for traditional reinforcement learning algorithms which assume discrete states and actions. In this paper, we introduce an algorithm that safely approximates the value function for continuous state control tasks, and that learns quickly from a small amount of data. We give experimental results using this algorithm to learn policies for both a simulated task and also for a real robot, operating in an unaltered environment. The algorithm works well in a traditional learning setting, and demonstrates extremely good learning when bootstrapped with a small amount of human-provided data.

I was prompted to revisit earlier work on function approximators in RL by a recent blog post, "Deep Reinforcement Learning Doesn't Work Yet", which discusses instability of deep RL. Earlier work is much more suspicious of the soundness of sticking function approximators into RL algorithms in the first place. Section 2 of this paper has a nice summary of some of the reasons that "using value function approximation with Q-learning is not simply a case of substituting in a supervised learning algorithm for the Q-value lookup table". Specifically, the Q-values are iteratively estimated, so feedback effects cause errors to compound, unlike in supervised learning (this seems like it's probably part of the explanation for the extreme run-to-run instability seen in deep RL). The paper ties this to the view that function approximators can really only interpolate, not extrapolate, but may end up doing "hidden extrapolation" when put into more complex settings like approximating inside an RL loop. And, because of the recursive nature, post-hoc fixes like cross-validating to estimate error in different regions don't work either. Since we nonetheless need function approximators to do RL in large state spaces, the paper proposes an algorithm, Hedger, designed to address all that and allow "safe" Q-value approximation. It would be interesting to know whether any insights from the algorithm are relevant to more recent deep-RL style architectures.

2018-04-29

Detection of a cosmic ray with measured energy well beyond the expected spectral cutoff due to cosmic microwave radiation

D.J. Bird et al. Astrophysical J., 1995.

Tags: physics

[pdf]

Abstract:

We report the detection of a 51-joule (320 +/- 90 EeV) cosmic ray by the Fly's Eye air shower detector in Utah. This is substantially greater than the energy of any previously reported cosmic ray. A Greisen-Zatsepin-Kuz'min cutoff of the energy spectrum (due to pion photoproduction energy losses) should occur below this energy unless the highest energy cosmic rays have traveled less than about 30 Mpc. The error box for the arrival direction in galactic coordinates is centered on b=9.6 deg, l=163.4 deg. The particle cascade reached a maximum size near a depth of 815 g/cm^2 in the atmosphere, a depth which does not uniquely identify the type of primary particle.

The writeup of 1991's "oh-my-god particle", an unusually high-energy cosmic ray detected by a facility in Utah. Obviously this is not my research area at all. I'd heard of it before, and someone linked the paper on Twitter, so I read it. It's surprisingly easy to follow, considering that my physics education consists of two undergrad courses and some Wikipedia reading. The gist of why this is interesting, besides just that it had unusually high energy, is that cosmic background radiation won't allow extremely high energy particles to pass unattenuated. The threshold is known as the GZK limit, said in the paper to be somewhere below 100 EeV, apparently now known to be around 50 EeV. The observed particle was around 320 EeV, so must have originated nearby (barring the GZK limit being wrong); the paper estimates within 30 Mpc. That, plus an approximate origin direction, gives some idea for where to look for the object that produced the particle. The paper also includes a discussion of possible sources of error, alternative hypotheses, etc., since this kind of detector is not a completely straightforward device. That discussion might make for an interesting class discussion paper in the right context: it is just nontrivial enough to be interesting, but I think can be followed by an advanced undergraduate who has had an introduction to error propagation and data interpretation.

2018-04-28

Six mathematical gems from the history of distance geometry

L. Liberti, C. Lavor. Intl. Trans. Operational Research, 2016.

Tags: mathematics, distance geometry, history

[doi]

Abstract:

This is a partial account of the fascinating history of distance geometry. We make no claim to completeness, but we do promise a dazzling display of beautiful, elementary mathematics. We prove Heron's formula, Cauchy's theorem on the rigidity of polyhedra, Cayley's generalization of Heron's formula to higher dimensions, Menger's characterization of abstract semimetric spaces, a result of Gödel on metric spaces on the sphere, and Schoenberg's equivalence of distance and positive semidefinite matrices, which is at the basis of multidimensional scaling.

Delivers what it promises: "a partial account of the fascinating history of distance geometry". A lively read full of historical anecdotes and asides (but also some proofs). Ran across this paper because I might have some use for distance geometry and was trying to get up to speed on some of its tools/results. It's not really an overview in that respect, but does have readable summaries of a few major results. The aside in Figure 1 that recounts an anecdote from when Kurt Gödel first presented his incompleteness theorem is also funny, although not about distance geometry ("That is very interesting. You should publish that." commented an audience member after a long pause).

2018-04-10

Theorizing affordances: From request to refuse

J.L. Davis, J.B. Chouinard. Bulletin of Science, Technology & Society, 2016.

Tags: human-computer interaction

[doi]

Abstract:

As a concept, affordance is integral to scholarly analysis across multiple fields—including media studies, science and technology studies, communication studies, ecological psychology, and design studies among others. Critics, however, rightly point to the following shortcomings: definitional confusion, a false binary in which artifacts either afford or do not, and failure to account for diverse subject-artifact relations. Addressing these critiques, this article demarcates the mechanisms of affordance—as artifacts request, demand, allow, encourage, discourage, and refuse—which take shape through interrelated conditions: perception, dexterity, and cultural and institutional legitimacy. Together, the mechanisms and conditions constitute a dynamic and structurally situated model that addresses how artifacts afford, for whom and under what circumstances.

Briefly traces the concept of "affordance", which originated in ecological psychology, was popularized in design studies, and has become common in fields such as HCI and STS. Notes that the concept is sometimes attacked as being rather baggy, as it takes on a range of slippery meanings often left vague enough that commenters have wondered whether it clarifies anything. This paper proposes to sharpen the concept by further analyzing the general ability of things to "afford" into six specific mechanisms through which they do so, plus three contextual aspects accounting for how affordances depend on "for whom and under what circumstances" a thing is affording. The six mechanisms: a thing can request, demand, allow, encourage, discourage, and refuse. The three contextual aspects: perception, dexterity, and cultural/institutional legitimacy.

2018-04-05

How to do research at the MIT AI Lab

D. Chapman (ed.). MIT AI Lab Working Paper, 1988.

Tags: artificial intelligence, research

[pdf]

Abstract:

This document presumptuously purports to explain how to do research. We give heuristics that may be useful in picking up the specific skills needed for research (reading, writing, programming) and for understanding and enjoying the process itself (methodology, topic and advisor selection, and emotional factors).

A collectively written document from '80s MIT AI Lab grad students, including some now pretty well-known folks (Agre, Brooks, Forbus, Kaelbling, etc.). Some bits are MIT-specific or dated, e.g. how to get yourself plugged into the then-important network of hardcopy preprint circulation. But some of it is still pretty relevant and worth a read. A good portion of the practical advice about how to give feedback, choose advisors, choose problems, etc. holds up pretty well. Some advice is maybe in fact better than a good portion of more recent things you'll find. The document has a generally refreshing tone promoting thoughtfulness, interdisciplinarity, reading/studying. A nice antidote to narrower, less reflective views of AI research. Lots of great quotable sentences too.

2018-03-30

Extremely randomized trees

P. Geurts, D. Ernst, L. Wehenkel. Machine Learning, 2006.

Tags: machine learning

[doi]

Abstract:

This paper proposes a new tree-based ensemble method for supervised classification and regression problems. It essentially consists of randomizing strongly both attribute and cut-point choice while splitting a tree node. In the extreme case, it builds totally randomized trees whose structures are independent of the output values of the learning sample. The strength of the randomization can be tuned to problem specifics by the appropriate choice of a parameter. We evaluate the robustness of the default choice of this parameter, and we also provide insight on how to adjust it in particular situations. Besides accuracy, the main strength of the resulting algorithm is computational efficiency. A bias/variance analysis of the Extra-Trees algorithm is also provided as well as a geometrical and a kernel characterization of the models induced.

The popular machine-learning method of random forests is built on the insight that classic decision trees have high variance, which can be reduced by introducing randomization plus averaging. L. Breiman's proposal of bagged decisions trees, T. Ho's proposal of the random subspace method, and L. Breiman's 2001 proposal of random forests, are aimed at this goal, randomizing the construction of decision trees and then averaging the randomized trees, with the randomization/averaging process reducing variance. The core argument of this paper is: maybe more randomization? Specifically, that more randomization might be both better and faster. Random forests retain a key aspect of classical decision trees, choosing optimal splits on each candidate feature. The optimization portion is computationally expensive, and the cut-point ends up only indirectly randomized by the bootstrap resampling. Why not use the original data deterministically without bootstrapping, but explicitly randomize the cutpoints? The Extra-Trees algorithm does this. One parameter, K, chooses the subset of features to consider. The cut-point on each of the K features is chosen randomly, and then the best of the K splits is chosen. In the limit where K=1, this is a "totally random tree" that doesn't use the labels as training data at all, and falls into the category of memory-based "lazy learner" algorithms such as kNN.

2018-03-28

Perspectives of approximate dynamic programming

W.B. Powell. Annals of Operations Research, 2016.

Tags: machine learning, reinforcement learning

[doi] [pdf]

Abstract:

Approximate dynamic programming has evolved, initially independently, within operations research, computer science and the engineering controls community, all searching for practical tools for solving sequential stochastic optimization problems. More so than other communities, operations research continued to develop the theory behind the basic model introduced by Bellman with discrete states and actions, even while authors as early as Bellman himself recognized its limits due to the "curse of dimensionality" inherent in discrete state spaces. In response to these limitations, subcommunities in computer science, control theory and operations research have developed a variety of methods for solving different classes of stochastic, dynamic optimization problems, creating the appearance of a jungle of competing approaches. In this article, we show that there is actually a common theme to these strategies, and underpinning the entire field remains the fundamental algorithmic strategies of value and policy iteration that were first introduced in the 1950's and 60's.

Attempts to draw connections and bring some order to disparate work on decision-making under uncertainty in several different fields. Whether Powell's proposal of "approximate dynamic programming" as *the* unifying framework for decision-making under uncertainty is a good one, I don't know. But I found the paper useful in trying to map out the work done in this area across operations research, control theory, and computer science. My educational background has been entirely from the computer-science perspective on this (i.e., reinforcement learning), so it's useful to get an overview of relevant stuff that's been done in those two other fields. The paper has a nice set of historical references too, especially for the line of research that most closely follows on from Bellman.

2018-03-27

Datafun: A functional Datalog

M. Arntzenius, N.R. Krishnaswami. Intl. Conf. Functional Programming, 2016.

Tags: programming languages, functional programming, logic programming

[doi] [pdf]

Abstract:

Datalog may be considered either an unusually powerful query language or a carefully limited logic programming language. Datalog is declarative, expressive, and optimizable, and has been applied successfully in a wide variety of problem domains. However, most use-cases require extending Datalog in an application-specific manner. In this paper we define Datafun, an analogue of Datalog supporting higher-order functional programming. The key idea is to track monotonicity with types.

Datafun seems interesting to try sometime. The parts of this paper that develop a bunch of PLs theory to define Datafun aren't really my field, but the paper also has a pretty good critical design overview of Datalog. Even though it's proposing a new language intended to improve on Datalog, the paper does that by first taking Datalog seriously on its own terms, and does a good job explaining what Datalog is trying to do, and what its strengths and weaknesses are in doing it. The first sentence of the abstract makes a nice pithy summary of Datalog too.

2018-03-26

Finding alternative musical scales

J.N. Hooker. Intl. Conf. Constraint Programming, 2016.

Tags: computer music, constraint solving

[doi] [pdf]

Abstract:

We search for alternative musical scales that share the main advantages of classical scales: pitch frequencies that bear simple ratios to each other, and multiple keys based on an underlying chromatic scale with tempered tuning. We conduct the search by formulating a constraint satisfaction problem that is well suited for solution by constraint programming. We find that certain 11-note scales on a 19-note chromatic stand out as superior to all others. These scales enjoy harmonic and structural possibilities that go significantly beyond what is available in classical scales and therefore provide a possible medium for innovative musical composition.

Are there alternative music scales that share key formal properties of the classical ones? A systematic search, framed as a constraint-programming problem, answers yes. Refreshingly, although this is a formal/mathematical paper by an operations-research person and published in a constraint-programming conference, it's aware of and builds on previous work of this kind done by experimental composers (discussed in Sections 2-3).

2018-02-26

Seagull: A bird's-eye view of the evolution of technical games research

T.-H.D. Nguyen, E. Melcer, A. Canossa, K. Isbister, M.S. El-Nasr. Entertainment Computing, 2018.

Tags: scientometrics, games

[doi]

Abstract:

Within Entertainment Computing, games research has grown to be its own area, with numerous publication venues dedicated to it. As this area evolves, it is fruitful to examine its overall development—which subcommunities and research interests were present from the start, which have come and gone, and which are currently active—to better understand the research community as a whole and where it may proceed. In this paper, we present a data-driven analysis and interactive visualization tool to shed light on how technical domains within the games research field have evolved from 2000 to 2013, based on publication data from over 8000 articles collected from 48 games research venues, including Entertainment Computing, FDG, AIIDE, and DiGRA. The approach we present is descriptive. We first used data mining algorithms to group related papers into clusters of similar research topics and evolve these clusters over time. We then designed an interactive visualization system, named Seagull, comprised of Sankey diagrams that allow us to interactively visualize and examine the transition and coalescing of different clusters across time. We present our descriptive analysis in this paper and also contribute the visualization interface to allow other researchers to examine the data and develop their own analysis.

A big data-mining study on publications in games research. They quantitatively identify 20 major clusters in the field based on keyword co-occurrence: among others, "game design", "virtual worlds", "narrative", "artificial intelligence", and "human-computer interaction". Some plots tracing how these clusters grow and shrink in different venues over time.

2018-02-14

Propp's morphology of the folk tale as a grammar for generation

P. Gervás. Computational Models of Narrative Workshop, 2013.

Tags: story generation

[pdf]

Abstract:

The semi-formal analysis of Russian folk tales carried out by Vladimir Propp has often been used as theoretical background for the automated generation of stories. Its rigour and its exhaustive description of the constituent elements of Russian folk tales, and the enumeration of the patterns they follow, have acted as inspiration for several story generation systems, both sequential and interactive. Yet most of these efforts have attempted to generalize Propp's account to types of stories beyond the corpus that it arose from. In the process, a number of the valuable intuitions present in the original work are lost. The present paper revisits Propp's morphology to build a system that generates instances of Russian folk tales. Propp's view of the folk tale as a rigid sequence of character functions is phrased as a grammar that can be employed as a plot driver. Unification is used to incrementally build a conceptual representation of discourse by adding to an ongoing draft story actions that instantiate the character functions. Story actions are defined by pre and post conditions on the state of the plot to account for the causal relations crucial to narrative. The potential of the resulting system for providing a generic story generation system is discussed and possible lines of future work are discussed.

A paper computationally revisiting Propp's formalist analysis of Russian folk tales as a generative model. Gervás points out that, although Propp's model is well known and often referenced in story generation, when used as a basis for systems it's usually used at a level that's much simpler than Propp's own analysis was. Section 2 has a survey of previous story generators' use of Proppian models, finding them not all that Proppian. Section 3 then proposes a new generative implementation argued to be more faithful to Propp's analysis. [Added 2018-04-21]: The survey in Section 2 does miss Sheldon Klein's 1970s implementation of a fairly detailed Proppian story generator (admittedly somewhat obscure, and only brought to my attention by James Ryan). Would be interesting to compare Gervás's and Klein's implementations/interpretations of Propp.

2017-11-16

Wordless games: Gameplay as narrative technique

Y.T. Sim, A. Mitchell. Intl. Conf. Interactive Digital Storytelling, 2017.

Tags: interactive narrative, games

[doi]

Abstract:

In this paper, we look at how gameplay can be used to tell stories in a game without the help of words. Through close readings of three wordless games with a strong narrative focus, Journey, Brothers: A Tale of Two Sons and A Bird Story, we explore how gameplay within wordless games can help to convey a narrative. We have identified four techniques by which gameplay is used for storytelling: gameplay as enacting narrative, manipulating player controls for narrative effect, gameplay for exploring narrative setting, and gameplay as time progression. We discuss these techniques in relation to existing concepts of player experience, and suggest ways gameplay can help to circumvent issues of ambiguity in wordless narrative in games.

The category of strongly narrative-driven games that have no dialog or other words is obvious in retrospect, but I hadn't thought of framing it that way. It's more specific than environmental storytelling. Both the identification of this category and pulling out four narrative techniques that it uses are useful contributions of this paper.

2017-10-27

On the notion of interestingness in automated mathematical discovery

S. Colton, A. Bundy, T. Walsh. Intl. J. Human-Computer Studies, 2000.

Tags: discovery systems, mathematics

[doi] [pdf]

Abstract:

We survey five mathematical discovery programs by looking in detail at the discovery processes they illustrate and the success they had. We focus on how they estimate the interestingness of concepts and conjectures and extract some common notions about interestingness in automated mathematical discovery. We detail how empirical evidence is used to give plausibility to conjectures, and the different ways in which a result can be thought of as novel. We also look at the ways in which the programs assess how surprising and complex a conjecture statement is, and the different ways in which the applicability of a concept or conjecture is used. Finally, we note how a user can set tasks for the program to achieve and how this affects the calculation of interestingness. We conclude with some hints on the use of interestingness measures for future developers of discovery programs in mathematics.

A critical overview of how five systems—AM (Lenat), GT (Epstein), Graffiti (Fajtlowicz), Bagai et al's system, and HR (Colton et al)—conceive of "interestingness" when going about trying to discover "interesting" mathematical concepts or conjectures. Also a useful concise overview of what these five systems in general are doing, with some summary tables listing the mathematical domains they work in, how they represent concepts, etc.

2017-10-13

Why most decisions are easy in Tetris—and perhaps in other sequential decision problems, as well

Ö. Simsek, S. Algorta, A. Kothiyal. Intl. Conf. Machine Learning, 2016.

Tags: reinforcement learning, games

[pdf]

Abstract:

We examined the sequence of decision problems that are encountered in the game of Tetris and found that most of the problems are easy in the following sense: One can choose well among the available actions without knowing an evaluation function that scores well in the game. This is a consequence of three conditions that are prevalent in the game: simple dominance, cumulative dominance, and noncompensation. These conditions can be exploited to develop faster and more effective learning algorithms. In addition, they allow certain types of domain knowledge to be incorporated with ease into a learning algorithm. Among the sequential decision problems we encounter, it is unlikely that Tetris is unique or rare in having these properties.

Demonstrates three specific reasons that Tetris is much easier than a fully general RL problem: regularities they call simple dominance, cumulative dominance, and noncompensation. Speculates that such problems are in fact quite common, and proposes that learning algorithms could be more efficient if they exploited such properties. How widely this applies to other games is a bit speculative at this stage. Besides Tetris, they have a preliminary discussion of Backgammon in Section 6.1 ("Is Tetris special?"). Some interesting research avenues suggest themselves.

2017-09-12

The ordinal nature of emotions

G.N. Yannakakis, R. Cowie, C. Busso. Intl. Conf. Affective Computing & Intelligent Interaction, 2017.

Tags: affective computing

[doi] [pdf]

Abstract:

Representing computationally everyday emotional states is a challenging task and, arguably, one of the most fundamental for affective computing. Standard practice in emotion annotation is to ask humans to assign an absolute value of intensity to each emotional behavior they observe. Psychological theories and evidence from multiple disciplines including neuroscience, economics and artificial intelligence, however, suggest that the task of assigning reference-based (relative) values to subjective notions is better aligned with the underlying representations than assigning absolute values. Evidence also shows that we use reference points, or else anchors, against which we evaluate values such as the emotional state of a stimulus; suggesting again that ordinal labels are a more suitable way to represent emotions. This paper draws together the theoretical reasons to favor relative over absolute labels for representing and annotating emotion, reviewing the literature across several disciplines. We go on to discuss good and bad practices of treating ordinal and other forms of annotation data, and make the case for preference learning methods as the appropriate approach for treating ordinal labels. We finally discuss the advantages of relative annotation with respect to both reliability and validity through a number of case studies in affective computing, and address common objections to the use of ordinal data. Overall, the thesis that emotions are by nature relative is supported by both theoretical arguments and evidence, and opens new horizons for the way emotions are viewed, represented and analyzed computationally.

Sometime around 2011 or 2012, back when I lived in Copenhagen and Georgios lived in Malmö, I visited him there, and on the walk to his place from Triangeln station, heard his rather exasperated-sounding explanation of why everyone (or almost everyone) in computer science is doing emotion processing completely wrong. He simply could not understand why people treated emotions as absolute numerical quantities in the face of all evidence to the contrary, with more or less the only thing in favor of that approach being that it was computationally convenient (maybe reason enough for some). This paper is sort of the culmination in print of that view, although he has published elsewhere some subsidiary aspects of his views on the question (e.g. "Don’t classify ratings of affect; Rank them!", IEEE Trans. Affective Computing, 2014). This paper covers the question from a number of angles, from theoretical views on emotion to technical questions of statistical analysis, and would take more than a note of this length to do it justice, so I'll leave it at that for now.

2017-08-26

Grimes' fairy tales: A 1960s story generator

J. Ryan. Intl. Conf. Interactive Digital Storytelling, 2017.

Tags: story generation, history

[doi] [pdf]

Abstract:

We provide the first extensive account of an unknown story generator that was developed by linguist Joseph E. Grimes in the early 1960s. A pioneering system, it was the first to take a grammar-based approach and the first to operationalize Propp’s famous model. This is the opening paper in a series that will aim to reformulate the prevailing history of story generation in light of new findings we have made pertaining to several forgotten early projects. Our study here has been made possible by personal communication with the system's creator, Grimes, and excavation of three obscure contemporaneous sources. While the accepted knowledge in our field is that the earliest story generator was Sheldon Klein’s automatic novel writer, first reported in 1971, we show that Grimes’s system and two others preceded it. In doing this, we reveal a new earliest known system. With this paper, and follow-ups to it that are in progress, we aim to provide a new account of the area of story generation that lends our community insight as to where it came from and where it should go next. We hope others will join us in this mission.

Outcome of a historical rabbit hole James Ryan has been excavating. I believe this was originally going to be a paper about Klein's systems, but then Ryan found this almost entirely unknown system by Grimes, dug up the surviving material on it, and managed to get in contact with Grimes to get some additional information. Especially interesting that Grimes, an anthropologist, built the system as a means to an end for field work, not as an end in itself. For most other systems, built by AI and computational creativity researchers, "AI that can write stories" is the end goal. That also explains why Grimes didn't bother to publish much on it. The system is based on implementing Propp's formalist folktale morphology in a multilingual generative system, and then using the stories for field work on indigenous languages in Mexico. Brings it closer in a sense to Propp's own work as folklorist. Alas, he found the generated stories were too boring.

2017-08-25

Some natural solutions to the p-value communication problem—and why they won't work

A. Gelman, J. Carlin. J. American Statistical Association, 2017.

Tags: statistics

[doi]

Excerpt:

It is well known that even experienced scientists routinely misinterpret p-values in all sorts of ways, including confusion of statistical and practical significance, treating non-rejection as acceptance of the null hypothesis, and interpreting the p-value as some sort of replication probability or as the posterior probability that the null hypothesis is true. [...] At this point it would be natural for statisticians to think that this is a problem of education and communication. If we could just add a few more paragraphs to the relevant sections of our textbooks, and persuade applied practitioners to consult more with statisticians, then all would be well, or so goes this logic. [...] The problem does not seem just to be technical misunderstandings; rather, statistical analysis is being asked to do something that it simply can’t do, to bring out a signal from any data, no matter how noisy. We suspect that, to make progress in pedagogy, statisticians will have to give up some of the claims we have implicitly been making about the effectiveness of our methods.

If you like your takes on the replication crisis pessimistic, here's a paper title for you. They go through five specific proposed solutions and argue for why each won't work: 1) Listen to the statisticians, or clarity in exposition, 2) Confidence intervals instead of hypothesis tests, 3) Bayesian interpretation of one-sided p-values, 4) Focusing on "practical significance" instead of "statistical significance", and 5) Bayes factors. What they do propose is a more fundamental rethink of what statistical methods can and can't reliably tell you, which they admit is a harder sell.

2017-07-27

Twenty things to do with a computer

S. Papert, C. Solomon. MIT AI Lab Memo, 1971.

Tags: creative coding, computers in education

[pdf]

Excerpt:

When people talk about computers in education they do not all have the same image in mind. Some think of using the computer to program the kid; others think of using the kid to program the computer. [...] In the real world computers are used in many different ways. Some are programmed to fly airplanes; not to *tell* a human pilot what to do, but to pull levers with their own electro-mechanical effectors... [...] Why then should computers in schools be confined to computing the sum of the squares of the first twenty odd numbers and similar so-called "problem-solving" uses? Why not use them to produce some action? [...] But our purpose here is not to complain of what other people have not done, but to tell of some of the exciting things you can do with the computer you have now or with the one you will be incited to get by the pages that follow.

The opening quip about using the computer to program the kid vs. using the kid to program the computer is funny and still apt. Besides that, the paper does what it says: explain 20 educational things you can do on a computer, in this case using Logo. Mostly turtle graphics examples, plus some simple generative music, a poetry generator, etc. The thing that most struck me about it is how early 1971 sounds as a date for this. Had this paper been written in the middle of the home-computer revolution I wouldn't have been surprised, but I didn't realize that, at least in some Logo-using classrooms, kids were writing generative poetry in 1971! The paper ends with a brief exhortation to administrators that computers are more affordable than they think. It quotes a cost of "$30 per student per year, for one hour per student per week of terminal time" and extends an offer to help interested educators make their case to funders.

2017-07-20

Computer of a thousand faces: Anthropomorphizations of the computer in design (1965-1975)

T. Vardouli. Dosya, 2012.

Tags: computational design, human-computer interaction

[pdf]

Excerpt:

[L]ooking back to the first encounters of computers and design, one discovers a rich legacy of speculation on the implications of this merging. Inquiry into the early computational era (1965-1975) can therefore expose (part of) the cultural and historical origins of the popular but loosely defined term "computational design." [...] The intense impulse to situate the new entity of the computer in the traditional, empirical processes of design lead to assignments of anthropomorphic roles to the machine, such as the "clerk," the "partner," the "accountant," and others. These different "occupations" were eloquent metaphors denoting different approaches to the ways that the innate characteristics of the computer could be reconciled with the elusive characteristics of design, as well as to the new relationship of the machine as a design actor with the designer-author. The main body of this paper places these metaphors in conversation, thus revealing different models of computation, as well as different processes of design. The purpose of this short survey is to bring forth computational "role models" which survive until today, assert them as historical and cultural artifacts, and present their conceptual counterpoints, re-opening them for discussion.

Surveys anthropomorphic terms used in early work on computational design to describe various roles a computer could take in the design process. The five roles identified are: The Clerk, wherein "the computer would liberate the designers from the tedious, quantitative tasks involved in design, thus allowing them to channel their energy towards the truly creative parts of the design process"; The Partner, "a model which surpassed the rigid division of labor in the design process and called for a partnership between the computer and the designer"; The Wizard, which provides "imminent, seamless spatialization of the user's design intentions, prior even to their verbalization" and politically aims at de-professionalizing architecture; The Surrogate, a kind of scaled-back Wizard where "the role of the machine was to empower non-expert users, who knew very little about design but plenty about their living preferences ... By making inferences on the user's sketches and statements the computer would ideally be able to construct a model of the user and therefore operate as his surrogate, his expert alter ego, his own native architect"; and finally The Accountant, with the computer "objectively recording personal and collective histories and feeding them back to users and communities without 'agency' or 'intelligence' ... The accountant just kept the books; it was up to the inhabitants to own and manage the data in order to reflect on the implications of their past actions and plan their collective futures." These roles are primarily drawn from Nicholas Negroponte's work, except for The Accountant, which comes from Yona Friedman.

2017-05-30

Propositional & analogical generation of coordinated verbal, visual & musical texts

S. Klein et al. SIGART Bulletin, 1982.

Tags: story generation

[doi]

Excerpt:

We are developing a model for human cognitive processing which assumes that a major component of the rules for calculating behavior are resident outside the individual, in the inherited, collective phenomena that anthropologists call 'culture'. Our model contains rules of behavior encoded in propositional structures such as frames or scripts, plus a method for calculating behavior by analogy. Propositional rules are transformed into situation state descriptions that are related by transformation operators which are also valid for transforming situations by analogy, and for planning by analogy. [...] We use the simulation model as a generator of operas, complete with textual, pictorial and musical output, all derived from a common semantic source. The output is in the medium of a videotape recording, and is generally entitled, Revolt in Flatland.

One of several interesting finds in the history of story generators dug up by James Ryan of UCSC. An automatic opera generator, producing its output directly to videotape, in 1982. This paper has 9 authors, which suggests a fairly large project, but this seems to be the only publication about the project (though it builds on some technologies that have their own publications), and it's been cited one time ever, so the project mostly passed without any attention. Strange, but probably partly due to Sheldon Klein's rather unorthodox way of working and publishing that seems to have led him to build a number of very interesting systems that got relatively little attention. I wonder if any of the videotapes still exist?

2016-12-02

Classification accuracy is not enough: On the evaluation of music genre recognition systems

B.L. Sturm. J. Intelligent Information Systems, 2013.

Tags: computer music, machine learning

[doi]

Abstract:

We argue that an evaluation of system behavior at the level of the music is required to usefully address the fundamental problems of music genre recognition (MGR), and indeed other tasks of music information retrieval, such as autotagging. A recent review of works in MGR since 1995 shows that most (82%) measure the capacity of a system to recognize genre by its classification accuracy. After reviewing evaluation in MGR, we show that neither classification accuracy, nor recall and precision, nor confusion tables, necessarily reflect the capacity of a system to recognize genre in musical signals. Hence, such figures of merit cannot be used to reliably rank, promote or discount the genre recognition performance of MGR systems if genre recognition (rather than identification by irrelevant confounding factors) is the objective. This motivates the development of a richer experimental toolbox for evaluating any system designed to intelligently extract information from music signals.

Intriguing paper digging into how you determine whether a music-genre recognition system is actually recognizing genre. Argues that the common headline metric, classification accuracy on some labeled dataset, is often misleading (significantly overstating success). Probably applicable to other ML domains as well. One example of why: many systems don't really recognize genre, but specific technical mastering features that happen to correlate strongly with genre in their particular dataset (not musical features at all, but features of the recording/production process). For example, specific reverb effects tend to be associated with live performance in classical halls, and country-western-specializing studios use certain kinds of mastering/production techniques that are rarely used by rock-specializing studios and vice versa. For some applied classification purposes that's fine, but for others it's clearly not too satisfying, and may cause generalization failures.