🧵 Thread (35 tweets)

You don't have to read https://t.co/rTwKKf8IKx to understand this thread but it helps. TLDR: Hypothesis takes a random generator of stuff and gives you "test-case reduction" for it, which lets you ask it to generate things that are like that thing it gave you but smaller.
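
For the unfamiliar, here's what that looks like with Hypothesis's standard API. This is a real, runnable example; the exact shrunk value it reports will vary from run to run:

```python
from hypothesis import given, strategies as st

# The strategy is the "random generator of stuff"; when the assertion
# fails, Hypothesis shrinks the failing input before reporting it.
@given(st.lists(st.integers()))
def test_sum_is_small(xs):
    # Deliberately false property: Hypothesis will typically report a
    # minimal counterexample such as [501], not a huge random list.
    assert sum(xs) <= 500
```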

The problem is that choices can depend on previous choices, so this can mess things up. We solve this by representing all choices as sequences of bits (coin flips) and just hoping they mean roughly the same thing if we change things across runs.
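
A toy sketch of that idea (not Hypothesis's real internals): every choice is a draw from one shared byte stream, so "smaller" variants come from editing the stream and hoping the remaining bytes still mean roughly the same thing:

```python
import random

def draw_list(stream):
    """Interpret a byte stream as a list of small ints: each element is
    preceded by a coin-flip byte asking "keep going?", so truncating or
    deleting bytes tends to produce a shorter list."""
    out, i = [], 0
    while i + 1 < len(stream) and stream[i] & 1:  # coin flip: continue?
        out.append(stream[i + 1])
        i += 2
    return out

stream = bytes(random.randrange(256) for _ in range(20))
print(draw_list(stream))      # the original generated value
print(draw_list(stream[:8]))  # a variant drawn from a truncated stream
```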

Hypothesis has been designed to work well on its standard library of generators and on user code we've seen that uses them. It was never really designed with arbitrary generators in mind. Also it was mostly designed to work with things which were smallish to start with.

Anyway, a while back @johnregehr and others wrote Csmith, which is a generator of C programs designed to break compilers in entertaining ways. Some time last (?) year @afd_icl wondered whether we could hook Hypothesis up to drive Csmith for our paper evaluation.

It took a little bit of dark magic, but we can. It even works pretty well. Relevant dark magic for the interested: https://t.co/QYdorp4vEX

The catch is that test-case reduction on Csmith-generated programs is extremely slow and makes very little progress. This is annoying. It should work. So over the last ~week I've been poking at how to make it work. This builds on the infrastructure I described in https://t.co/mCxmyoZbWI, where I'm trying to make it much easier to do research-level work by building infrastructure for that.

We thought that the problem was some sort of memory leak in Hypothesis (where it doesn't ever throw away data) and it *kinda* is, but really the problem is more that Hypothesis just spends ages not making very good progress, so it's working with big examples for far too long.

This is weird. Why is it weird? Well, test-case reduction is basically a local process. It repeatedly tries small variants of its current best example to see if it can make it a bit smaller. *Ideally* it makes big progress early on, then finishes with lots of small polishing.
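
That local process is roughly this shape (a generic sketch of a greedy reducer, not Hypothesis's actual shrinker):

```python
def reduce(test_case, still_fails, variants):
    """Greedy local reduction: `variants(x)` yields candidate smaller
    versions of x, `still_fails(x)` checks the bug still reproduces."""
    improved = True
    while improved:
        improved = False
        for candidate in variants(test_case):
            if still_fails(candidate):
                test_case = candidate  # accept and restart from here
                improved = True
                break
    return test_case
```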

The problem with Csmith + Hypothesis is that it never gets that big early progress. It spends a lot of time early on making very little progress, essentially running those polishing phases almost immediately, without ever making the test case much smaller.

The way this shows up is that if you focus on the structure that corresponds to e.g. a generated function, there's almost nothing useful you can do to it. Replacing it with a simpler function, or removing it altogether, seems not to work.

Probably it affects state and invalidates later choices or something. Haven't investigated. I spent a good chunk of the last week trying to make Hypothesis work with the high level structure of Csmith generated test cases and made no progress.

Turns out, equally bafflingly, that the opposite happens if you focus on low level structure. Hypothesis has a little DSL called "block programs" which is designed for reductions that just affect a short region of choices.

In general block programs are designed for transformations that don't necessarily make much sense in the context of the generated test case, or that you might not have thought worth special casing. e.g. "XX" is why Hypothesis can transform [[1, 2], [3, 4]] into [1, 2, 3, 4].
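
Roughly, a block program is a short string of per-block commands that gets slid across the choice sequence and tried at every offset. A sketch under that assumption, implementing only the delete command 'X' (so "XX" deletes two adjacent blocks, which is the kind of edit that can merge [[1, 2], [3, 4]] into [1, 2, 3, 4]):

```python
def apply_block_program(program, blocks, offset):
    """Apply one command per consecutive block starting at `offset`.
    Only 'X' (delete the block) is implemented in this sketch."""
    result = list(blocks)
    # Process right to left so deletions don't shift later indices.
    for j in reversed(range(len(program))):
        if program[j] == "X":
            del result[offset + j]
    return result

def candidates(program, blocks):
    """All results of applying `program` at each possible offset."""
    for offset in range(len(blocks) - len(program) + 1):
        yield apply_block_program(program, blocks, offset)
```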

They're not *arbitrary* arbitrary - they're still what they were before - but nine times out of ten you've changed the meaning of something, so if it works you got lucky. But if you roll the dice enough times you eventually get lucky, so block programs are often good anyway.

In particular they're good for escaping weird local minima sometimes. e.g. that '--X' program was added because sometimes it gets one of our tests for generating pandas data frames unstuck. No particularly principled reason for that, it just does.

Anyway the weird thing with Csmith, and what causes Hypothesis to make progress on it eventually, is that even though modifying the high level structure doesn't work very well, for some reason block programs work unreasonably well on it.

The annoying thing is that block programs are by their nature pretty fine grained - they make small amounts of progress at a time - but if we can find block programs that reliably make progress then we can still inch our way to the finish line pretty well.
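
In other words, each block program becomes a variant source for the greedy loop sketched earlier, run to a fixed point. Hypothetical glue code, reusing the `reduce` and `candidates` sketches above:

```python
def reduce_with_block_programs(blocks, still_fails,
                               programs=("X", "XX", "XXX")):
    # Each program only makes a tiny change, but running each one to a
    # fixed point lets the reducer inch toward a small test case.
    for program in programs:
        blocks = reduce(blocks, still_fails,
                        lambda bs: candidates(program, bs))
    return blocks
```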

Does this work? Well... it doesn't *not* work. It's making much better progress than anything else I've tried on Csmith. It's still extraordinarily slow by Hypothesis standards, but that's maybe just what reduction on large test cases looks like.

One such block program affects a sequence of 11 choices as follows: 1. Delete it. 2. Delete it. 3. Subtract one from it. 4. Zero it. 5. Add one to it. 6. Randomise it. 7. Delete it. 8. Delete it. 9. Delete it. 10. Subtract one from it. 11. Add one to it.
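
Spelled out as code, with the 11 choices modelled as small ints. The operation names here are my own shorthand for the steps listed above, not necessarily the DSL's actual spelling:

```python
import random

# One entry per choice, matching the 11 steps described above.
OPS = ["delete", "delete", "sub", "zero", "add", "randomise",
       "delete", "delete", "delete", "sub", "add"]

def apply_ops(ops, choices):
    result = []
    for op, value in zip(ops, choices):
        if op == "delete":
            continue                         # drop the choice entirely
        elif op == "sub":
            result.append(max(0, value - 1)) # subtract one, floor at zero
        elif op == "zero":
            result.append(0)
        elif op == "add":
            result.append(value + 1)
        elif op == "randomise":
            result.append(random.randrange(256))
    return result

print(apply_ops(OPS, list(range(11))))
```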

Apparently (and I'm embarrassed not to have noticed this before) part of why this works is that Csmith is just really very happy to have contiguous sequences of choices deleted, and tends to work when you do this. "X" * n works well for a lot of values of n.
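
That family is simple enough to sketch directly: for each run length n, try deleting every contiguous window of n choices (same hypothetical block representation as the sketches above):

```python
def delete_runs(blocks, still_fails, max_n=8):
    """Try the '"X" * n' programs: delete contiguous runs of n choices,
    for n from max_n down to 1."""
    for n in range(max_n, 0, -1):
        offset = 0
        while offset + n <= len(blocks):
            candidate = blocks[:offset] + blocks[offset + n:]
            if still_fails(candidate):
                blocks = candidate  # keep the deletion, retry this offset
            else:
                offset += 1
    return blocks
```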

This is particularly interesting because: 1) it *doesn't* seem to work particularly well for n <= 2, and 2) I'd previously assumed it wouldn't work, because deletions based on more structural information (trying to delete regions that correspond to nice generated structure) don't.