🧵 Thread (35 tweets)

You don't have to read https://t.co/rTwKKf8IKx to understand this thread but it helps. TLDR: Hypothesis takes a random generator of stuff and gives you "test-case reduction" for it, which lets you ask it to generate things that are like that thing it gave you but smaller.
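
For the unfamiliar, here's what that looks like with Hypothesis's standard API. This is a real, runnable example; the exact shrunk value it reports will vary from run to run:

```python
from hypothesis import given, strategies as st

# The strategy is the "random generator of stuff"; when the assertion
# fails, Hypothesis shrinks the failing input before reporting it.
@given(st.lists(st.integers()))
def test_sum_is_small(xs):
    # Deliberately false property: Hypothesis will typically report a
    # minimal counterexample such as [501], not a huge random list.
    assert sum(xs) <= 500
```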

The problem is that choices can depend on previous choices, so this can mess things up. We solve this by representing all choices as sequences of bits (coin flips) and just hoping they mean roughly the same thing if we change things across runs.
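
A toy sketch of that idea (not Hypothesis's real internals): every choice is a draw from one shared byte stream, so "smaller" variants come from editing the stream and hoping the remaining bytes still mean roughly the same thing:

```python
import random

def draw_list(stream):
    """Interpret a byte stream as a list of small ints: each element is
    preceded by a coin-flip byte asking "keep going?", so truncating or
    deleting bytes tends to produce a shorter list."""
    out, i = [], 0
    while i + 1 < len(stream) and stream[i] & 1:  # coin flip: continue?
        out.append(stream[i + 1])
        i += 2
    return out

stream = bytes(random.randrange(256) for _ in range(20))
print(draw_list(stream))      # the original generated value
print(draw_list(stream[:8]))  # a variant drawn from a truncated stream
```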

Hypothesis has been designed to work well on its standard library of generators and on user code we've seen that uses them. It was never really designed with arbitrary generators in mind. Also it was mostly designed to work with things which were smallish to start with.

Anyway, a while back @johnregehr and others wrote Csmith, which is a generator of C programs designed to break compilers in entertaining ways. Some time last (?) year @afd_icl wondered whether we could hook Hypothesis up to drive Csmith for our paper evaluation.

It took a little bit of dark magic, but we can. It even works pretty well. Relevant dark magic for the interested: https://t.co/QYdorp4vEX

The catch is that test-case reduction on Csmith-generated programs is extremely slow and makes very little progress. This is annoying. It should work. So over the last ~week I've been poking at how to make it work. This builds on the infrastructure I described in https://t.co/mCxmyoZbWI, where I'm trying to make it much easier to do research-level work by building infrastructure for that.

We thought that the problem was some sort of memory leak in Hypothesis (where it doesn't ever throw away data) and it *kinda* is, but really the problem is more that Hypothesis just spends ages not making very good progress, so it's working with big examples for far too long.

This is weird. Why is it weird? Well, test-case reduction is basically a local process. It repeatedly tries small variants of its current best example to see if it can make it a bit smaller. *Ideally* it makes big progress early on, then finishes with lots of small polishing.
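
That local process is roughly this shape (a generic sketch of a greedy reducer, not Hypothesis's actual shrinker):

```python
def reduce(test_case, still_fails, variants):
    """Greedy local reduction: `variants(x)` yields candidate smaller
    versions of x, `still_fails(x)` checks the bug still reproduces."""
    improved = True
    while improved:
        improved = False
        for candidate in variants(test_case):
            if still_fails(candidate):
                test_case = candidate  # accept and restart from here
                improved = True
                break
    return test_case
```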

The problem with Csmith + Hypothesis is that it never gets that big early progress. It spends a lot of time early on making very little progress, essentially running those polishing phases almost immediately, without ever making the test case much smaller.

The way this shows up is that if you focus on the structure that corresponds to e.g. a generated function, there's almost nothing useful you can do to it. Replacing it with a simpler function, or removing it altogether, seems not to work.

Probably it affects state and invalidates later choices or something. Haven't investigated. I spent a good chunk of the last week trying to make Hypothesis work with the high level structure of Csmith generated test cases and made no progress.

Turns out, equally bafflingly, that the opposite happens if you focus on low level structure. Hypothesis has a little DSL called "block programs" which is designed for reductions that just affect a short region of choices.

In general block programs are designed for transformations that don't necessarily make much sense in the context of the generated test case, or that you might not have thought worth special casing. e.g. "XX" is why Hypothesis can transform [[1, 2], [3, 4]] into [1, 2, 3, 4].
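
Roughly, a block program is a short string of per-block commands that gets slid across the choice sequence and tried at every offset. A sketch under that assumption, implementing only the delete command 'X' (so "XX" deletes two adjacent blocks, which is the kind of edit that can merge [[1, 2], [3, 4]] into [1, 2, 3, 4]):

```python
def apply_block_program(program, blocks, offset):
    """Apply one command per consecutive block starting at `offset`.
    Only 'X' (delete the block) is implemented in this sketch."""
    result = list(blocks)
    # Process right to left so deletions don't shift later indices.
    for j in reversed(range(len(program))):
        if program[j] == "X":
            del result[offset + j]
    return result

def candidates(program, blocks):
    """All results of applying `program` at each possible offset."""
    for offset in range(len(blocks) - len(program) + 1):
        yield apply_block_program(program, blocks, offset)
```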

They're not *arbitrary* arbitrary - they're still what they were before - but nine times out of ten you've changed the meaning of something, so if it works you got lucky. But if you roll the dice enough times you eventually get lucky, so block programs are often good anyway.

In particular they're good for escaping weird local minima sometimes. e.g. that '--X' program was added because sometimes it gets one of our tests for generating pandas data frames unstuck. No particularly principled reason for that, it just does.

Anyway the weird thing with Csmith, and what causes Hypothesis to make progress on it eventually, is that even though modifying the high level structure doesn't work very well, for some reason block programs work unreasonably well on it.

The annoying thing is that block programs are by their nature pretty fine grained - they make small amounts of progress at a time - but if we can find block programs that reliably make progress then we can still inch our way to the finish line pretty well.
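
In other words, each block program becomes a variant source for the greedy loop sketched earlier, run to a fixed point. Hypothetical glue code, reusing the `reduce` and `candidates` sketches above:

```python
def reduce_with_block_programs(blocks, still_fails,
                               programs=("X", "XX", "XXX")):
    # Each program only makes a tiny change, but running each one to a
    # fixed point lets the reducer inch toward a small test case.
    for program in programs:
        blocks = reduce(blocks, still_fails,
                        lambda bs: candidates(program, bs))
    return blocks
```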

Does this work? Well... it doesn't *not* work. It's making much better progress than anything else I've tried on Csmith. It's still extraordinarily slow by Hypothesis standards, but that's maybe just what reduction on large test cases looks like.

One such block program affects a sequence of 11 choices as follows: 1. Delete it. 2. Delete it. 3. Subtract one from it. 4. Zero it. 5. Add one to it. 6. Randomise it. 7. Delete it. 8. Delete it. 9. Delete it. 10. Subtract one from it. 11. Add one to it.
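
Spelled out as code, with the 11 choices modelled as small ints. The operation names here are my own shorthand for the steps listed above, not necessarily the DSL's actual spelling:

```python
import random

# One entry per choice, matching the 11 steps described above.
OPS = ["delete", "delete", "sub", "zero", "add", "randomise",
       "delete", "delete", "delete", "sub", "add"]

def apply_ops(ops, choices):
    result = []
    for op, value in zip(ops, choices):
        if op == "delete":
            continue                         # drop the choice entirely
        elif op == "sub":
            result.append(max(0, value - 1)) # subtract one, floor at zero
        elif op == "zero":
            result.append(0)
        elif op == "add":
            result.append(value + 1)
        elif op == "randomise":
            result.append(random.randrange(256))
    return result

print(apply_ops(OPS, list(range(11))))
```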

Apparently (and I'm embarrassed not to have noticed this before) part of why this works is that Csmith is just really very happy to have contiguous sequences of choices deleted, and tends to work when you do this. "X" * n works well for a lot of values of n.
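
That family is simple enough to sketch directly: for each run length n, try deleting every contiguous window of n choices (same hypothetical block representation as the sketches above):

```python
def delete_runs(blocks, still_fails, max_n=8):
    """Try the '"X" * n' programs: delete contiguous runs of n choices,
    for n from max_n down to 1."""
    for n in range(max_n, 0, -1):
        offset = 0
        while offset + n <= len(blocks):
            candidate = blocks[:offset] + blocks[offset + n:]
            if still_fails(candidate):
                blocks = candidate  # keep the deletion, retry this offset
            else:
                offset += 1
    return blocks
```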

This is particularly interesting because: 1) it *doesn't* seem to work particularly well for n <= 2, and 2) I'd previously assumed it wouldn't work, because deletions based on more structural information (trying to delete regions that correspond to nice generated structure) don't.