Comments on The Invisible Things Lab's blog: Converting untrusted PDFs into trusted ones: The Qubes Way

Sorry, wrong URL, this is the correct repo, of cou...

2014-05-16T18:02:26.183+02:00

Sorry, wrong URL, this is the correct repo, of course:

http://git.qubes-os.org/?p=joanna/qubes-app-linux-pdf-converter.git;a=summary

@J.M. Porup: the repo has been renamed since that ...

2014-05-16T18:01:49.150+02:00

@J.M. Porup: the repo has been renamed since that time, hence the links are giving 404s. The repo is now here:

http://git.qubes-os.org/?p=joanna/antievilmaid.git;a=summary

You should be able to find all the referenced files in this new repo.

Joanna the links to git.qubes-os.org all return 4...

2014-05-16T16:00:01.252+02:00

Joanna

the links to git.qubes-os.org all return 404. Would like to take a look at the wrapper scripts.

thanks
j

GSview PStoEdit: http://pages.cs.wisc.edu/~ghost/g...

2013-08-16T21:38:08.353+02:00

GSview PStoEdit:
http://pages.cs.wisc.edu/~ghost/gsview/pstoedit.htm
Converter to Vector Formats:
http://www.pstoedit.net/pstoedit/
Including SVG:
http://www.helga-glunz.homepage.t-online.de/plugins/

Assuming you can get the source.
I can't help with needing to spawn a Disposable VM to securely do the conversion... BUT you should need to: (A) you can have a list of one or more DVMs pre-running and ready to go; (B) you could reuse such PDF->SVG conversion DVMs for some number of jobs or for a session, sending them back to sleep to await the next conversion job... Unfortunately you might have to compromise security for your sanity here, but the SVG (XML) output of a behind-the-scenes PDF converter should not propagate infections...
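
A minimal sketch of points (A) and (B), assuming nothing about how Qubes actually starts Disposable VMs: `Worker`, `WarmPool`, `spawn` and `max_jobs` are hypothetical names, and a worker here is only a stand-in object. It shows the trade-off in code form: workers are started ahead of time and each one is thrown away after a fixed number of jobs.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Worker:
    """Stand-in for one pre-started Disposable VM (hypothetical)."""
    ident: int
    jobs_done: int = 0


class WarmPool:
    """Keeps `size` workers ready and retires each one after `max_jobs` jobs."""

    def __init__(self, spawn: Callable[[], Worker], size: int, max_jobs: int):
        self._spawn = spawn
        self._max_jobs = max_jobs
        # Pay the startup cost up front, before any job arrives.
        self._idle = deque(spawn() for _ in range(size))

    def run(self, job: Callable[[Worker], Any]) -> Any:
        worker = self._idle.popleft()      # already running, no startup delay
        result = job(worker)
        worker.jobs_done += 1
        if worker.jobs_done >= self._max_jobs:
            worker = self._spawn()         # drop the used worker, start a fresh one
        self._idle.append(worker)
        return result
```

Setting `max_jobs=1` gives back the one-VM-per-document behaviour; anything larger is the "compromise security for your sanity" case, since a worker that was compromised by one document would see the next one.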

Let me say I admire your work Joanna and I am not ...

2013-07-31T11:48:23.990+02:00

Let me say I admire your work Joanna and I am not a PDF expert either.
Rather than go the "Raster" (RGB bitmap) route, I would go the "Vector" one, and better yet use a declarative format, e.g. an XML-based one that supports both raster and vector data, has robust parsers, and carries no embedded procedural code of the kind PDF and PS do, other than within well-defined tags like "script" that can be readily stripped.
This problem with embedded "code" exists for all the big file formats, including MS and Open Office files and such things as CAD files, and will be a constant bugbear for security (not limited to, but also including, your Qubes). XML-based file formats are a saving grace given the simplicity and ubiquity of tested parsers, along with the data/code divide mentioned.
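
To make the "readily stripped" part concrete, here is a rough sketch using Python's standard `xml.etree.ElementTree`. The function name `strip_scripts` is made up for illustration, and dropping `on*` attributes (such as `onload`) is my own addition on top of the "script" tag mentioned above. This is not a complete SVG sanitizer, and the XML parser itself would still be running on untrusted input.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
SCRIPT_TAGS = ("script", f"{{{SVG_NS}}}script")


def strip_scripts(svg_text: str) -> str:
    """Return the document without script elements and on* event attributes."""
    root = ET.fromstring(svg_text)
    for parent in list(root.iter()):
        # Remove script children (with or without the SVG namespace).
        for child in list(parent):
            if child.tag in SCRIPT_TAGS:
                parent.remove(child)
        # Remove event-handler attributes such as onload or onclick.
        for name in list(parent.attrib):
            if name.lower().startswith("on"):
                del parent.attrib[name]
    return ET.tostring(root, encoding="unicode")
```

Everything that is not a script element or an event attribute is passed through unchanged, so this is still a black-list style filter rather than a white list.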
Consider a series of transformations between formats, with each one using a different parser that also strips out all code and macros (so there will be no dynamism or animated imagery in the regenerated PDFs, for example). If you do enough transformations with enough code stripping, then nothing is going to get through, as an attacker would have to target multiple vulnerabilities across multiple libraries in order to do so. Such a chain can be arbitrary or random, both in terms of the components used and the length of the chain, e.g.:

PDF => ... => .... => PDF
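
One way to picture a chain that is random in both components and length is the sketch below. The `CONVERTERS` table is a hypothetical placeholder (it only names formats, not real tools), and `random_chain`, `min_hops` and `max_hops` are illustrative names of my own.

```python
import random

# Hypothetical table: each format maps to the formats it could be converted to.
CONVERTERS = {
    "pdf": ["svg", "ps"],
    "svg": ["pdf", "ps"],
    "ps": ["pdf", "svg"],
}


def random_chain(rng: random.Random, min_hops: int = 2, max_hops: int = 5) -> list:
    """Return a random format path that starts and ends at 'pdf'."""
    while True:
        path = ["pdf"]
        while len(path) - 1 < max_hops:
            path.append(rng.choice(CONVERTERS[path[-1]]))
            if path[-1] == "pdf" and len(path) - 1 >= min_hops:
                return path
        # Ran out of hops before getting back to PDF; draw a fresh path.
```

Each hop would be handled by a different converter, so an exploit has to survive every stage of a path the attacker cannot predict in advance.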

You are currently using Bitmaps but SVG would be a much better option:

PDF => SVG => PDF

There is open source available including multiple Apache PDF engines along with SVG2PDF and PDF2SVG projects available.

Feel free to contact me offline to discuss further since I really wanted to follow Gandalf's advice and: "Keep it secret; Keep it safe..." before I actually started writing this...

I am actually working on it (and other things more...

2013-04-18T17:34:13.864+02:00

I am actually working on it (and other things more or less related) in my free time. I am still reading up on the PDF format.

Code comes last. You need a good design first. Things can be made simple on purpose. If the white list is pretty restrictive and hardcoded in the parser, I am confident it is doable.

One does not have to do it in one go. One can first demonstrate that a function does what it is meant to do considering its inputs and outputs. Once done, one can just assume the function works correctly and consider the bigger picture without going back to it.

Most of the time the proof is done by exhaustion considering a general case and a lot of special cases (boundaries, overflow, etc).

Automating it is hard but doing it by hand is relatively easy (given a common program structure: no self-modifying code, etc.). It is just time-consuming, but with a small enough program it is manageable.

A parser that copies what it recognizes and ignores the rest can be fairly small.
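
To give an idea of the size, here is a toy sketch of "copy what you recognize, ignore the rest". It is not a PDF parser: it assumes a simplified input with one drawing command per line, and the whitelist `ALLOWED_OPS` and the limits are hypothetical choices hardcoded for illustration.

```python
# Hypothetical hardcoded whitelist of path-drawing operators.
ALLOWED_OPS = frozenset({b"m", b"l", b"S", b"re", b"f"})
MAX_LINE = 256    # longer lines are ignored, not truncated
MAX_TOKEN = 16    # longer numbers are ignored


def _is_number(tok: bytes) -> bool:
    """Accept only short plain decimals such as 12, -3.5 or .25."""
    if len(tok) > MAX_TOKEN:
        return False
    body = tok[1:] if tok[:1] == b"-" else tok
    whole, _, frac = body.partition(b".")
    return (whole + frac).isdigit()


def copy_recognized(data: bytes) -> bytes:
    """Copy lines of the form '<numbers...> <allowed operator>', drop the rest."""
    out = []
    for line in data.split(b"\n"):
        if len(line) > MAX_LINE:
            continue
        parts = line.split()
        if not parts or parts[-1] not in ALLOWED_OPS:
            continue
        if not all(_is_number(p) for p in parts[:-1]):
            continue
        out.append(b" ".join(parts))   # re-emit in normalized form
    return b"\n".join(out)
```

The output is rebuilt from recognized tokens only, so anything the whitelist does not name never reaches the second program.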

anonymous-who-thinks-one-can-formally-prove-a-pdf-parser (and made you laugh)

@anonymous-who-thinks-one-can-formally-prove-a-pdf...

2013-04-18T16:33:17.226+02:00

@anonymous-who-thinks-one-can-formally-prove-a-pdf-parser: your comment made me laugh.

Anyway, if you think this is all that simple, then where is the code? Everybody is good at talking, but few can actually write something, huh?

Security by correctness seems more relevant here. ...

2013-04-18T16:20:55.787+02:00

Security by correctness seems more relevant here. The parser can be small enough to allow the demonstration of its correctness. I mean a formal demonstration like one would do with a mathematical theorem (taking into account the way numbers are represented, memory limitations, etc). (You can show that whatever data (not necessarily pdfs but any random data) is fed to your program, it will always behave as expected.)

All this parser needs to do is locate the safe areas (white listing) and transmit them to a second program which will turn them into a correct (stripped) pdf file (or something different).

All this assumes we can trust system calls for disk access. If we can't, it means that storing, copying or reading it (e.g. with a hexadecimal reader) will compromise the system.

Actually the image compression code can be of arbi...

2013-03-09T18:10:24.068+01:00

Actually the image compression code can be of arbitrary complexity, as one would normally be doing the compression in the trusted VM on the already verified RGB format. So, unless we fear attacks that could exploit a compressor by feeding it with "strange" bytes to compress (in contrast to feeding a decompressor with strange input file), a rather unthinkable situation for any real-world compressor IMO, we should be fine with any compressor.
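
As a small illustration only (the function name is hypothetical and zlib is just an example of an off-the-shelf compressor), the trusted side would be doing something like this on pixel data it has already verified:

```python
import zlib


def compress_verified_rgb(rgb: bytes) -> bytes:
    """Compress pixel data that was already verified as plain RGB triples."""
    if len(rgb) % 3:
        raise ValueError("expected whole RGB triples")
    # The compressor only ever sees bytes to compress, never a compressed
    # stream produced by the untrusted side.
    return zlib.compress(rgb, 9)
```

The decompressor is the component that would face attacker-shaped input, and in this arrangement it only ever reads what the trusted VM itself produced.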

Alon: A jpg parser is significantly more complex t...

2013-03-09T16:44:22.273+01:00

Alon: A jpg parser is significantly more complex than a raw RGB one. In fact one could say that raw RGB data needs no parsing, just read into memory and display.
If space is a concern then a simple run-length-encoding scheme which is trivial to implement safely would improve the situation.
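
For what it is worth, a run-length scheme of that kind could look roughly like the sketch below. The record layout (one count byte followed by one RGB pixel) and the `max_pixels` bound are assumptions made for this example, not something taken from the converter.

```python
def rle_encode(rgb: bytes) -> bytes:
    """Encode RGB triples as records of (run length 1..255, R, G, B)."""
    if len(rgb) % 3:
        raise ValueError("expected whole RGB triples")
    out = bytearray()
    i = 0
    while i < len(rgb):
        pixel = rgb[i:i + 3]
        run = 1
        while run < 255 and rgb[i + 3 * run:i + 3 * run + 3] == pixel:
            run += 1
        out.append(run)
        out += pixel
        i += 3 * run
    return bytes(out)


def rle_decode(data: bytes, max_pixels: int) -> bytes:
    """Decode records, refusing malformed input and oversized output."""
    if len(data) % 4:
        raise ValueError("truncated record")
    out = bytearray()
    for i in range(0, len(data), 4):
        run = data[i]
        if run == 0:
            raise ValueError("zero-length run")
        if len(out) // 3 + run > max_pixels:
            raise ValueError("decoded image larger than expected")
        out += data[i + 1:i + 4] * run
    return bytes(out)
```

The decoder has only three ways to reject input and no state beyond the output buffer, which is what makes it easy to review; the caller would pass `max_pixels` as width times height.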

Jake: It is the premise that is important here, not a specific implementation. XPDF lacks a lot of features that modern PDF viewers offer and even so, I do not think that one can say with certainty that it is "safe code" since nothing has been formally proven. Even if there are no publicly disclosed XPDF vulnerabilities, the fact remains that it could still be vulnerable.

I think that people should focus more on the ideas and premises behind Joanna's posts (which are all solid and offer tangible security benefits, something that is extremely rare in the defensive domain) than trying to come up with more user-friendly (yet inherently flawed) alternatives to the specific scenarios that she brings up.

Compartmentalization works because it allows one to think about risk and exposure and *factor* them into his everyday decisions. It may not be as simple as click-and-forget but it is the best solution we have today and Qubes does an excellent job of balancing the scales.

Hi Joanna, Nice post and an interesting approach. ...

2013-03-04T19:23:49.750+01:00

Hi Joanna,
Nice post and an interesting approach. Some of the original motivation for the PDF format was saving space. The PDF is (to my knowledge) a vector format, which saves lots of space on big documents. Since you convert it back to RGB-based, it's uncompressed and I assume it'll take a lot of space, like BMP files mentioned by another comment. The next reasonable step in your approach would be IMHO to switch to better representation of graphical data, e.g. use JPG or others instead of "pure" RGB representation. I guess you've probably considered it by now.

BTW I personally think that in terms of functionality and scalability, as you remarked with the Intel example, the solution must be a white list, i.e. disassembling the document and building it back.
The PDF is overall a good format, it just uses a little bit too much functionality. Your solution basically destroys the format completely :)

Wow, I understand your approach, despite copy-past...

2013-03-01T01:06:20.213+01:00

Wow, I understand your approach, despite copy-paste shell scripting "ability;" I'm going to try out Qubes, I need at least one software environment that works.
Probably the most useful thing I can contribute is this: You need Free(as in GNU) hardware for this to end well, O level ownership from microcircuitry to microcode on up, otherwise fake security(opacity) will trump real security. And a clean electricity/EMF environment or it may well "Blogger" on you due to institutional resistance to the "dangerous" notion you have a right to do things on your computer unmolested;)

nice way of not answering my question. ;)

2013-02-24T14:58:43.154+01:00

nice way of not answering my question. ;)

@Jake: if you consider RGB parsing to be "com...

2013-02-23T15:42:00.172+01:00

@Jake: if you consider RGB parsing to be "complex", then I wonder what you could possibly consider to be "simple"? ;)

i think this, and qubes in general, is an interest...

2013-02-23T14:20:10.787+01:00

i think this, and qubes in general, is an interesting approach to overall security. however, the complexity of it all seems rather high.

there have been lots of exploits for adobe products published but comparatively fewer for open source pdf viewers, e.g. xpdf. how do you justify the increase in complexity versus the likelihood of various attack vectors, e.g. owned windows machine with an adobe pdf exploit?

... of course we can also run such proxy in a Disp...

2013-02-23T12:26:23.457+01:00

... of course we can also run such proxy in a Disposable VM, no problem, if somebody likes blacklisting approaches. Here the benefit would be that we don't have a centralized single point of failure that can later turn against us.

@Theodoros: 1) Your solution is a centralized one...

2013-02-23T12:24:25.104+01:00

@Theodoros:

1) Your solution is a centralized one -- not only does it create privacy concerns and rule out end-to-end encrypted documents, but it is also much less secure: if your "safe" parser got compromised, then it would be able to steal all subsequent documents (and leak them out to China) as well as infect clients by serving them malicious PDFs.

2) Your approach to process PDF is what I described in the article as a black list solution, with all the usual problems such solutions have.

In other words I fail to see how this solution could be better than just using a somehow "hardened" PDF viewer. Actually the latter would be better because we avoid the centralized single point of failure, i.e. your proxy.

Joanna nice post!!!! Have you ever thought on the...

2013-02-23T01:37:27.061+01:00

Joanna nice post!!!!

Have you ever thought about the following solution?
http://www.sans.org/reading_room/whitepapers/intrusion/animal-farm-protection-client-side-attacks-rendering-content-python-squid_33614

It uses Python and Squid to render the PDFs.

@dragosr: I don't quite understand what criter...

2013-02-22T18:59:34.355+01:00

@dragosr: I don't quite understand what criteria you used to state that "safe code is the best way to parse PDFs safely"? The absolute safety of so-called "safe languages" is a myth -- ever wondered why Singularity never took off?

Also @dragosr: Using a "sacrificial machine" for PDF processing, whether it is real or virtual, is only good if you deal with unclassified, public PDFs only. But if you need to open a confidential contract, or NDA-protected documentation, then using a sacrificial machine would be irresponsible at best, and might even get you into legal problems if those confidential PDFs leak out of your "sacrificial machine", because every decent NDA would explicitly require you to treat the confidential material given to you with great care. And, then, on the other hand, it's also unreasonable to assume that every non-public PDF is non-malicious.

I am just suggesting the same approach used to spe...

2013-02-22T18:34:14.148+01:00

I am just suggesting the same approach used to speed up other things (pre-loading tabs, linked pages, hard drive blocks,...). I assume that it could be tweaked based on available (meaning not used, or wasted...) memory. There is always a drawback.

The best way to parse PDFs safely is still probabl...

2013-02-22T17:24:34.880+01:00

The best way to parse PDFs safely is still probably to parse them safely with safe code. Applying a band-aid on a turd still leaves you with a turd. interesting ideas though. right now I use a sacrificial machine for pdfs - kind of a real world physical equivalent of your VMs, and a little less bother at a little more hardware expense.

@nicolas: it's not "mine", it's ...

2013-02-21T23:29:53.819+01:00

@nicolas: it's not "mine", it's ImageMagick's.

Also @nicolas: how many Disposable VMs should you keep in a queue to look through a dozen invoices in a folder?

Your raw RGB format is actually extremely close to...

2013-02-21T21:29:24.713+01:00

Your raw RGB format is actually extremely close to the Windows BMP 24 so you could have used that. The header only really contains width and height.
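
To show how little there is to check with a header like that, here is a sketch that assumes a made-up layout: two little-endian 32-bit integers (width, height) followed by the pixel bytes. This is neither the real BMP header nor necessarily what the converter sends; `read_rgb` and `MAX_DIM` are hypothetical names.

```python
import struct

MAX_DIM = 10_000  # hypothetical upper bound on width and height


def read_rgb(blob: bytes):
    """Parse (width, height, pixels) and reject anything inconsistent."""
    if len(blob) < 8:
        raise ValueError("missing header")
    width, height = struct.unpack_from("<II", blob, 0)
    if not (0 < width <= MAX_DIM and 0 < height <= MAX_DIM):
        raise ValueError("unreasonable dimensions")
    pixels = blob[8:]
    if len(pixels) != width * height * 3:
        raise ValueError("pixel data does not match the header")
    return width, height, pixels
```

Beyond these three checks the payload is just bytes to display, so there is no further structure in which an attacker could hide anything for a parser to trip over.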

You could keep a pre-loaded Disposable VM in memory to make DispVM startup faster. I do not know at what point in the loading process they need to be "specialized".