Thursday, February 21, 2013

Converting untrusted PDFs into trusted ones: The Qubes Way

Arguably one of the biggest challenges for desktop security is how to handle those overly complex PDFs, DOCs, and similar files, that are so often exchanged by people, or downloaded from the Web, and that often provide a way for the attacker to compromise the user's desktop system.

Today I would like to discuss a recent innovation we created for Qubes OS that allows one to securely convert those pesky PDFs (as well as essentially any graphics files) into trusted PDFs. By a “trusted PDF” I mean a file that should be harmless to the user's system, that is, a non-malicious PDF.


A few years ago we introduced a mechanism in Qubes OS called Disposable VMs, which can be used to safely open any file, including PDFs, DOCs, etc. The file is opened in a... well, a dedicated disposable VM that is created within seconds (typically below 5 seconds), and all the file processing and rendering happens inside this VM. Once the document is closed, the disposable VM is automatically destroyed, and any changes to the file (e.g. if it was an editable DOC file) are automatically propagated back into the original file. This mechanism is very powerful, and I often use it for my daily work. However, it surely is a bit cumbersome – who wants to wait 5 seconds for the PDF to open, especially when there are a dozen invoices to look through! So, today I present an alternative approach...
 
Approaches to converting PDFs

The problem of converting a potentially untrusted PDF into a harmless one is certainly not a new one. Some tools have already been created for this task.

The typical approach is to parse the original PDF, look for “potentially dangerous things” there, and remove them. As simple as that! This is, of course, a typical AntiVirus approach to the problem. And, as is typical for most AV approaches, it's completely useless against any more skilled and determined attacker (and these are the ones we fear the most, aren't they?).

A somewhat better approach is to parse the original PDF, disassemble it into pieces, and then reassemble them into a new PDF using only the “trusted” pieces – this, I think, could be called a whitelisting approach.

Anyway, the fundamental problem with the approaches mentioned above, is that all of them require parsing of the original PDF file. And parsing is where the “big bang” usually happens. Parsing is where our, normally pretty decent, code, comes in close, intimate contact with some unknown complex input data, which often leads to a successful abuse or exploitation.

Parsing PDFs safely

So, how to perform parsing safely? Of course, that's simple! Let's run the parser in an isolated container – in the case of Qubes we already have an ideal container of this kind: the Disposable VM.

But, before we get too excited, let's think more about it – say we run the parser safely isolated in a Disposable VM, meaning it couldn't harm any of the rest of the system, except for the Disposable VM itself, which we however won't worry much about, because it is disposable... But then what?

We want our PDF back in our original VM, to actually use it, right? But we cannot just copy the result from the Disposable VM, because if it got compromised as a result of parsing the malicious PDF, then we would likely get... a compromised converted PDF. So, this approach gives us nothing!

(Even though our “solution” incorporates all the obligatory buzzwords: “Disposable VMs” (“Micro Disposable VMs”?), “VMs isolated using hardware Intel vPRO technology” and, of course, the “hypervisor”! Sometimes the mere fact that we use “hardware virtualization” buys us nothing... People seem to forget about this sometimes.)

So, the trick to make this approach meaningful is to introduce what I will call a “Simple Representation” of the input file. More on this straightforward concept below. The idea is that our parser (that runs in a Disposable VM) will be expected to return the Simple Representation of the original PDF. Of course, it might very well go wild (as a result of exploitation by the PDF it parses), and not obey our expectations, and instead return something totally different and potentially malicious. But that doesn't matter! The whole point of the Simple Representation is that it should be, well... simple to parse safely, and simple to discard in case what we get doesn't look like the Simple Representation.

Ok, so what's the simplest possible representation of an arbitrary PDF file? Yes, it's the RGB format, which is essentially just a raw array of RGB values for each pixel. In fact, I'm not sure there could be anything simpler in the Known Universe to represent a PDF file...

Now this is all becoming simple: we would expect our parser to send us just two things: the dimensions (W x H) of the bitmap representation of each page of the PDF in question, and each PDF page itself converted into a raw RGB format. If the parser didn't obey, we would still interpret whatever stream of bytes we get as an RGB bitmap – in the worst case the PDF we create would look like an un-tuned analog TV screen.
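To make the “un-tuned TV screen” point concrete, here is a minimal Python sketch of what treating a byte stream as raw RGB means. It is an illustration only, not code from the converter: the function name bytes_to_rows and the assumption of 8 bits per channel are mine. There is no format detection anywhere, so hostile bytes can only ever become pixel values or be rejected for having the wrong length.

```python
def bytes_to_rows(data: bytes, width: int, height: int):
    """Interpret `data` as 8-bit RGB triples, row by row.

    Nothing here looks at what the bytes "really" are. The only way to be
    rejected is to have the wrong number of bytes for width x height.
    """
    expected = width * height * 3
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    rows = []
    for y in range(height):
        start = y * width * 3
        row = [tuple(data[start + 3 * x:start + 3 * x + 3]) for x in range(width)]
        rows.append(row)
    return rows


# Bytes that look like the start of a PDF are still just six pixels (3 x 2).
hostile = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0"
rows = bytes_to_rows(hostile, width=3, height=2)
print(rows[0][0])  # the bytes '%', 'P', 'D' read as one (R, G, B) pixel
```

The first pixel is simply the byte values of “%PD”: ugly noise, but harmless.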

The diagram below summarizes this idea:



Implementing this all on Qubes

Now I would like to show how easy it is to implement such a PDF converter service using the advanced Qubes infrastructure that we call qrexec, and which has been part of Qubes core for quite some time now.

First, let's choose the PDF and image conversion tools. The choice of PDF converter is not security critical, because it will run in an isolated Disposable VM. Here I decided to use the pdftocairo converter, which is part of the poppler-utils package on Fedora. We will also use ImageMagick's “convert” command to convert the PNG files (produced by pdftocairo, one for each PDF page) to the raw RGB format. Incidentally, ImageMagick supports the RGB format natively. As mentioned above, in addition to sending the raw RGB file, we also need to send the width and height of the pixmap – those can easily be obtained using ImageMagick's “identify” command. Again, none of the programs discussed so far are security critical – they might get exploited during the processing of the untrusted input PDF file, and we don't worry about that at all.

On the receiving side, however, we need to use a foolproof parser for the RGB format. Again, this is what we gain in this whole process – instead of requiring a foolproof-and-also-being-able-to-produce-non-malicious-PDFs parser, we only require a foolproof RGB parser, and that's quite a gain! ImageMagick's convert comes to mind again here, and one might want to use it like this:

convert page.rgb page.pdf

Unfortunately this would be wrong, because the convert program would still try to detect the “real” format of the page.rgb file, and, if it looked more like, say, JPEG or PDF, it would parse it accordingly, compromising our whole careful plan! What we really need is to tell our convert program to always treat the input as a raw RGB file, instead of trying to be (too) smart and guessing the format by itself. This can be achieved by adding the “rgb:” prefix in front of the input argument, which provides an explicit input format specification:

convert -size ${IMG_WIDTH}x${IMG_HEIGHT} -depth ${IMG_DEPTH} rgb:$RGB_FILE pdf:$PDF_FILE

We also needed to add the size and depth explicitly, because the raw RGB format doesn't convey such information (well, it has no header of any sort at all!). Of course we need to obtain the width and height from the parser, but we can validate such input rather easily. In addition we make sure that the received RGB file has exactly the size indicated by the width and height. With those precautions in place, there would have to be a really gaping hole in ImageMagick's RGB parsing code for the attacker to exploit this. Perhaps instead of using ImageMagick's convert I should have written a small script in Python that would parse the received RGB file (and save it into an... RGB file, for later processing by ImageMagick), but I sincerely think this would be overkill here.
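The precautions above could be sketched in Python roughly as follows. This is not the code of the actual client script, and several details are assumptions made for illustration: the “W H” header format, the MAX_DIMENSION bound, the fixed 8-bit depth, and all function names. Only the convert options (-size, -depth, and the rgb: and pdf: prefixes) come from the command shown above.

```python
import re

MAX_DIMENSION = 10000   # hypothetical upper bound, not taken from the converter
BYTES_PER_PIXEL = 3     # assuming 8 bits each for R, G and B

# Assumed header format: two positive decimal integers separated by one space.
_DIMENSIONS = re.compile(r"[1-9][0-9]{0,4} [1-9][0-9]{0,4}")


def parse_dimensions(header: str):
    """Return (width, height), or raise ValueError for anything unexpected."""
    if not _DIMENSIONS.fullmatch(header):
        raise ValueError("malformed dimensions header")
    width, height = (int(field) for field in header.split(" "))
    if width > MAX_DIMENSION or height > MAX_DIMENSION:
        raise ValueError("dimensions out of range")
    return width, height


def check_rgb_size(actual_size: int, width: int, height: int) -> None:
    """Make sure the received file has exactly width * height * 3 bytes."""
    expected = width * height * BYTES_PER_PIXEL
    if actual_size != expected:
        raise ValueError(f"expected {expected} bytes, got {actual_size}")


def build_convert_argv(width: int, height: int, rgb_path: str, pdf_path: str):
    """Build the ImageMagick command line with explicit input and output formats."""
    return ["convert", "-size", f"{width}x{height}", "-depth", "8",
            f"rgb:{rgb_path}", f"pdf:{pdf_path}"]


width, height = parse_dimensions("800 600")
check_rgb_size(800 * 600 * 3, width, height)
print(build_convert_argv(width, height, "page.rgb", "page.pdf"))
```

Passing the arguments as a list rather than a shell string is a deliberate choice in this sketch: the validated numbers never pass through shell word splitting or expansion.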
 
Finally we can write the following two simple bash scripts, one for client: qpdf-convert-client, and the other one, qpdf-convert-server, for the server (which runs in a Disposable VM).

Additionally we need to create a policy file in Dom0 in /etc/qubes_rpc/policy/ to allow the use of this service. The policy file content for this service should look like this:

$anyvm $dispvm allow

... which is pretty self explanatory. When I do development I also add another line to the policy file like this:

$anyvm devel-vm ask

... to allow me to run the server inside my 'devel-vm' VM, instead of running it in Disposable VM every time, which would be very inconvenient for development, as it would require me to update the Disposable VM template each time I wanted to test a new version of qpdf-convert-server.

The policy file should be placed in Dom0 in /etc/qubes_rpc/policy/qubes.PdfConvert file – here the name of the file must be the same as the name of the service, as invoked via qrexec_client_vm command, discussed below.

And, one last thing, in the destination VM we must also create a file that will map the service name (so, the qubes.PdfConvert in our example) to the actual binary that should be called in the VM when the service is invoked. So, the file should be named: /etc/qubes_rpc/qubes.PdfConvert (again, this is now in a VM, not in Dom0, also note the lack of policy/ subdir), and it is another one-liner with the following content:

/usr/lib/qubes/qpdf-convert-server

The full source code of qpdf-converter can be seen and downloaded from this git repo.

We're ready now to test our qubes.PdfConvert service: in the requesting VM, i.e. the one from which we want to initiate the conversion process we do:

[user@work Downloads]$ /usr/lib/qubes/qrexec_client_vm '$dispvm' qubes.PdfConvert /usr/lib/qubes/qpdf-convert-client ITLquote.pdf
-> Sending file to remote VM...
-> Waiting for converted samples...
-> Receving page 2 out of 2...
-> Merging pages into a single PDF document...
-> Converted PDF saved as: ITLquote.trusted.pdf
-> Original file saved as .ITLquote.pdf

Again, for development process I would replace '$dispvm' with something like 'devel-vm'.

The qrexec_client_vm command, used above, is not actually intended to be used by user directly (that's why it's installed in /usr/lib/qubes instead of /usr/bin/), and so when one creates a Qubes qrexec service, it's customary to create also a small wrapper around qrexec, like this one, that makes using the service simple.

The presented converter saves the original file as .${original_pdf} making it a hidden file to help the user avoid accidental opening. The new, converted file gets .trusted.pdf suffix appended to the base name of the original file. I discuss more issues regarding the human factor and avoiding accidental opening in one of the next paragraphs below. The converter can also be used to convert essentially any image file, such as JPEG, PNG, etc, into a PDF, using the same method.
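The naming scheme can be illustrated with a few lines of Python. This is only a sketch of the convention described above and visible in the sample output (ITLquote.pdf becomes ITLquote.trusted.pdf, with the original kept as .ITLquote.pdf); the function name converted_names is hypothetical, and the real client script may handle edge cases differently.

```python
from pathlib import PurePosixPath


def converted_names(original: str):
    """Return (trusted_name, hidden_original_name) for a given input path."""
    path = PurePosixPath(original)
    # Base name without its last extension, plus the .trusted.pdf suffix.
    trusted = path.with_name(path.stem + ".trusted.pdf")
    # The original is kept next to it, hidden by a leading dot.
    hidden = path.with_name("." + path.name)
    return str(trusted), str(hidden)


print(converted_names("ITLquote.pdf"))
```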

As you can see, creating client-server services in Qubes is very simple – in fact it took me just one afternoon to get the initial working version of the converter (with subsequent "polishing" over the next 2 days).

The qrexec infrastructure takes care of all the under-the-hood tasks, such as starting the necessary VMs (e.g. creating a Disposable VM to handle the service request), establishing communication channels between VMs (which are ultimately implemented on top of Xen's shared memory), redirecting the client's and server's stdin and stdout to each other, so that writing services is very simple, even in shell, and, of course, obeying policies defined centrally in Dom0.

Most “inter-VM” features in Qubes, such as secure file copy between domains, opening files in Disposable VMs, time synchronization, appmenus synchronization, etc, are all implemented on top of qrexec. A notable exception is clipboard exchange, which is implemented as part of the GUI protocol, but still uses the same common qrexec code for policy processing (e.g. I use this policy to block clipboard and file exchanges between my work and personal domains).

Limitations, other Simple Representations

The obvious disadvantage of converting a PDF to an RGB representation is that one loses text search, as well as copy and edit capabilities (e.g. in the case of PDF forms). So, converting Intel's IA32 Software Developer's Manual this way would certainly not be a good idea... But, hey, such large PDFs can always be opened in a Disposable VM – they would be fully functional then, only you would need to wait a few seconds for the PDF window to pop up. Or, better yet, why not keep all such PDFs in a dedicated domain? E.g. I have a VM called “work-pub” where I keep tons of various publicly available PDFs, such as the mentioned Intel SDM, as well as various chipset docs, conference papers and slides, and generally lots of stuff. The key point is that everything in this VM is public material (and also everything is related to my work), so I don't really care if any of those PDFs compromises my work-pub domain. In the worst case, I will revert the VM from backup and download any missing PDFs again from the web. They are public after all.
 
But the PDF conversion described above comes in extremely useful for all the various invoices, Purchase Orders, NDAs, contracts, and god-knows-what-else PDF documents, which I'm forced to deal with in my “work” domain (where my email client runs). Most of those are one-pagers, or at most a few pages long, so the fact that they got converted to a bitmap causes me very little discomfort. At the same time I gain incredible freedom in opening all those documents natively in my work VM, without fearing that one of those invoices will compromise my work domain (which would be a rather sad thing for me, although the really sensitive stuff is still in some other domains ;)

An interesting question is, however, can we come up with another form of Simple Representation that would allow us e.g. to preserve the text searching ability of the converted PDFs (and DOCs, PPTs)? Probably... yes. The choice of the Simple Representation should be thought of as a trade-off between security and preservation of the document's features. I'm not an expert on PDF and DOC formats (and I'm not sure I want to be) but it seems plausible that one could disassemble a PDF into simple pieces, select the really simple ones, send those pieces as a Simple Representation back to the client, and have them reassembled into an almost-fully-functional PDF. Here, again, the point is that the PDF parsing is done in an isolated Disposable VM, while the reassembly is done in the trusted VM. Anyway, let me leave it as an exercise for the reader :)
 
Preventing user mistakes

Being able to right-click on a PDF file and have it converted into a trusted PDF is one thing. Having this mechanism adopted by users and actually making their daily computing safer, is another story.
Users will likely have hundreds of PDFs spread over their home directories, and the real challenge is how to make sure that the user never accidentally opens the unconverted, untrusted PDF. We can think of several approaches to this problem:
  • We modify Thunderbird, Firefox, etc., e.g. by providing specific plugins, to always perform PDF conversion on each file that we got via email or downloaded from the Web. Additionally we convert all the PDFs already present in the user's home directory (file system?). And, additionally, we modify the Qubes file copy operation to also always do automatic PDF conversion whenever one transfers files from other domains (if the Qubes qrexec policy allows for such a transfer in the first place, of course).

This approach would not be optimal, because some PDFs, as we discussed above, might not be well suited for conversion-through-bitmap process – they might be large PDFs where text search is crucial, some conference papers for review, where text copy is crucial, or some editable forms. That's why it seems better to take a slightly different approach:
  • We modify mime handlers for PDF files (as well as any other files that our converter supports) and then upon every opening of the file (e.g. via mouse click in a file manager) our program gets to run and its job is to determine whether the file should be opened natively, converted to a trusted PDF, or perhaps opened in a Disposable VM. Of course, upon “first” opening we should probably ask the user about the decision, if this cannot be determined automatically. E.g. if we can reliably determine the file is already converted, we can safely open it without prompting the user, but if it's not, we should ask – perhaps the user would like to open it in a Disposable VM instead of converting, or perhaps the file should be considered trusted anyway, because it was created by the user herself.

This second approach seems like the way to go, and we will likely implement it sooner or later (probably sooner, but after the upcoming R2 Beta2). It should also be noted that a user would typically need such a mechanism only in some domains – e.g. I really feel the need for such protection in my “work” domain, but not in any other. But that, of course, depends on how one partitions their digital life into security domains.
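As a rough illustration of the decision step in the second approach, here is a small Python sketch. Nothing like this exists in Qubes yet, so everything in it is hypothetical: the Action names, the idea of recognizing already-trusted files by a SHA-256 digest recorded earlier, and the ask_user callback standing in for a prompt.

```python
import hashlib
from enum import Enum


class Action(Enum):
    OPEN_NATIVELY = "open natively"
    CONVERT = "convert to a trusted PDF"
    OPEN_IN_DISPVM = "open in a Disposable VM"


def decide(content: bytes, trusted_digests: set, ask_user) -> Action:
    """Pick an action for a file that is about to be opened.

    Hashing does not parse the file, so this check is safe to run in the
    trusted VM even on hostile content.
    """
    digest = hashlib.sha256(content).hexdigest()
    if digest in trusted_digests:
        # Known converted (or user-created) file: no prompt needed.
        return Action.OPEN_NATIVELY
    choice = ask_user()
    if choice is Action.OPEN_NATIVELY:
        # The user declared the file trusted, so remember that decision.
        trusted_digests.add(digest)
    return choice


trusted = set()
print(decide(b"untrusted invoice bytes", trusted, lambda: Action.CONVERT))
```

A real handler would also have to record the digest of every freshly converted file, and keep the list somewhere the untrusted documents cannot influence.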

One important detail worth mentioning here, is that we should unconditionally disable “Thumbnail View” in whatever file manager we use (which itself is really a stupid feature – can people not read filenames anymore or something?).

Qubes: from containers isolation down to apps protection

The mechanism introduced today, in addition to the Disposable VMs mechanism introduced earlier, represents a trend in Qubes development of “stepping down” into AppVMs in order to also make the VMs themselves somewhat more secure (in addition to the isolation between the VMs).

Originally Qubes aimed at container isolation only. This included protecting the system TCB, where techniques such as deprivileged networking stacks (and optionally also deprivileged USB stacks) have been deployed, as well as custom GUI virtualization, and a generally somewhat “hardened” Xen configuration. This also included protecting the VMs from each other, where techniques such as secure clipboard, secure file copy and generally secure qrexec infrastructure have been introduced, as well as a trusted GUI subsystem with explicit domain decorations.

But now, Qubes is stepping down into the AppVMs in order to make the VMs themselves also less prone to compromise. We surely will be working on more such mechanisms in the future. We still are only at the beginning of the quest to create a Reasonably Secure Desktop OS!

PS. The presented converter will be part of Qubes R2 Beta 2, which is expected to be released... in the coming days. Experienced users of Qubes R1 and R2 Beta 1 can install the converter immediately by building the rpms from the git repo.


PS. WTF is happening with the Blogger web interface? Seriously, I don't remember being as frustrated with any software in recent years as I am right now, when editing this post (as well as the last several ones). It sometimes honours the line breaks, sometimes does not, sometimes inserts a couple of new lines, sometimes removes them, sometimes mysterious spaces appear at the end of lines, sometimes those cannot be removed... It doesn't allow pasting a pre-formatted code listing (at least I couldn't figure out how to make it honour tabs). And yes, I'm using the "Compose mode", because when I try to switch to the HTML mode, not only am I overwhelmed with tons of HTML markup, nobody knows what for, but also when I switch back to the Compose mode, my article tends to get even more fucked up! Really, a shame. I wish I could go away to some other blogging service, but I'm afraid that converting all my posts would be an even bigger PITA... Sigh.

23 comments:

  1. Your raw RGB format is actually extremely close to the Windows BMP 24 so you could have used that. The header only really contains width and height.

    You could keep a pre-loaded disposable VM in memory to make disp vm startup faster. I do not know at what point of the loading portion they need to be "specialized"

  2. @nicolas: it's not "mine", it's ImageMagick's.

    Also @nicolas: how many Disposable VMs should you keep in a queue to look through a dozen of invoices in a folder?

  3. The best way to parse PDFs safely is still probably to parse them safely with safe code. Applying a band-aid on a turd still leaves you with a turd. interesting ideas though. right now I use a sacrificial machine for pdfs - kind of a real world physical equivalent of your VMs, and a little less bother at a little more hardware expense.

  4. I am just suggesting the same approach used to speed up other things (pre-loading tabs, linked pages, hard drive blocks,...). I assume that it could be tweaked based on available (meaning not used, or wasted...) memory. There is always a draw back.

  5. @dragosr: I don't quite understand what criteria you used to state that "safe code is the best way to parse PDFs safely"? The absolute safety of so called "safe languages" is a myth -- ever wondered why Singularity never took off?

    Also @dragosr: Using a "sacrificial machine" for PDF processing, whether it is real or virtual, is only good if you deal with unclassified, public PDFs only. But if you need to open a confidential contract, or NDA-protected documentation, then using a sacrificial machine would be irresponsible at best, and might even get you into legal problems if those confidential PDFs leak out of your "sacrificial machine", because every decent NDA would explicitly require you to treat the confidential material given to you with great care. And, then, on the other hand, it's also unreasonable to assume that every non-public PDF is non-malicious.

  6. Joanna nice post!!!!

    Have you ever thought on the following solution ?
    http://www.sans.org/reading_room/whitepapers/intrusion/animal-farm-protection-client-side-attacks-rendering-content-python-squid_33614

    Its used python and squid to render the pdfs.

  7. @Theodoros:

    1) Your solution is a centralized one -- not only does it create privacy concerns and cannot be used for end-to-end encrypted documents, but it is also much less secure, as if your "safe" parser got compromised then it will be able to steal all subsequent documents (and leak them out to China) as well as infect clients by serving them malicious PDFs.

    2) Your approach to process PDF is what I described in the article as a black list solution, with all the usual problems such solutions have.

    In other words I fail to see how this solution could be better than just using a somehow "hardened" PDF viewer. Actually the latter would be better because we avoid the centralized single point of failure, i.e. your proxy.

  8. ... of course we can also run such proxy in a Disposable VM, no problem, if somebody likes blacklisting approaches. Here the benefit would be that we don't have a centralized single point of failure that can later turn against us.

  9. i think this, and qubes in general, is an interesting approach to overall security. however, the complexity of it all seems rather high.

    there have been lots of exploits for adobe products published but comparatively fewer for open source pdf viewers, e.g. xpdf. how do you justify the increase in complexity versus the likelihood of various attack vectors, e.g. owned windows machine with an adobe pdf exploit?

  10. @Jake: if you consider RGB parsing to be "complex", then I wonder what you could possibly consider to be "simple"? ;)

  11. nice way of not answering my question. ;)

  12. Wow, I understand your approach, despite copy-paste shell scripting "ability;" I'm going to try out Qubes, I need at least one software environment that works.
    Probably the most useful thing I can contribute is this: You need Free(as in GNU) hardware for this to end well, O level ownership from microcircuitry to microcode on up, otherwise fake security(opacity) will trump real security. And a clean electricity/EMF environment or it may well "Blogger" on you due to institutional resistance to the "dangerous" notion you have a right to do things on your computer unmolested;)

  13. Hi Joanna,
    Nice post and an interesting approach. Some of the original motivation for the PDF format was saving space. The PDF is (to my knowledge) a vector format, which saves lots of space on big documents. Since you convert it back to RGB-based, it's uncompressed and I assume it'll take a lot of space, like BMP files mentioned by another comment. The next reasonable step in your approach would be IMHO to switch to better representation of graphical data, e.g. use JPG or others instead of "pure" RGB representation. I guess you've probably considered it by now.

    BTW I personally think that in terms of functionality and scalability, as you remarked with the Intel example, the solution must be white list, i.e. disassembling and building it back.
    The PDF is overall a good format, it just uses a little bit too much functionality. Your solution basically destroys the format completely :)

  14. Alon: A jpg parser is significantly more complex than a raw RGB one. In fact one could say that raw RGB data needs no parsing, just read into memory and display.
    If space is a concern then a simple run-length-encoding scheme which is trivial to implement safely would improve the situation.


    Jake: It is the premise that is important here, not a specific implementation. XPDF lacks a lot of features that modern PDF viewers offer and even so, I do not think that one can say with certainty that it is "safe code" since nothing has been formally proven. Even if there are no publicly disclosed XPDF vulnerabilities, the fact remains that it could still be vulnerable.

    I think that people should focus more on the ideas and premises behind Joanna's posts (which are all solid and offer tangible security benefits, something that is extremely rare in the defensive domain) than trying to come up with more user-friendly (yet inherently flawed) alternatives to the specific scenarios that she brings up.

    Compartmentalization works because it allows one to think about risk and exposure and *factor* them into his everyday decisions. It may not be as simple as click-and-forget but it is the best solution we have today and Qubes does an excellent job of balancing the scales.

  15. Actually the image compression code can be of arbitrary complexity, as one would normally be doing the compression in the trusted VM on the already verified RGB format. So, unless we fear attacks that could exploit a compressor by feeding it with "strange" bytes to compress (in contrast to feeding a decompressor with strange input file), a rather unthinkable situation for any real-world compressor IMO, we should be fine with any compressor.

  16. Security by correctness seems more relevant here. The parser can be small enough to allow the demonstration of its correctness. I mean a formal demonstration like one would do with a mathematical theorem (taking into account the way numbers are represented, memory limitations, etc). (You can show that whatever data (not necessarily pdfs but any random data) is fed to your program, it will always behave as expected.)

    All this parser needs to do is locate the safe areas (white listing) and transmit them to a second program which will turn them into a correct (stripped) pdf file (or something different).

    All this assumes we can trust system calls for disk access. If we can't, it means that storing, copying or reading it (e.g. with a hexadecimal reader) will compromise the system.

  17. @anonymous-who-thinks-one-can-formally-prove-a-pdf-parser: your comment made me laugh.

    Anyway, if you think this is all that simple, then where is the code? Everybody is good at talking, but few can actually write something, huh?

  18. I am actually working on it (and other things more or less related) on my free time. I am still documenting myself on the pdf format.

    Code comes last. You need a good design first. Things can be made simple on purpose. If the white list is pretty restrictive and hardcoded in the parser, I am confident it is doable.

    One does not have to do it in one go. One can first demonstrate that a function does what it is meant to do considering its ins and outs. Once done one can just assume the function works correctly and considere the bigger picture without going back on it.

    Most of the time the proof is done by exhaustion considering a general case and a lot of special cases (boundaries, overflow, etc).

    Automatising it is hard but doing it by hand is relatively easy (with common program structure : no self modifying code, etc). It is just time consuming but with a program small enough it is manageable.

    A parser that copy what it recognize and ignore the rest can be fairly small.

    anonymous-who-thinks-one-can-formally-prove-a-pdf-parser (and made you laugh)

  19. Let me say I admire your work Joanna and I am not a PDF expert either.
    Rather than go the "Raster" (RGB Bitmap) route I would go the "Vector" one and better yet using a Declarative e.g. XML-based format supporting both Raster and Vector data with robust parsers but without embedded Procedural Code like PDF and PS other than within well defined tags like "script" that can be readily stripped.
    This problem with embedded "code" exists for all these big major file formats including MS and Open Office files and such things as CAD files and will be a constant bugbear for security (and not limited to, but also including your Qubes). XML based file formats are a saving grace given the simplicity and ubiquity of tested parsers along with the data/code divide mentioned.
    Consider a series of transformations between formats with each one using a different parser that also strips out all code and macros (and so there will be no dynamism or animated imagery in the regenerated PDFs for example). If you do enough transformations with enough code stripping then nothing is going to get through as an attacker would have to target multiple vulnerabilities across multiple libraries in order to do so and such a chain can be arbitrary or random both in terms of components used and the length of the chain e.g.:

    PDF => ... => .... => PDF

    You are currently using Bitmaps but SVG would be a much better option:

    PDF => SVG => PDF

    There is open source available including multiple Apache PDF engines along with SVG2PDF and PDF2SVG projects available.

    Feel free to contact me offline to discuss further since I really wanted to follow Gandalf's advice and: "Keep it secret; Keep it safe..." before I actually started writing this...

  20. GSview PStoEdit:
    http://pages.cs.wisc.edu/~ghost/gsview/pstoedit.htm
    Converter to Vector Formats:
    http://www.pstoedit.net/pstoedit/
    Including SVG:
    http://www.helga-glunz.homepage.t-online.de/plugins/

    Assuming you can get the source.
    I can't help with needing to spawn a Disposable VM to securely do the conversion... BUT you should need to: (A) you can have a list of one more DVMs pre-running ready to go (B) you could reuse such PDF->SVG conversion DVMs after some number of jobs or for a session sending them back to sleep to await the next conversion job... Unfortunately you might have to compromise security for your sanity here but the SVG (XML) output of a behind the scenes PDF Converter should not propagate infections...

  21. Joanna

    the links to git.qubes-os.org all return 404. Would like to take a look at the wrapper scripts.

    thanks
    j

  22. @J.M. Porup: the repo has been renamed since that time, hence the links are giving 404s. The repo is now here:

    http://git.qubes-os.org/?p=joanna/antievilmaid.git;a=summary

    You should be able to find all the referenced files in this new repo.

  23. Sorry, wrong URL, this is the correct repo, of course:

    http://git.qubes-os.org/?p=joanna/qubes-app-linux-pdf-converter.git;a=summary
