
PDF/A in a Nutshell
Long-Term Archiving with PDF
Olaf Drümmer, Alexandra Oettler, Dietrich von Seggern

Accessibility; Contracts and Forms; High-volume PDF/A creation; PDF/A with Acrobat 8 Professional; Scanning documents to create PDF/A; PDF/A from Microsoft Office 2003 and 2007

PDF/A Competence Center


ISBN: 978-3-9811648-1-7

Bibliographic information published by Die Deutsche Bibliothek
Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie. Detailed bibliographic data is available at http://dnb.ddb.de.

This work and all its parts are protected by copyright. All rights, including translation, reproduction, presentation, use of illustrations and tables, radio broadcasting, microfilming, any other means of replication, and storage in data processing systems, are reserved. This also applies to extracts. Any replication of this work or of parts thereof, even in isolated cases, is only permissible in accordance with the currently valid version of the German copyright legislation of September 9th 1965. A copyright fee must always be paid. Violations fall under the prosecution act of German Copyright Law.

© 2007 callas software GmbH, Berlin
Published by Association for Digital Document Standards ADDS – PDF/A Competence Center, Berlin – www.pdfa.org
Translation: © 2007 Association for Digital Document Standards ADDS – PDF/A Competence Center, Berlin
Printed in Germany

The use of general descriptive names, trade names, trademarks, and so on, in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations or that they can be used by anyone.

Layout, design, and composition: Alexandra Oettler; Cover design: Anja Godolt; Cover picture: Sepp Huberbauer – photocase.com/de
Printing: Galrev Druck- und Verlagsgesellschaft Hesse & Partner OHG

Preface

Our world is getting more digital by the day. A lot of information and documents only exist in digital form today, but will they still be legible "tomorrow"? That was the theme of an interesting TV show appropriately called "The Digital Disaster". It began with cave drawings from the Stone Age and papyrus rolls from ancient Egypt, both of which have survived as documents for thousands of years. What documents from the 21st century will future generations be able to find and still read?

But it's happening much quicker than you may realize. I always carry a 3½ inch floppy disk in my pocket, and it demonstrates a lot of the problems of long-term archiving. It begins with the hardware: where can you buy a 3½ inch floppy disk today? And even if you find one, there's a good chance that the disk is physically damaged. If these two hardware hurdles are successfully cleared, then what kind of software or document will we find on the floppy disk? Are the appropriate viewing and processing programs still available? And this example is a mere 15 years old!

My short anecdote leads us to the demands placed on the long-term archiving of documents. Electronic archiving is critical for businesses and organizations, because documents today often only exist in digital format. The length of time that business documents have to be archived varies from sector to sector and country to country, but some examples can help us to get an idea. Federal laws often require an archiving period of around 10 years. Banks and insurance companies demand that customer dossiers be retained for more than 50 years. In engineering, archival periods of 100 years are common for aircraft, and bridges hopefully hold a whole lot longer.

Saving documents in proprietary formats for this length of time is really not a good idea. This leads to the second problem with the digital document world: many users already have a real "format zoo", which can quickly become unmanageable (if it isn't so already).
Proprietary document formats have to be migrated on a regular basis so that newer versions of the processing software can still read them.

Employees working on customer dossiers aren't really impressed when 10 different viewing programs are opened up at the same time. In some of the programs they might not even know how to navigate around in a document. In order to solve this problem, a document and archiving format is needed that guarantees the required long-term archiving period and offers the option of a single format type.

This is where PDF/A as an ISO standard for long-term archiving enters the stage. The "A" stands for "Archive", and the PDF/A standard was specifically created for long-term archiving. It envisions a single PDF/A archive for all documents in an organization, from input through to output, and includes all of the areas in between.

You will find many more advantages to PDF/A on the following pages, written with the aim of converting the very formal ISO standard into a form that is easily understood and enhanced with practical examples. Since PDF/A resolves a lot of the critical problems that users have, the PDF/A Competence Center was formed as an association with the aim of providing information about PDF/A, promoting the distribution of the standard, and acting as a central point of contact for your questions dealing with PDF/A. We hope that this booklet gives you a good overview and introduction to PDF/A, and also helps as a motivator for implementing the standard.

Berlin, in September, 2007
Thomas Zellmann,
Chairman PDF/A Competence Center

PS: A special thanks goes out to our member callas software GmbH, who initiated the German version of this booklet and provided it to the PDF/A Competence Center for translation into English and for further distribution.

Preface

Throughout history, it has always been important to preserve our past for future generations. Until the last 20 years, in our paper-centric world, this was a fairly easy task. One would simply take the folders of papers or other objects that were to be preserved and send them off to an archive for safekeeping or place them in a fire-retardant container. With electronic documents this task is not as easily approached, which is how PDF/Archive or PDF/A came into being.

PDF/Archive addresses the growing need to electronically archive documents in a way that ensures preservation of their contents over an extended period of time. Additionally, it ensures that the documents can be retrieved and rendered with a consistent and predictable result each time they are viewed.

AIIM, the Enterprise Content Management Association, and NPES – The Association for Suppliers of Printing, Publishing and Converting Technologies were approached by numerous organizations that were faced with the need to preserve large quantities of electronic documents over long periods of time. After reviewing the options of maintaining this electronic history in TIFF, XML, native format, or PDF, it was decided that PDF would be the best format, as it would enable the accurate rendering of the document as it had been intended to be displayed. However, in order to ensure the long-term preservation of the electronic documents, PDF would need to be enhanced slightly.

The joint effort of AIIM and NPES brought together the document and content management experts with the graphics experts who had already developed the PDF/X family of standards.
When we announced the proposed work to develop a subset of PDF tags for long-term preservation of electronic documents, we were overwhelmed by the interest in participating from virtually every area of the world.

As an accredited standards developer and the secretariat of ISO TC 171, Document Management Applications, and ISO TC 171 SC2, Document Applications, AIIM brought to the project the means for gaining ISO approval and wider adoption of the standard. ISO 19005-1, Document management – Electronic document file format for long-term preservation – Part 1: Use of PDF 1.4 (PDF/A-1) became an approved ISO standard within 22 months of its introduction as a new project, through the dedicated efforts of many records managers, archivists, software developers, and end users.

While adoption of the standard has been a little slower than we had anticipated, we are encouraged by the continuing interest in and growing adoption of the standard. This book, along with the continuing efforts of AIIM and the PDF/A Competence Center, will continue to increase the adoption rate of PDF/A in the industry.

Silver Spring, in September, 2007
Betsy Fanning
AIIM, Director, Standards

Table of Contents

Durable documents with the PDF/A standard
  Open files are not always complete
  TIFF as an archive format
  PDF data containers
  Why PDF/A and not PDF?
  The introduction of the PDF/A standard
  How to create archive PDFs
  Who stands to benefit from PDF/A?
  Table: Comparison between PDF/A-1a and PDF/A-1b
  Overview: Which file formats are suitable for archiving?
  Is XPS an alternative to PDF/A?

PDF/A creation: Analog, digital, and mass processing
  PDF/A from scanned documents
  Scanning options in Acrobat 8 Professional
  Converting pages that have already been scanned to PDF/A
  The Distiller engine
  PDF/A document generation using the Distiller
  Office and administration
  PDF/A in Office 2007
  Office 2003 and the PDFMaker
  PDF/A using the 3-Heights PDF Producer
  PDF/A ‘en masse’
  PDF/A ‘from nothing’
  Creating PDF/A from print data streams

Illustrations: PixelQuelle.de

From PDF to PDF/A: Converting PDFs to archive PDFs
  PDF/A generation with Preflight
  Converting PDF to PDF/A with pdfaPilot

Is this really a PDF/A file? PDF/A validation
  Validation with Preflight
  pdfaPilot PDF/A

Archive PDFs in everyday life: What issues might arise?
  Images
  Resolution is not part of the PDF/A standard
  Permitted and prohibited compression
  PDF/A and metadata
  Accessibility
  Creating an accessible PDF file from Word
  Interactive PDF files
  Comments and annotations
  Forms
  Embedding fonts for PDF/A forms
  PDF/A for design drawings
  Electronic signatures
  Security levels
  Digital signatures in PDF with Acrobat
  Challenges in practice

Illustrations: photocase.com/de

The outlook: PDF/A in the future
  Enhancements in PDF/A-2
  Looking towards PDF/A-3
  PDF/A-1 developments
  PDF/A in one hundred years time

What the error messages mean
  Preflight results and troubleshooting for PDF/A

Glossary
  Explanation of terms relating to PDF/A

About:
  The PDF/A Competence Center
  AIIM

Photo: Sepp Huberbauer – photocase.com/de

1. Durable documents with the PDF/A standard

There are certain documents that people want to keep because of their sentimental value: love letters, photographs of their first day at school, or holiday snaps, for example. Other documents have to be kept for legal reasons. These documents include birth certificates, academic certificates and reports, invoices that are needed for tax purposes, insurance documents, and contracts.

In the days when everything existed on paper – in the pre-digital era – the main problem was remembering which index file, folder, or shoe box you'd used to store your letters or contracts. In today's world of digital documents, the task of archiving is fundamentally different. Thanks to search functions or database solutions, even the most forgetful of us can easily find a particular document or photo on our computers. In addition, any possible space problems can be solved simply by purchasing additional storage. However, there are certain risks and uncertainties that might influence the shelf life of digital documents. These risks do not arise only from the physical durability of the data carriers used, although it is clear that magnetic tape, CD-ROMs, and DVDs will not necessarily last any longer than paper and ink. Photographic prints dating from 1900 still exist today, but it is debatable whether we will similarly be able to view, in 2107 for example, the millions of digital snapshots being taken and stored on mobile phone memory cards all over the world.

In addition to the restrictions imposed by the limited lifetime of data carriers, the document format and software used also present a considerable challenge for the durability of electronic documents.

Photo: Markus Imorde – photocase.com/de

Yesterday's, today's, and tomorrow's software
It's a common problem: opening old documents in brand-new programs doesn't always work. The rate of success for the opposite direction (new documents in old programs) is even less encouraging. Software developers do try to achieve backward compatibility that enables files that are, say, five years old to be opened using a current program release. However, this can change the layout and page rendering, meaning that not everything is displayed exactly as it ought to be. More recent software tends to generate documents with additional features that older versions may not be able to display. In some cases, it is not even possible to open current files in previous versions of a program. For example, whereas a Microsoft Word 95 file can normally be opened in Word 2003, it is not possible to open a Word 2003 document in Word 95.

Because software production cycles are becoming ever shorter – one major release per year is not unusual – the challenge that arises from new program developments is greater than that caused by the aging of storage media. The successful long-term archiving of digital files is at least as threatened by the constant rollout of new program versions as by damaged data or data carriers.

Open files are not always complete
File formats are not all equally suitable for the long-term, secure archiving of content. If a file format cannot store all the elements required for the complete display of content – graphics and fonts as well as text – then stumbling blocks cannot be ruled out when someone tries to use the file later on. If, for example, the program used cannot find linked external images, a page cannot be displayed as required. Instead, the frame where the image should appear displays only a rough preview of the image or a question mark. The problem of open files for which not all illustrations and fonts are available has been causing irritating delays for printers and their suppliers for a long time. However, the introduction of PDF, a format that can store all the components required for a printed document, has greatly simplified work in this area. In addition, layout files such as XPress or InDesign are now becoming increasingly less common in printers' archives. Instead, printers are storing the actual PDF documents that were used for the printing task.

TIFF as an archive format
For a long time, many public authorities and companies that need to store large quantities of correspondence, records, invoices, contracts, and similar information in digital archives have been using the pixel image format TIFF (Tagged Image File Format). This format digitizes source documents containing text and images pixel by pixel. TIFF is an established image file format that has both advantages and disadvantages. Pixel-based formats store the appearance of the source document. Problems with missing graphics and fonts do not occur, since the format stores all of the page elements as an image. Since TIFF is widespread and is subject to few file handling complications when upgrading to a new program version, many users believe that the future of the format is guaranteed. However, while TIFF may indeed be a de facto standard, it is not an official norm for safe archiving. Other disadvantages include the relatively large file size and the fact that scanned texts cannot be searched without OCR (text recognition), since this format converts them to image elements.

Note: TIFF-G4 – a black and white TIFF variant that works with a compression method developed for fax technology – is commonly used for archiving.

PDF data containers
The development of PDF (Portable Document Format), which Adobe Systems has been promoting since 1993, has significantly simplified data management and exchange for a great number of users from completely different fields. PDF allows obstacles that can arise during the transmission or storage of files to be neatly avoided.

PDF files can be opened on all established operating systems. Free PDF readers are available for all of the important platforms including Windows, Apple, Linux, and mobile devices.

With this format, the document layout is true to the original. Since PDF can incorporate different types of content such as text (and the relevant fonts), images, and graphics, nasty surprises relating to missing illustrations or incorrect fonts – like those that occur when Word documents are opened on another computer – are not usually a problem.

PDF is an open format. This means that companies other than Adobe Systems (who invented PDF) can develop software for creating or displaying PDF. The ‘release’ of PDF by Adobe has brought independence for both users and developers and, as a result, there is a high probability that there will still be programs for generating and displaying PDFs in decades to come.

So, can users who want to keep documents such as contracts or invoices for long periods of time trust in PDF to make sure that their documents will work just as well in ten, fifteen, or one hundred years' time as they do today? It might well be the case that PDF files created today will still work without any significant problems in 2017. However, only the new PDF/A standard can guarantee that users will be able to view exactly the same content as when their documents were created. This format brings the kind of legal certainty that can be decisive in many business and administrative contexts.

Note: One of the reasons for the global success of PDF is the free availability of the Adobe Reader. This PDF viewer is available for download from Adobe Systems' Web site in many language versions for numerous commonly […] tep2.html

PDF specifications:
Since it was introduced at the start of the 1990s, the PDF file format has been in a state of constant development. The current PDF specification is version 1.7, which was introduced with Acrobat 8. Today, it is extremely rare to come across PDF files with a version number lower than 1.3, and modern PDF generation programs only have backward compatibility to version 1.3 at the most.

With each PDF version, Adobe Systems publishes a reference that describes the features and functions of the version in detail. The specification history contains ‘milestones’ – important features that were introduced with the new version. Some of these milestones are listed below.

Acrobat 1 (1993, PDF 1.0): PDF 1.0 incorporates most of the functions offered by the page description language PostScript Level 2. All basic functions for text, vector graphics, and raster graphics are available.
Acrobat 2 (1994, PDF 1.1): This version supports the Lab color space and CalRGB. It also supports TrueType fonts.
Acrobat 3 (1996, PDF 1.2): This version enables color separation and supports Unicode and CID fonts (Chinese, Japanese, and Korean). It also supports ZIP compression.
Acrobat 4 (1999, PDF 1.3): PDF 1.3 contains the complete PostScript Level 3 graphics model. It enables multichannel color spaces (DeviceN) and supports ICC profiles for the reliable reproduction of colors. It introduces smooth shades and page geometry boxes, which are useful for prepress processes (TrimBox, CropBox, and BleedBox).
Acrobat 5 (2001, PDF 1.4): From this version, PDF files can contain transparency. This version also introduces ‘tagged PDF’ (structured PDF), which enables content accessibility. The security options are enhanced with this version. In addition, the image compression type JBIG2 is supported.
Acrobat 6 (2003, PDF 1.5): With this version, PDF documents can contain layers (also called ‘optional content’). JPEG2000 image compression is supported.
Acrobat 7 (2004, PDF 1.6): This version supports OpenType fonts. With this version, 3D content can be inserted. Users can create virtual page sizes with edges of up to 381 km in length.
Acrobat 8 (2006, PDF 1.7): Unicode path specifications simplify the correct specification of links, even across international language systems. The new Acrobat ‘PDF packages’ function allows several independent PDF documents to be forwarded in a single file. The recipient requires Acrobat or Reader 8.

Why PDF/A and not PDF?
Why has a special PDF standard now been defined for archiving documents? Are traditional PDF documents not ‘good enough’ for long-term archiving? PDF has some excellent characteristics that lend themselves to the creation of archive documents. Like a container, a PDF can incorporate completely different elements such as text, images, and fonts. In addition, it reproduces layouts that are true to the original and is cross-platform capable. However, certain requirements must be met in order to enable the exact reproduction of content.

Required: One ‘must’ is that users require full access to all elements belonging to a document. For example, fonts must be embedded – a link to the font in question is not sufficient. Otherwise, if in ten years' time a user who tries to open a document does not have a required font on his or her computer, special characters or symbols will not be displayed correctly.

Prohibited: In addition, some PDF features must be avoided. Such elements are prohibited because they would undermine the required document durability, and include interactive elements and PDF layers. These features undermine the unambiguity that is required of an effective PDF/A file. For example, in the case of a PDF document with layers, users printing it out in 50 years' time might well ask themselves which layers are valid and which are not. This kind of decision needs to be made now – when the PDF is created.

A PDF/A document is basically a traditional PDF document that fulfills precisely defined specifications. In order to prevent users from repeatedly having to test and discuss what a well-functioning archive PDF should look like, industry experts decided in 2002 to work together to develop the PDF/A standard.

The introduction of the PDF/A standard
The PDF/A standard for long-term archiving was adopted by ISO (International Organization for Standardization) in autumn 2005. The PDF/A standard was published with the number ‘ISO 19005-1:2005’ and is based on PDF specification 1.4. An additional part, PDF/A-2, is currently being prepared. It will be based on PDF version 1.7.

The PDF/A standard aims to enable the creation of PDF documents whose visual appearance will remain the same over the course of time. These files should be software-independent and unrestricted by the systems used to create, store, and reproduce them. As far as PDF/A is concerned, practice soon caught up with theory. While Acrobat Professional 7 contained only ‘draft’ PDF/A functions, Acrobat 8, which has been available since the end of 2006, now offers creation and verification features that comply with the adopted standard.

Many new PDF/A tools and solutions for creating and verifying files have entered the market since the introduction of the standard – from ‘small’ tools for individual users who want to create PDF/A documents every now and again to extensive server solutions that can create a hundred thousand archive documents from databases in just a few hours.

Note: ISO is an international organization for standardization, active primarily in technical and electronic fields. The PDF/A standard was developed by industry and development experts.

PDF/A Competence Center
International companies and experts from the field of PDF technology have joined forces to form the PDF/A Competence Center. It aims to promote the exchange of information and experiences relating to long-term archiving. Users can visit www.pdfa.org for up-to-date advice and background information as well as a discussion forum on PDF/A.

PDF/A has two levels of compliance:
PDF/A-1a (Level A) applies to semantic correctness and structure. Each character must have a Unicode equivalent. The structure is expressed by tags.
PDF/A-1b (Level B) applies to visual integrity.
Any file that meets the requirements for PDF/A-1a will also comply with PDF/A-1b, which is less strict.
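As an illustration of the two compliance levels, the sketch below reads the level a file claims for itself and compares it with the level an archive requires. It relies on the PDF/A identification entries in the file's XMP metadata (the properties pdfaid:part and pdfaid:conformance), a detail of the standard that this chapter does not describe. The function names are hypothetical, and the simple byte scan assumes that the XMP packet is stored uncompressed and UTF-8 encoded. It reports only what the file declares and is no substitute for validation with a proper tool.

```python
import re
from typing import Optional

# XMP may serialize a property as an element (<pdfaid:part>1</pdfaid:part>)
# or as an attribute (pdfaid:part="1"); both forms are matched here.
_PART = re.compile(rb'pdfaid:part(?:>|=["\'])\s*(\d)')
_CONFORMANCE = re.compile(rb'pdfaid:conformance(?:>|=["\'])\s*([A-Za-z])')

def claimed_pdfa_level(data: bytes) -> Optional[str]:
    """Return the declared level, e.g. 'PDF/A-1b', or None if none is declared."""
    part = _PART.search(data)
    conformance = _CONFORMANCE.search(data)
    if part is None or conformance is None:
        return None
    return f"PDF/A-{part.group(1).decode()}{conformance.group(1).decode().lower()}"

def meets_level(claimed: Optional[str], required: str) -> bool:
    """Level A includes Level B, so a PDF/A-1a claim also satisfies PDF/A-1b."""
    if claimed is None:
        return False
    return claimed == required or (claimed == "PDF/A-1a" and required == "PDF/A-1b")

# Hypothetical metadata fragment, as it might appear inside a PDF file.
sample = b'<rdf:Description pdfaid:part="1" pdfaid:conformance="A"/>'
print(claimed_pdfa_level(sample))                        # PDF/A-1a
print(meets_level(claimed_pdfa_level(sample), "PDF/A-1b"))  # True
```

To check a real file, its bytes could be passed in with `claimed_pdfa_level(open(path, "rb").read())`, where `path` is whatever file is being inspected. A declared level is only a claim made by the creating software, which is why the result should be treated as a first filter at most.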

How to create archive PDFs
There are many different conditions that might be encountered when creating PDF/A files. The process differs depending on whether existing PDF documents are already available or whether they need to be generated from working files such as Word or PowerPoint files.

PDF/A files from files or data: This field relates to newly created PDF files from applications including word processing, image editing, and layout programs. The process can be realized by means of a PDF export from the source program, Acrobat Professional, Distiller, or other PDF converters. For the mass conversion of content to PDF/A, there are program modules that can convert database content or print data streams to the PDF/A format.

Converting scanned paper documents to PDF/A: Often, documents that exist only on paper, such as contracts, invoices, and books, need to be digitized using a scanner. Over the past years, the results of the scanning process have usually been stored as bitmap TIFFs. However, PDF is increasingly being used for scanned documents, and before long the majority of scanned files will probably be stored directly as PDF/A files. For example, users can scan paper documents using Acrobat 8 Professional and save them as PDF/A files. It is often possible to make the text in a PDF/A file searchable using the text recognition function (OCR). Images and historical documents can also be scanned for conversion to PDF/A. Solutions and services for mass processing are available for users who wish to scan a large number of pages or documents.

Creating PDF/A from PDF: Many users already have PDF documents that are not PDF/A-compliant. It is often not possible to recreate such documents from the source program because, for example, they were not created locally but were sent to the user in question by e-mail. There are several methods for converting PDF files to PDF/A. Acrobat 8 Professional is one of the applications that can be used. However, Adobe is not the only company to market software for this particular task. There are many different products on the market, ranging from single-user solutions to systems for high throughput.

Is this really a PDF/A file? When working with PDF/A on a daily basis, file verification is also important. Is it sensible to believe the sender of a PDF document when he or she says that it's a PDF/A file? Before received files are saved in an archive, they must be checked to make sure that they are PDF/A-compliant. There are various tools that enable file verification: in addition to Acrobat 8 Professional, there are other applications including Berlin-based callas software's pdfaPilot, which enables the verification and creation of PDF/A files as well as providing some additional functions.

Figure: Converting PDF to PDF/A: Acrobat 8 Professional provides an export function for PDF/A in addition to other formats.

Figure: Converting paper documents to PDF/A: Scanned documents can be automatically given searchable text following digitization. Text recognition software is used for this.

Photo: Dirk Herold – photocase.com/de

Who stands to benefit from PDF/A?
Many sectors and professions have been waiting for a PDF standard for archiving. It is useful not only for archives, administrative departments, industry, and commerce but also for research and teaching. Many different types of content can be saved as PDF/A files. Below are a few randomly selected examples from various fields.
Saving e-mails as PDF/A: Today, more and more correspondence, some of it of a contractual nature, is being sent by e-mail. Anyone who has switched from one e-mail program to another knows the difficulties involved in transferring old mail to the new system. Since PDF/A is a safe format, it makes sense to save e-mail archives on back-up media in the form of PDF/A at regular intervals.

Saving brochures, manuals, and information sheets as PDF/A: Many companies and public authorities already make a large quantity of information available in the form of PDF downloads. Why not create these documents in the future-proof PDF/A format straight away and distribute them as PDF/A files?

Figure: PDF/A validation with Preflight: The Preflight validation and correction tool is part of Acrobat 8 Professional. It generates PDF/A files and checks existing PDF/A documents to make sure that they comply with the standard.

Accessible PDF files: In America, accessibility in the digital world has been an issue for a long time – especially for the Internet. Making information accessible to visually impaired members of society is now also on the agenda in Europe. Since PDF/A specifically supports structured content in PDF documents, it is ideal for producing accessible PDF documents that can be read out by screen readers.

Storing print documents as PDF/A: Printers and prepress companies will be relieved to hear that the PDF/X standard that is widespread in their industry sectors is completely compatible with the new PDF/A standard. A PDF document can be simultaneously PDF/X and PDF/A-compliant.

Design dra
