Office 2007 features a new default XML file format that opens a new world of development scenarios to .NET developers. This article explains the developer benefits provided by the new Office Open XML File formats that will be released as part of the 2007 Microsoft Office System. After reading this article you will understand the architecture of the new file formats as well as understand how the new formats enhance Office development scenarios. In addition, the article discusses a sample application that illustrates how the new file formats enable Office document generation on the server.
Cool Thing #1: The New Files are Open and XML
The new file formats are a logical progression of the XML work done in the most recent versions of Office. The last three versions have incrementally increased the XML capabilities of Office applications to the point that, today, it is possible to generate Office documents through their respective XML vocabularies (for example, WordML and SpreadsheetML) without manipulating the Word and Excel object models. Given the ubiquity of XML and the XML features already included in Office 2003, the new default file formats should be viewed as a positive step forward because they place XML front and center in Office.
Architecture–It’s a ZIP file!
At first glance, the most obvious result of the new file formats is that they have a new, four-character file extension. For example, a Word 2007 document is .docx, an Excel 2007 workbook is .xlsx, and a PowerPoint 2007 presentation is .pptx (see Sidebar 1 for a full listing of the new file formats). In reality, any file created with these applications is just a standard ZIP file. You can change its extension to .zip, open it with your favorite ZIP file tool, and view the contents. Let’s take a look at a typical .docx file.
Figure 1. The Contents of a Word 2007 Document Package’s Word Folder
Take another look at Figure 1 and notice how the content of the word folder is split into a series of XML files. Each of these XML files is a document part that stores a specific portion of content. For example, all header content resides in the headerN.xml files (header1.xml, header2.xml, and so on), the fonts in the fontTable.xml file, and the document text and WordprocessingML tags in the document.xml file. This architecture makes manipulating documents as easy as opening a ZIP file, finding the desired document part, and editing or swapping out its content. This is something I will explain in more detail at the end of the article.
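Because the package is an ordinary ZIP file, any ZIP-capable tool can list its parts. The following is a minimal sketch in Python, using only the standard zipfile module rather than the .NET classes discussed later in this article. The list_parts and build_demo_package names are my own, and the demo package is a tiny stand-in built in memory, not a valid Word document.

```python
import io
import zipfile


def list_parts(package):
    """Return the sorted names of every part stored in a document package.

    `package` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(package) as zf:
        return sorted(zf.namelist())


def build_demo_package():
    """Build a tiny stand-in package in memory (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr("word/header1.xml", "<hdr/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Print one part name per line, much like browsing the ZIP by hand.
    for name in list_parts(build_demo_package()):
        print(name)
```

To try it against a real document, pass the path of a .docx file to list_parts instead of the in-memory demo package.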
The package contains all the parts that comprise a document, and the parts contain the different elements that combine to build a document. Relationships glue the parts together by defining how they are linked, which makes them extremely important. Below is an example relationships document named document.xml.rels that specifies the relationships for the Document.xml file contained in the document package:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<Relationships>
  <Relationship Id="rId8" Type="http://schemas.microsoft.com/office/2006/relationships/officeDocument" Target="glossary/document.xml" />
  <Relationship Id="rId3" Type="http://schemas.microsoft.com/office/2006/relationships/wordStyles" Target="styles.xml" />
  <Relationship Id="rId7" Type="http://schemas.microsoft.com/office/2006/relationships/wordFooter" Target="footer1.xml" />
  <Relationship Id="rId2" Type="http://schemas.microsoft.com/office/2006/relationships/wordLists" Target="lists.xml" />
  <Relationship Id="rId1" Type="http://schemas.microsoft.com/office/2006/relationships/wordFontTable" Target="fontTable.xml" />
  <Relationship Id="rId6" Type="http://schemas.microsoft.com/office/2006/relationships/wordHeader" Target="header2.xml" />
  <Relationship Id="rId5" Type="http://schemas.microsoft.com/office/2006/relationships/wordHeader" Target="header1.xml" />
  <Relationship Id="rId4" Type="http://schemas.microsoft.com/office/2006/relationships/wordSettings" Target="settings.xml" />
  <Relationship Id="rId9" Type="http://schemas.microsoft.com/office/2006/relationships/theme" Target="theme/theme1.xml" />
</Relationships>
This document defines nine relationships for Document.xml. For each relationship, the document defines an identifier (Id), the relationship type (Type), and the location of the related file (Target). These relationship definitions are very important, as part names do not persist across saves. Thus, when Word opens a document package, it uses the relationships defined in the package to locate the document parts and build out the document in the Word UI.
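To make the lookup concrete, here is a small Python sketch (standard library only) that reads a relationships document and finds parts by relationship type. The sample XML is an abbreviated copy of the listing above, wrapped in a Relationships root element. The read_relationships and targets_of_type helpers are hypothetical names of mine, and the code ignores any XML namespace on the elements, which is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the relationships listing (three of the nine entries).
SAMPLE_RELS = """<Relationships>
  <Relationship Id="rId7" Type="http://schemas.microsoft.com/office/2006/relationships/wordFooter" Target="footer1.xml" />
  <Relationship Id="rId6" Type="http://schemas.microsoft.com/office/2006/relationships/wordHeader" Target="header2.xml" />
  <Relationship Id="rId5" Type="http://schemas.microsoft.com/office/2006/relationships/wordHeader" Target="header1.xml" />
</Relationships>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def read_relationships(rels_xml):
    """Map each relationship Id to a (Type, Target) tuple."""
    root = ET.fromstring(rels_xml)
    result = {}
    for element in root:
        if local_name(element.tag) == "Relationship":
            result[element.get("Id")] = (element.get("Type"), element.get("Target"))
    return result


def targets_of_type(relationships, type_suffix):
    """Return the sorted targets whose Type ends with '/<type_suffix>'."""
    return sorted(
        target
        for rel_type, target in relationships.values()
        if rel_type.endswith("/" + type_suffix)
    )


if __name__ == "__main__":
    rels = read_relationships(SAMPLE_RELS)
    print(targets_of_type(rels, "wordHeader"))
```

Looking parts up by relationship type rather than by file name is the point: because part names do not persist across saves, the Type value is the stable way to find, say, the headers.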
Cool Thing #1 Key Takeaway: The Office Open XML file formats free developers from the tyranny of the binary file format. The new files are now wide open and accessible from any application without manipulating the Office application object models. If you are tempted to say, "Big deal! I can do whatever I want with Office files already!", keep reading.
To Learn Even More: Download (and read!) the Microsoft Office Open XML Formats Guide here.
Cool Thing #2: The New Formats Make Anything Possible
Well, they may not bring about world peace, but the new formats sure make developing solutions that include Office files a lot less stressful. No longer does a developer have to go through the Office applications just to make wholesale changes to one or more documents. Instead of needing to know the intricacies of an Office application’s object model, a developer only needs to know how to peruse the contents of a ZIP file (the document package) to find and edit the desired document parts (the various XML files included in the package). If you think about it for a few minutes, it doesn’t take much brainpower to come up with some interesting new Office-related developer scenarios.
Here are a few application ideas I thought of while researching this article:
- Project Document Generator: I am in the services business, and anytime I win a new project, I need to create several new documents: Contract, Work Order, Change Order Template, etc. With the new format, I could now build an application that allows me to select a customer from my CRM system, input a project name, and generate all required documents while I sip on my Starbucks.
- Report Writing: Office applications were never meant to be automated on a server. But by taking advantage of the new file formats, I could now generate thousands of Word, Excel, or PowerPoint files on the server without a worry. A good idea for an app would be a reporting application that e-mails Excel files showing key activity metrics to insurance agents.
- Theme Library: This application would allow me to fulfill my creative design dream of selling document designs to Office users all over the planet. It would allow a user to browse different themes from the Web or on their file system and apply them to the open document. Even better, they would be able to pick a theme and apply it to hundreds or thousands of documents at a time.
Each of these ideas is made possible (or made simpler) by the new, open file formats.
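As a rough illustration of the Project Document Generator idea, the Python sketch below copies a template package and fills in placeholder tokens in its main document part. Everything specific here is an assumption of mine: the fill_template name, the {{Customer}}-style token syntax, and the idea that each token sits in a single run of text (Word can split text across several XML elements, which this sketch does not handle).

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_template(template, output, values, part_name="word/document.xml"):
    """Copy a package, replacing {{key}} tokens in one XML part.

    `template` and `output` may be file paths or binary file-like objects.
    Values are XML-escaped before they are inserted.
    """
    with zipfile.ZipFile(template) as src, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == part_name:
                text = data.decode("utf-8")
                for key, value in values.items():
                    text = text.replace("{{" + key + "}}", escape(value))
                data = text.encode("utf-8")
            # Every other part is copied through unchanged.
            dst.writestr(info.filename, data)


if __name__ == "__main__":
    # Hypothetical in-memory template standing in for a real contract.
    template = io.BytesIO()
    with zipfile.ZipFile(template, "w") as zf:
        zf.writestr("word/document.xml", "<w:t>Contract for {{Customer}}</w:t>")
    result = io.BytesIO()
    fill_template(template, result, {"Customer": "Contoso"})
    with zipfile.ZipFile(result) as zf:
        print(zf.read("word/document.xml").decode("utf-8"))
```

Running the same function in a loop over customer records is all it would take to produce a batch of documents on a server, with no Office application involved.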
Cool Thing #2 Key Takeaway: The new XML file formats create new possibilities when developing solutions for your clients by freeing Office files from their binary format, as well as from their host application’s object model. In addition, Microsoft has submitted the new file formats to ECMA for standardization. Once accepted, the ECMA standardization will encourage widespread implementation and adoption of the file formats by non-Microsoft applications. The end goal here is to increase Office document interoperability across platforms and back-office applications.
To Learn Even More: Download (and read!) the Microsoft Office Open XML Formats Architecture Guide here. Take a look at the ECMA proposal here. The initial draft of the proposal can be downloaded here.
Cool Thing #3: You Can Add Your Own Content to the Document Package
Like many other Microsoft technologies, the document packages are designed for extensibility. It is possible to include custom XML data inside the package and then reference that data from the document. The custom data resides in a special folder named dataStore inside the package’s word folder (or excel for Excel, ppt for PowerPoint). This means it is still possible to attach XML schemas to documents and include dynamic XML data. It also means that whenever the custom XML data is updated, those changes will be reflected in the document.
Cool Thing #4: The System.IO.Packaging Namespace
If you haven’t yet wondered how to manipulate the new file formats in code, you were bound to soon. The answer is that the System.IO.Packaging namespace contains all the classes you need to code against the new formats. The new Office file formats are based on the Open Packaging Conventions (as is the XML Paper Specification, or XPS); the packaging classes are part of Windows Presentation Foundation, which will be released with Windows Vista.
Although you could use any tool that can manipulate ZIP files, you don’t need one: the System.IO.Packaging namespace is designed for this purpose. It makes manipulating the document package as simple as opening a ZIP file, querying the relationships for the desired content types, and then adding, editing, or deleting files.
The key objects (or at least for the sample discussed next) are defined in Table 1.
Table 1. Key Classes included in System.IO.Packaging.
- Package: Represents the .docx, .xlsx, .pptx, etc., document package (or any other file that conforms to the Open Packaging Conventions, such as an XPS document). This is the top-level class and should be used to open a document package and browse its contents.
- PackagePart: Represents a file object stored in a Package. This is typically one of the various XML document parts, but can also be an image, a binary object, etc. This class contains all information for a referenced part, including its relationships, location in the package, and content.
- PackageRelationship: Represents a relationship between a Package or PackagePart and a target PackagePart (for example, Document.xml and Header1.xml). By looking up a relationship of a package or package part, you can find the location of other package parts and navigate to them.
- PackUriHelper: Contains methods for composing and parsing package URIs. Primarily used to locate a package part from a specified URI.
Cool Thing #4 Key Takeaway: The new file formats even come with a specialized set of classes contained in the System.IO.Packaging namespace. These classes provide a developer with the ability to create, modify, and delete document packages.
To Learn Even More: Download the WinFX SDK here. Also read Kevin Boske’s blog (not many posts right now, but I am sure more will follow). Kevin is the Office Programmability Program Manager, so his blog is worth subscribing to.
Cool Thing #5: The New File Formats Are Easy to Build Upon (Sample Application Overview)
Once you get the hang of it, the new file formats are really simple to develop against (with one minor issue – see Dev Tip below). To demonstrate, I created a sample Web page that allows a user to pick a document and then select between several options for headers and footers (see Figure 2). Once the user makes their selections, they can press the "Build Document" button causing the Web page to open the selected document and insert the desired header and footer.
Figure 2. Sample Web Page
Listing 1 contains the complete listing of the code behind the page:
Listing 1. The default Page Code Listing of the Sample Web Application
Imports System.IO
Imports System.IO.Packaging

Partial Class _Default
    Inherits System.Web.UI.Page

    Private Sub InsertParts(ByVal filePath As String, ByVal PartName As String)
        Dim relType As String = "http://schemas.microsoft.com/office/2006/relationships/officeDocument"
        Dim partRelType As String
        Dim newPartPath As String
        '//Determine what type of part we are inserting
        '//and set reference to the appropriate relationship type.
        '//(The Case labels and the lstHeaders control name are assumed here.)
        Select Case PartName
            Case "Header"
                partRelType = "http://schemas.microsoft.com/office/2006/relationships/wordHeader"
                newPartPath = lstHeaders.SelectedValue.ToString
            Case "Footer"
                partRelType = "http://schemas.microsoft.com/office/2006/relationships/wordFooter"
                newPartPath = lstFooters.SelectedValue.ToString
            Case Else
                Exit Sub
        End Select
        Dim pkgPart As PackagePart
        Dim docUri As Uri
        '//Open the package with Read/Write permission
        Dim pkg As Package = Package.Open(filePath, IO.FileMode.Open, IO.FileAccess.ReadWrite)
        '//Start from the package-level relationships
        Dim rel As PackageRelationship
        '//Peruse the relationships of document type
        For Each rel In pkg.GetRelationshipsByType(relType)
            Dim u As New Uri("/", UriKind.Relative)
            '//Find the full path of the Target URI
            docUri = PackUriHelper.ResolvePartUri(u, rel.TargetUri)
            '//Retrieve the part
            pkgPart = pkg.GetPart(docUri)
            '//If we are at the Document part, take action.
            If pkgPart.Uri.OriginalString = "/word/document.xml" Then
                '//Find and delete the current Part
                Dim rel2 As PackageRelationship
                '//Peruse the relationships of Document.xml
                '//to find the desired part.
                For Each rel2 In pkgPart.GetRelationshipsByType(partRelType)
                    Dim partURI As Uri = PackUriHelper.ResolvePartUri(docUri, rel2.TargetUri)
                    Dim hdrPart As PackagePart = pkg.GetPart(partURI)
                    '//Delete the existing Part
                    pkg.DeletePart(hdrPart.Uri)
                    '//Create a new package part that will store the new part
                    Dim pkgPartNew As PackagePart = pkg.CreatePart(partURI, System.Net.Mime.MediaTypeNames.Text.Xml)
                    '//Insert the contents of the new part
                    Dim fs As FileStream = New FileStream(newPartPath, FileMode.Open, FileAccess.Read)
                    Dim target As Stream = pkgPartNew.GetStream()
                    CopyStream(fs, target)
                    target.Close()
                    fs.Close()
                Next
            End If
        Next
        '//Close saves the package.
        pkg.Close()
    End Sub

    Private Sub CopyStream(ByVal source As Stream, ByRef target As Stream)
        '//Copy in fixed-size chunks instead of sizing the buffer to the whole source
        Dim bytes(4095) As Byte
        Dim numBytes As Integer
        numBytes = source.Read(bytes, 0, bytes.Length)
        While numBytes > 0
            target.Write(bytes, 0, numBytes)
            numBytes = source.Read(bytes, 0, bytes.Length)
        End While
    End Sub

    Protected Sub btnBuildIt_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles btnBuildIt.Click
        '//Insert the selected header, then the selected footer, into the selected document
        InsertParts(lstDocuments.SelectedValue.ToString, "Header")
        InsertParts(lstDocuments.SelectedValue.ToString, "Footer")
    End Sub
End Class
The application logic follows this flow:
1. The click event of the Web page’s button control (btnBuildIt) makes two calls to the InsertParts method included in the page’s class. In each call, I specify the selected document’s location and the type of part (header or footer). The lstDocuments control provides the path of the document via its SelectedValue property.
2. The InsertParts method first determines what type of content will be inserted and sets the relationship type for it. Next, the method opens the document package, follows the relationships of Document.xml to find the location of the specified part, and deletes it using the Package object.
3. To insert the new part, the method calls the Package object’s CreatePart method, which returns a new, empty PackagePart at the same location in the package as the one deleted in step 2.
4. In the last step, the method calls the CopyStream method and passes a FileStream object that represents the content of the part to be inserted as well as a Stream object representing the blank package part created in step 3. CopyStream takes the two Stream objects and copies the source into the target. I defined the target as ByRef so that any changes made are reflected in the passed object (because Stream is a reference type, ByVal would work equally well).
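The same find-delete-replace flow does not depend on .NET. The Python sketch below does it with the standard library alone, as a stand-in for the VB.NET listing rather than a translation of it. It assumes the relationships for the document part live at word/_rels/document.xml.rels and that it is acceptable to write a new package instead of editing in place (Python’s zipfile module cannot delete a member). The replace_parts, find_targets, and build_demo_package names are hypothetical, and the demo package is not a valid Word document.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

DOCUMENT_PART = "word/document.xml"
RELS_PART = "word/_rels/document.xml.rels"


def find_targets(rels_xml, type_suffix):
    """Return ZIP member names for every relationship of the given type."""
    base = posixpath.dirname(DOCUMENT_PART)
    names = []
    for element in ET.fromstring(rels_xml):
        if element.get("Type", "").endswith("/" + type_suffix):
            # Targets are resolved relative to the folder of the source part.
            joined = posixpath.join(base, element.get("Target"))
            names.append(posixpath.normpath(joined).lstrip("/"))
    return names


def replace_parts(source, output, type_suffix, new_content):
    """Copy a package, swapping the content of parts of one relationship type."""
    with zipfile.ZipFile(source) as src:
        targets = set(find_targets(src.read(RELS_PART), type_suffix))
        with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as dst:
            for name in src.namelist():
                data = new_content if name in targets else src.read(name)
                dst.writestr(name, data)
    return sorted(targets)


def build_demo_package():
    """Build a tiny stand-in package in memory (hypothetical content)."""
    rels = (
        "<Relationships>"
        '<Relationship Id="rId5" Type="http://schemas.microsoft.com/office/2006/'
        'relationships/wordHeader" Target="header1.xml" />'
        '<Relationship Id="rId7" Type="http://schemas.microsoft.com/office/2006/'
        'relationships/wordFooter" Target="footer1.xml" />'
        "</Relationships>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr(DOCUMENT_PART, "<document/>")
        zf.writestr(RELS_PART, rels)
        zf.writestr("word/header1.xml", "<old-header/>")
        zf.writestr("word/footer1.xml", "<old-footer/>")
    return buffer


if __name__ == "__main__":
    result = io.BytesIO()
    print(replace_parts(build_demo_package(), result, "wordHeader", b"<new-header/>"))
```

Writing to a second file also leaves the original document untouched if something goes wrong partway through, which is a reasonable default for server-side processing.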
Figure 3. The Before and After Versions of the Selected Document
Cool Thing #5 Key Takeaway: Manipulating the new file formats in code is a simple process that does not require a lot of effort.
Office 2007’s new Open XML file format provides extensive capabilities to a developer. As this article has pointed out, there is a lot to be excited about and to learn.