This chapter explains how to use the MergeDocx functionality, which appends (concatenates) docx files to create a single docx file. For example, you can place a cover letter and a contract into a single docx file without changing the look and feel of either document.
A BlockRange is essentially a WordprocessingMLPackage, or a range of content in a WordprocessingMLPackage, plus config settings.
To merge docx files, you invoke DocumentBuilder with List<BlockRange>:
List<BlockRange> blockRanges = new ArrayList<BlockRange>();
blockRanges.add( new BlockRange( wordMLPkg1 ) );
blockRanges.add( new BlockRange( wordMLPkg2 ) );
// etc
// Perform the actual merge
DocumentBuilder documentBuilder = new DocumentBuilder();
WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
You can fine tune the merge process by configuring individual block ranges, or the DocumentBuilder object, as described in the SETTINGS section below.
The samples directory contains an example called MergeWholeDocumentsUsingBlockRange which you can use as a starting point.
Alternatively, there is a webapp which can generate code for you, based on your chosen configuration.
Note: there is also a static method you can use to merge a List<WordprocessingMLPackage>, but that is not recommended since it precludes user config of DocumentBuilder and individual BlockRanges.
If you invoke DocumentBuilder with List<BlockRange>, obviously all your BlockRanges are in memory at once. DocumentBuilderIncremental is a more memory efficient approach which avoids this. See MANY DOCUMENTS further below.
If you wish to use only certain parts of the documents, you need to invoke DocumentBuilder with a List<BlockRange>.
BlockRange associates a range with a WordprocessingMLPackage.
The org.docx4j.wml.Body element has a method:
public List<Object> getEGBlockLevelElts()
which contains the "block-level" document content (paragraphs, tables etc).[1]
BlockRange constructors let you say you want the contents starting from the nth element onwards:
/**
 * Specify the source package, from "n" (0-based index) to the end of the document
 **/
public BlockRange(WordprocessingMLPackage wordmlPkg, int n)
or count elements from the nth element:
/**
 * Specify the source package, from "n" (0-based index) and include "count"
 * block-level (paragraph, table etc) elements.
 **/
public BlockRange(WordprocessingMLPackage wordmlPkg, int n, int count)
or the entire docx:
/**
 * Specify the entire source package.
 **/
public BlockRange(WordprocessingMLPackage wordmlPkg)
For example:
List<BlockRange> blockRanges = new ArrayList<BlockRange>();
blockRanges.add(new BlockRange(wmlPkgIn)); // add all
blockRanges.add(new BlockRange(wmlPkgIn, 0, 6)); // paras 0-5
blockRanges.add(new BlockRange(wmlPkgIn, 6)); // paras 6 onwards
DocumentBuilder documentBuilder = new DocumentBuilder();
WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
The result is a new WordprocessingMLPackage containing the specified portions of the source documents.
The samples directory contains an example called MergeBlockRangeFixedN.
Where you want to use the nth element constructors, how do you determine n? See Determining the nth element towards the end of this document.
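The index arithmetic behind the n and count parameters can be illustrated without docx4j. The following is a minimal sketch, assuming a plain java.util.List stands in for the block-level content; the class RangeSketch and its select methods are hypothetical names used for illustration only and are not part of MergeDocx.

```java
import java.util.Arrays;
import java.util.List;

public class RangeSketch {

    /** Elements from index n (0-based) to the end, analogous to BlockRange(pkg, n). */
    public static <T> List<T> select(List<T> blocks, int n) {
        return blocks.subList(n, blocks.size());
    }

    /** "count" elements starting at index n, analogous to BlockRange(pkg, n, count). */
    public static <T> List<T> select(List<T> blocks, int n, int count) {
        return blocks.subList(n, n + count);
    }

    public static void main(String[] args) {
        // Strings stand in for paragraphs, tables and other block-level elements.
        List<String> blocks = Arrays.asList("p0", "p1", "p2", "p3", "p4", "p5", "p6", "p7");
        System.out.println(select(blocks, 0, 6)); // elements 0-5
        System.out.println(select(blocks, 6));    // elements 6 onwards
    }
}
```

Note that (0, 6) followed by (6) covers every element exactly once, which is why the two ranges in the example above together reproduce the whole document body.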
You may use the one WordprocessingMLPackage in more than one BlockRange. For example:
List<BlockRange> blockRanges = new ArrayList<BlockRange>();
blockRanges.add(new BlockRange(wmlpkg1, 12));
blockRanges.add(new BlockRange(wmlpkg2, 3, 3));
blockRanges.add(new BlockRange(wmlpkg1)); // Use wmlpkg1 again
You must not, however, use the same BlockRange object twice. For example, the following usage is incorrect:
BlockRange blockRange1 = new BlockRange(wmlpkg1, 12);
List<BlockRange> blockRanges = new ArrayList<BlockRange>();
blockRanges.add(blockRange1);
blockRanges.add(new BlockRange(wmlpkg2, 3, 3));
blockRanges.add(blockRange1); // Incorrect
As explained above, BlockRange constructors let you say you want the contents starting from the nth element onwards:
/**
 * Specify the source package, from "n" (0-based index) to the end of the document
 **/
public BlockRange(WordprocessingMLPackage wordmlPkg, int n)
or count elements from the nth element:
/**
 * Specify the source package, from "n" (0-based index) and include "count"
 * block-level (paragraph, table etc) elements.
 **/
public BlockRange(WordprocessingMLPackage wordmlPkg, int n, int count)
The question arises as to how to work out these numbers.
There are three approaches for finding the relevant block:
· manually
· via XPath
· via TraversalUtil
TraversalUtil is the recommended approach, mainly because there is a limitation to using XPath in JAXB (described below).
Explanations of the three approaches follow.
Common to all of them however, is the question of how to identify what you are looking for.
· Paragraphs don't have IDs, so you might search for a particular string.
· Or you might search for the first paragraph following a section break.
· A good approach is to use content controls (which can have IDs), and to search for your content control by ID, title or tag.
The examples provided show how to do each of these. They can be readily adapted for other cases, such as before or after a table or image. If you have any difficulties with your particular case, please do not hesitate to ask for support.
Manual approach
The manual approach is to iterate through the block level elements in the document yourself, looking for the paragraph, table or content control which matches your criteria. To do this, use the org.docx4j.wml.Body method:
public List<Object> getEGBlockLevelElts()
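The iteration itself is a simple linear search. The following is a minimal sketch, assuming a plain List<Object> stands in for the list returned by getEGBlockLevelElts(); the class FindBlockIndex and the method indexOfFirst are hypothetical names for illustration, and strings play the role of paragraphs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class FindBlockIndex {

    /** Returns the 0-based index of the first block matching the test, or -1 if none matches. */
    public static int indexOfFirst(List<Object> blocks, Predicate<Object> matches) {
        for (int i = 0; i < blocks.size(); i++) {
            if (matches.test(blocks.get(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in content: each string represents one block-level element.
        List<Object> blocks = Arrays.<Object>asList("Cover letter", "Dear Sir", "CONTRACT", "Clause 1");
        int n = indexOfFirst(blocks, b -> b.toString().contains("CONTRACT"));
        System.out.println(n); // prints 2
    }
}
```

With real document content, the predicate would inspect the docx4j object (for example, the text of a paragraph) rather than call toString(), and the resulting n would be passed to a BlockRange constructor. Check for -1 before using the result.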
XPath approach
Underlying this approach is the use of XPath to select JAXB nodes:
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
String xpath = "//w:p";
List<Object> list = documentPart.getJAXBNodesViaXPath(xpath, false);
You then find the index of the returned node in EGBlockLevelElts.
Beware, there is a limitation to using XPath in JAXB: the xpath expressions are evaluated against the XML document as it was when first opened in docx4j. You can update the associated XML document once only, by passing true into getJAXBNodesViaXPath. Updating it again (with current JAXB 2.1.x or 2.2.x) will cause an error. So you need to be a bit careful!
TraversalUtil approach
TraversalUtil is a general approach for traversing the JAXB object tree in the main document part. TraversalUtil has an interface Callback, which you use to specify how you want to traverse the nodes, and what you want to do to them.
TraversalUtil can be used to find a node; you then get the index of the returned node in EGBlockLevelElts.
Examples are in the samples directory, named as follows:
| | Manually | via XPath | via TraversalUtil |
| String | MergeBlockRangeN | MergeBlockRangeNViaXPathString | MergeBlockRangeNViaTraversalUtilsString |
| SectPr | MergeBlockRangeNViaManualSectPr | MergeBlockRangeNViaXPathSectPr | MergeBlockRangeNViaTraversalUtilsSectPr |
| Content control | MergeBlockRangeNViaManualContentControl | MergeBlockRangeNViaXPathContentControl | MergeBlockRangeN |
The approach described above doesn’t allow you to insert contents into a table cell.
To do this, you can either use the class ProcessAltChunk as described below, or you can use a placeholder to indicate where you want a BlockRange to be inserted.
The placeholder is a content control containing a
<w:tag w:val="MergeDocx:BlockRangeIDREF=myTableContent"/>
in this case referencing a BlockRange having ID “myTableContent”.
Inside a table cell, the complete placeholder would look something like this:
<w:tc>
<w:sdt>
<w:sdtPr>
<w:tag w:val="MergeDocx:BlockRangeIDREF=myTableContent"/>
</w:sdtPr>
<w:sdtContent>
<w:p>
<w:r>
<w:t>My placeholder7</w:t>
</w:r>
</w:p>
</w:sdtContent>
</w:sdt>
</w:tc>
The BlockRange which will be placed at this location is given a matching ID:
blockrange.setID("myTableContent");
You may then invoke DocumentBuilder in the usual way.
The result will be that the contents of the table cell are replaced with the contents of the block range.
This works in a similar way to AltChunk processing (see below); in both cases you can insert the block range at locations where block/paragraph-level content is allowed.
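Because the tag value in the placeholder and the ID set on the BlockRange must match, it can help to derive both from one string. The following is a minimal sketch, assuming the tag value is exactly the prefix MergeDocx:BlockRangeIDREF= followed by the ID, as in the example above; the class PlaceholderTag and its methods are hypothetical helpers, not part of MergeDocx.

```java
public class PlaceholderTag {

    static final String PREFIX = "MergeDocx:BlockRangeIDREF=";

    /** Builds the w:tag value that references the given BlockRange ID. */
    public static String tagFor(String blockRangeId) {
        return PREFIX + blockRangeId;
    }

    /** Extracts the referenced ID from a w:tag value, or null if it is not a placeholder tag. */
    public static String idFrom(String tagValue) {
        if (tagValue == null || !tagValue.startsWith(PREFIX)) {
            return null;
        }
        return tagValue.substring(PREFIX.length());
    }

    public static void main(String[] args) {
        System.out.println(tagFor("myTableContent"));
        System.out.println(idFrom("MergeDocx:BlockRangeIDREF=myTableContent"));
    }
}
```

A helper like idFrom can be used to check your template's content control tags against the IDs you intend to set before invoking the merge.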
For the best practice approach, please see the end of this section. The interim content works up to that by describing alternatives.
The simplest approach is to add your ID’d block ranges to the blockRanges list before the ‘real’ documents:
List<BlockRange> blockRanges = new ArrayList<BlockRange>();
BlockRange block;
// Define insertions
block = new BlockRange(insertionDocx,1,1);
block.setID("MySourceId");
blockRanges.add( block );
// Now add inputDocx1 proper
block = new BlockRange(inputDocx1);
blockRanges.add( block );
// Perform the actual merge
The reason for this ordering is that if the block ranges being moved come last, then after they are moved, the sectPr at the end of the previous block range is left untouched and is now adjacent to the document level sectPr. (Moving things around is the very last step in the MergeDocx process.)
The downside of having your ID’d block ranges at the start of the blockRanges list is that certain document wide defaults come from there.
If you have them at the end of the blockRanges list, you’ll get two sectPr elements at the end of the document (the first belonging to the immediately prior block range, and the document level one an artifact from the block range which was moved). For example:
<w:p>
  <w:pPr>
    <w:sectPr w:rsidR="00D37ADB">
      <w:pgSz w:h="16838" w:w="11906"/>
      <w:pgMar w:gutter="0" w:footer="708" w:header="708" w:left="1440" w:bottom="1440" w:right="1440" w:top="1440"/>
      <w:cols w:space="708"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:pPr>
</w:p>
<w:sectPr>
  <w:pgSz w:h="16838" w:w="11906"/>
  <w:pgMar w:gutter="0" w:footer="708" w:header="708" w:left="1440" w:bottom="1440" w:right="1440" w:top="1440"/>
  <w:cols w:space="708"/>
  <w:docGrid w:linePitch="360"/>
</w:sectPr>
This is harmless enough, but if you wanted to fix it, you could programmatically delete the document level one in your own code, and you could also promote the sectPr from the last paragraph.
You wouldn’t want to call setSectionBreakBefore(SectionBreakBefore.NONE) on the block range not being moved, since although you’ll end up with only one sectPr, it is the wrong one!
Finally, here is the best practice: add the main document first with SectionBreakBefore.NONE, then the ID’d block range, and finally an empty block range from the main document to supply the body level sectPr, like so:
WordprocessingMLPackage pkg1 = WordprocessingMLPackage.load(new File(file1));
BlockRange source1 = new BlockRange(pkg1);
source1.setSectionBreakBefore(SectionBreakBefore.NONE); // note this
BlockRange tableContent = new BlockRange(WordprocessingMLPackage.load(new File(file2)));
tableContent.setID("myTableContent");
List<BlockRange> sources = new ArrayList<BlockRange>();
sources.add(source1);
sources.add(tableContent);
// Add pkg1 again for our body level sectPr
BlockRange emptyBR = new BlockRange(pkg1, 0, 0); // none of the contents - just sectPr
sources.add(emptyBR);
altChunk is a way of telling a consuming application that certain content is to be included in the document.
For further details, please see http://blogs.msdn.com/b/ericwhite/archive/2008/10/27/how-to-use-altchunk-for-document-assembly.aspx
Word 2007 understands what to do with an altChunk.
docx4j doesn't, unless you use the MergeDocx utility (or write your own code). If your docx contains altChunks, it is important to be able to resolve them if you want to generate HTML or PDF output using docx4j.
MergeDocx handles altChunk of type docx, as opposed to html or plain text. Support for altChunk of type xhtml is available in the docx4j-ImportXHTML jar.
The class ProcessAltChunk contains a method:
public static WordprocessingMLPackage process(WordprocessingMLPackage srcPackage) throws Docx4JException
which will process docx altChunks in the Main Document Part (document.xml).
There is also the option to specify how styles are handled:
/**
 * Process srcPackage, replacing all alt chunks of type docx (as
 * opposed to HTML etc), with proper document content.
 *
 * @param srcPackage
 * @param styleHandler StyleHandler.USE_EARLIER or RENAME_RETAIN
 * @return
 * @throws Docx4JException
 * @since 3.2
 */
public static WordprocessingMLPackage process(WordprocessingMLPackage srcPackage,
        StyleHandler styleHandler) throws Docx4JException
Limitations/recommendations:
· We recommend you avoid setting headers/footers in your altChunk. Microsoft Word does strange things when an altChunk contains headers/footers; currently, MergeDocx does not attempt to duplicate this behaviour.
· altChunk elements in parts other than the Main Document Part (eg headers/footers, footnotes/endnotes and comments) are not converted.
Any comments and footnotes/endnotes in the altChunk should get added OK.
If you want to delete part of a docx, including the parts it references but which will no longer be used, you can use the constructor:
/**
 * Specify the source package, from "n" (0-based index) and include "count"
 * block-level (paragraph, table etc) elements.
 **/
public BlockRange(WordprocessingMLPackage wordmlPkg, int n, int count)
twice on the one input document, adding the bit before the stuff to be deleted, and the bit after it.
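The two (n, count) pairs follow directly from the position and length of the span to be deleted. The following is a minimal sketch of that arithmetic only; the class DeleteRangeSketch and the method keepRanges are hypothetical names for illustration, and total is assumed to be the number of block-level elements in the document body.

```java
import java.util.Arrays;

public class DeleteRangeSketch {

    /**
     * Returns two {n, count} pairs: the elements before the deleted span,
     * and the elements after it, for a body holding "total" elements.
     */
    public static int[][] keepRanges(int total, int deleteFrom, int deleteCount) {
        int afterStart = deleteFrom + deleteCount;
        int[] before = { 0, deleteFrom };                 // elements 0 .. deleteFrom-1
        int[] after = { afterStart, total - afterStart }; // elements afterStart .. total-1
        return new int[][] { before, after };
    }

    public static void main(String[] args) {
        // Delete 2 elements starting at index 3, from a body of 10 elements.
        int[][] ranges = keepRanges(10, 3, 2);
        System.out.println(Arrays.toString(ranges[0])); // [0, 3]
        System.out.println(Arrays.toString(ranges[1])); // [5, 5]
    }
}
```

Each pair would then be passed as the n and count arguments of the constructor shown above, with both BlockRange objects created from the same input document.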
Background
The Open XML specification includes a technology called “Custom XML data binding”, which can be used in document automation and reporting scenarios to automatically inject data from an XML document of your choosing into your docx.
If a content control has an XPath, that XPath is used to retrieve the matching element from your XML document.
OpenDoPE (Open Document Processing Ecosystem) is a set of conventions for tagging a content control to enable:
· conditional content
· repeating content (eg rows of a table, or a bulleted or numbered list)
docx4j is the reference implementation of OpenDoPE.
MergeDocx support for OpenDoPE
You can use MergeDocx and OpenDoPE together. Support for combining these technologies was significantly improved in MergeDocx v1.5.0.
You can use MergeDocx first, and then docx4j’s OpenDoPEHandler.
Or you can use docx4j’s OpenDoPEHandler first, then MergeDocx.
Either order is supported, but it is probably more efficient to use MergeDocx first, followed by OpenDoPEHandler. If you plan to use MergeDocx first, and your documents include compound conditions (ie and|or|not operators), you must use docx4j 3.0.
MergeDocx is designed to ensure that each of the input docx uses its own OpenDoPE parts and XML answers, without interfering with the other input docx.
There are two approaches to supplying the XML answer files.
The first approach is to inject an appropriate answer file into each input docx before invoking MergeDocx and OpenDoPEHandler. This is the approach which would be familiar to OpenDoPE users.
A second approach is to tell MergeDocx a Map of W3C DOM Documents containing answers which are to be used across the input documents. The map is keyed by root element QName. With this approach, you can skip the preliminary step of injecting real XML data into each input docx.
For example, suppose you were merging 3 documents, of which 2 used an answer file with root element <supplier> and one used an answer file with root element <specification>.
With:
Map<QName, org.w3c.dom.Document> answerDomDocs
you can set:
documentBuilder.setOpenDoPEAnswers(answerDomDocs);
and the values supplied will be used in preference to whatever XML part (with corresponding root element QName) is in the input docx.
There is a helper class OpenDoPeRegistration, which takes an InputStream containing your XML and adds it to the Map<QName, org.w3c.dom.Document>:
Map<QName, org.w3c.dom.Document> answerDomDocs = new HashMap<QName, org.w3c.dom.Document>();
InputStream is = FileUtils.openInputStream(new File("supplier.xml"));
OpenDoPeRegistration.register(answerDomDocs, is);
is = FileUtils.openInputStream(new File("specification.xml"));
OpenDoPeRegistration.register(answerDomDocs, is);
documentBuilder.setOpenDoPEAnswers(answerDomDocs);
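To make "keyed by root element QName" concrete, the following is a minimal sketch that builds such a map using only the JDK. It is an assumption about what registration amounts to, not the source of OpenDoPeRegistration; the class AnswerMapSketch and its methods are hypothetical. Note that the parser class used here is javax.xml.parsers.DocumentBuilder, which is unrelated to the MergeDocx DocumentBuilder.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AnswerMapSketch {

    /** Parses the stream and stores the DOM under the QName of its root element. */
    public static void register(Map<QName, Document> answers, InputStream is) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // so the root QName carries its namespace, if any
            Document doc = dbf.newDocumentBuilder().parse(is);
            Element root = doc.getDocumentElement();
            // A null namespace URI is treated by QName as "no namespace".
            answers.put(new QName(root.getNamespaceURI(), root.getLocalName()), doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse answer XML", e);
        }
    }

    /** Convenience for the example: wraps an XML string as an InputStream. */
    public static InputStream stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Map<QName, Document> answers = new HashMap<QName, Document>();
        register(answers, stream("<supplier><name>Acme</name></supplier>"));
        register(answers, stream("<specification><id>42</id></specification>"));
        System.out.println(answers.keySet());
    }
}
```

Because the map is keyed by root element, two input documents that both expect a <supplier> answer file share the single entry registered under that name.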
OpenDoPE processing of rich text fragments
OpenDoPE also allows you to bind to an XML node containing:
In both cases, docx4j will convert that to docx content.
In the Flat OPC XML case, it converts it to an AltChunk (see previous section). MergeDocx can then convert the AltChunk to native document content.
If you invoke DocumentBuilder with List<BlockRange>, obviously all your BlockRanges are in memory at once.
If you are merging many documents, or even a smaller number of large documents, you may run out of memory.
DocumentBuilderIncremental is intended to help in this situation. It allows you to work with a single BlockRange at a time.
Example of usage:
DocumentBuilderIncremental dbi = new DocumentBuilderIncremental();
for (int i = 0; i < MAX; i++) {
    BlockRange block = getBlockRange(i); // Your method
    block.setSectionBreakBefore(BlockRange.SectionBreakBefore.NEXT_PAGE);
    if (i==0) {
        block.setHeaderBehaviour(BlockRange.HfBehaviour.DEFAULT);
        block.setFooterBehaviour(BlockRange.HfBehaviour.DEFAULT);
    } else {
        // Avoid creating unnecessary additional header/footer parts
        block.setHeaderBehaviour(BlockRange.HfBehaviour.INHERIT);
        block.setFooterBehaviour(BlockRange.HfBehaviour.INHERIT);
    }
    System.out.println(i);
    dbi.addBlockRange(block, i==(MAX-1) ); // 2nd param is whether this is your last docx
}
WordprocessingMLPackage output = dbi.finish(); // Get the output docx
In the example above, the headers/footers are taken from the first document only. This avoids creating potentially thousands of header/footer parts, where just a couple suffice.
MergeDocx ensures that each document is separated by a section properties element. The relevant properties are actually contained in the first sectPr element in the second of any two BlockRanges.
In other words, if 3 documents are concatenated, and each is just a single section, the resulting document will contain 3 sections.
By default each section starts on a new page.
If you want to avoid the page break, use BlockRange's setSectionBreakBefore method:
BlockRange blockRange1 = ...
BlockRange blockRange2 = ...
// avoid page break
blockRange2.setSectionBreakBefore(SectionBreakBefore.CONTINUOUS);
etc.
The MergeBlockRangeFixedN sample utilizes this.
Your choices for the SectionBreakBefore property are:
· NONE
· NONE_MERGE_PARAGRAPH
· NEXT_PAGE
· NEXT_COLUMN
· CONTINUOUS
· EVEN_PAGE
· ODD_PAGE
With the exception of "NONE" and "NONE_MERGE_PARAGRAPH" these mirror values available in Word.
Since what happens between documents is controlled by the first sectPr in the second of the two documents, MergeDocx will set the first sectPr in the second document with the value specified. If there is no sectPr, it will add one at the end of the BlockRange and set that.
"NONE" is a bit different. In this case, no sectPr will be added, and nor will any existing sectPr be altered. So you can think of it as "unspecified". NONE can be useful if you want to manipulate sectPr values in your own code.
"NONE_MERGE_PARAGRAPH" will attempt to merge the last paragraph of the previous block range with the first paragraph of this one.
If you leave the property unset, MergeDocx will add a sectPr if one is not present. MergeDocx will not set its type. If the type is not set, the default is NEXT_PAGE, according to the OpenXML spec.
Note: in Word, by default, ODD_PAGE is not honoured if you have set page numbering to restart. Please see the section after Page Numbering below for details as to how to control this behaviour.
Also, Word will ignore a “continuous” setting, and insert a page break, if it detects that the page sizes of the two contiguous sections are different. This can produce unexpected results where, for example, both page sizes are intended to be A4 portrait, but specified in units which differ (for whatever reason) by a few mm. The sample NormalizePageSizes contains code which demonstrates how to address this issue.
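The idea of normalising near-identical page sizes can be sketched independently of docx4j. The following is a minimal illustration, not the NormalizePageSizes sample: the class PageSizeSketch, its methods and the tolerance value are all assumptions. Dimensions are in the units used by w:pgSz (twentieths of a point), with A4 portrait taken as 11906 by 16838, as in the sectPr examples in this document.

```java
import java.util.Arrays;

public class PageSizeSketch {

    // A4 portrait as it appears in w:pgSz (w:w="11906" w:h="16838").
    static final int A4_W = 11906;
    static final int A4_H = 16838;

    /** True if value is within tolerance of target. */
    public static boolean near(int value, int target, int tolerance) {
        return Math.abs(value - target) <= tolerance;
    }

    /** Snaps a {width, height} pair to A4 portrait when both dimensions are close enough. */
    public static int[] normalize(int w, int h, int tolerance) {
        if (near(w, A4_W, tolerance) && near(h, A4_H, tolerance)) {
            return new int[] { A4_W, A4_H };
        }
        return new int[] { w, h }; // leave genuinely different sizes alone
    }

    public static void main(String[] args) {
        // Almost A4: snapped. US Letter (12240 x 15840): left unchanged.
        System.out.println(Arrays.toString(normalize(11900, 16840, 60)));
        System.out.println(Arrays.toString(normalize(12240, 15840, 60)));
    }
}
```

In a real merge, the normalised values would be written back to the w:pgSz of each section before merging, so that Word sees identical page sizes on both sides of a continuous break. The tolerance should be chosen small enough that intentionally different page sizes are not collapsed.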
Suppose you are merging docx1 and docx2.
The default behaviour is as follows:
· If docx1 has a header, and docx2 does as well, then by default both sets of headers will be used.
· If docx1 has a header, but docx2 doesn't, then by default the pages from docx2 will be shown using headers from docx1.
You can override this behaviour:
· if you want no headers defined in the first section of docx2:
blockRange2.setHeaderBehaviour(HfBehaviour.NONE);
· if docx2 has headers defined in its first sectPr, but you want to ignore them and use the headers from docx1:
blockRange2.setHeaderBehaviour(HfBehaviour.INHERIT);
There is a similar method for controlling footer behaviour, called setFooterBehaviour.
Suppose you are merging docx1 and docx2, and showing page numbers or cross referencing to page numbers.
Unless docx2 explicitly restarts page numbering, the numbers will continue on from those in docx1.
You can make the page numbering restart with:
blockRange2.setRestartPageNumbering(true);
If you are using page numbering of the form "page n of <total pages>" and you want <total pages> to reflect the number of pages in the relevant original document (rather than the number of pages in the resulting merged document), you should change your source documents so that they refer to <Total Number of Pages in Section>. See further http://support.microsoft.com/kb/191029
This will work provided each source docx has a single section. If the source documents have multiple sections, you will need to put a bookmark on the last page of each, and use a reference to that as the total number of pages.
If you have front matter you wish to exclude from the number of pages, you need to do a calculation[2]:
· If you know the number of pages in the front matter (and it will not change), then you can use Page { Page } of { = { NumPages } - x }, where x is the number of pages in the front matter.
· If not, then you insert a bookmark on the last page of the document and use a PageRef field to reference the page number of that bookmark instead of the NumPages field.
The default behaviour of MergeDocx is to produce an output docx which contains no macros.
You can configure DocumentBuilder to retain the macros present in one of the source documents. To do this, you need to be using the BlockRange approach.
DocumentBuilder contains:
/**
 * With this setting, you can embed macros from one of the input documents, in the output docx.
 * Without it, macros will simply be ignored.
 * The macros come from the docm or dotm underlying the specified BlockRange.
 * The setting will be ignored if a docx or dotx underlies the specified BlockRange.
 * @param br
 */
public void setRetainMacros(BlockRange br)
So you can do something like:
documentBuilder.setRetainMacros(blockRanges.get(2));
to keep the macros from docm/dotm underlying the 3rd BlockRange.
If MergeDocx finds macros in that block range, the resulting output document will be set to be of the same type (ie docm or dotm). It is your responsibility, when saving your output WordprocessingMLPackage, to save it with the correct filename extension. If a docm is saved with a docx extension and you try to open it in Word 2010, you will get an error.
So you need to ensure you use the correct filename extension.
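One way to pick the extension is from the content type of the main document part. The following is a minimal sketch of that mapping using the standard Open XML content type strings; the class ExtensionSketch and the method extensionFor are hypothetical, and how you obtain the content type string from your output package is left open here.

```java
public class ExtensionSketch {

    /** Maps the main document part's content type to a filename extension. */
    public static String extensionFor(String mainPartContentType) {
        switch (mainPartContentType) {
            case "application/vnd.ms-word.document.macroEnabled.main+xml":
                return "docm";
            case "application/vnd.ms-word.template.macroEnabledTemplate.main+xml":
                return "dotm";
            case "application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml":
                return "dotx";
            case "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml":
                return "docx";
            default:
                throw new IllegalArgumentException("Unrecognised content type: " + mainPartContentType);
        }
    }

    public static void main(String[] args) {
        System.out.println(extensionFor("application/vnd.ms-word.document.macroEnabled.main+xml")); // docm
    }
}
```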
With MergeDocx, you can use the settings described above to have each new document start on the right (recto) page, with numbering starting again from one:
block.setSectionBreakBefore(SectionBreakBefore.ODD_PAGE);
block.setRestartPageNumbering(true);
Microsoft Word will not, however, honour this combination unless the docx is “tweaked” to make it do so.
There are two different ways MergeDocx can tweak the output docx in order to have Word behave as expected. You’ll need to experiment with both approaches; this is best done by physically printing the output from Word to your printer or to PDF. (You can print 4 pages per side to save paper, and still see what is going on.)
The first is:
documentBuilder.setSectionBreak_ODD_PAGE(BEHAVIOUR_SectionBreak_ODD_PAGE.MIRROR_MARGINS);
This is the cleanest approach, and should be used where possible. For it to work, you need to ensure your first docx being merged has a document settings part (since the mirror margins setting is stored in that part, and MergeDocx gets that part from the first docx).
The second is:
documentBuilder.setSectionBreak_ODD_PAGE(BEHAVIOUR_SectionBreak_ODD_PAGE.FIELD_IF_MOD);
If you use this approach, MergeDocx will insert an arcane field into your docx before appropriate sections (hit Shift F9 to see field codes):
The table below summarises the advantages and disadvantages of each approach:
MIRROR_MARGINS
+ doesn’t introduce fields into the docx
- may not work if documents contain both portrait and landscape pages; see http://support.microsoft.com/kb/185528
- first docx must have a document settings part for this to work (you can add one with docx4j if it doesn’t)
- single setting per docx (though the other approach is the same in practice)
FIELD_IF_MOD
+ suited to a mixture of portrait and landscape pages
+ can be adjusted to include “this page intentionally left blank”
- PDF output systems (other than Word) are less likely to support
When documents using the "same" numbering are merged, by default, the numbering will continue, not restart.
This is useful if you are merging chapters of a book, or sections of a contract, and you want the numbering to continue.
Sometimes, however, you may want to force the numbering to restart. To do this, you instruct MergeDocx to add new lists, rather than re-using existing lists, by setting the NumberingHandler to ADD_NEW_LIST:
BlockRange blockRange1 = ...
BlockRange blockRange2 = ...
blockRange2.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
The default is USE_EARLIER_IFF_SAME. "same" means the formatting definition is the same (ie they look the same), and the list is based on the same abstract numbering definition identifier (nsid).
There is a third option, USE_EARLIER, which will use a list with the same nsid from an earlier BlockRange, irrespective of whether it looks the same. The numbering will continue, not restart. For example if the numbering of the list in the first BlockRange was decimal, and the second BlockRange contained a list with the same nsid but roman numbering, applying the USE_EARLIER to the second BlockRange would cause its numbering to be decimal (rather than roman).
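The three options can be summarised as a small decision rule. The following is a minimal sketch that models the behaviour as described in this section; it is not MergeDocx's implementation, and the class NumberingDecisionSketch, its local Handler enum and the method reusesEarlierList are hypothetical names.

```java
public class NumberingDecisionSketch {

    /** Local stand-in mirroring the three option names described above. */
    public enum Handler { ADD_NEW_LIST, USE_EARLIER_IFF_SAME, USE_EARLIER }

    /**
     * True if a list in a later BlockRange continues an earlier list
     * (numbering continues); false if a new list is added (numbering restarts).
     */
    public static boolean reusesEarlierList(Handler handler, boolean sameNsid, boolean sameFormatting) {
        switch (handler) {
            case ADD_NEW_LIST:
                return false;                      // always restart
            case USE_EARLIER_IFF_SAME:
                return sameNsid && sameFormatting; // continue only if it also looks the same
            case USE_EARLIER:
                return sameNsid;                   // continue even if the formatting differs
            default:
                throw new IllegalStateException("Unknown handler: " + handler);
        }
    }

    public static void main(String[] args) {
        // Same nsid, but decimal in the first range and roman in the second.
        System.out.println(reusesEarlierList(Handler.USE_EARLIER_IFF_SAME, true, false)); // false
        System.out.println(reusesEarlierList(Handler.USE_EARLIER, true, false));          // true
    }
}
```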
By default, if a style is encountered which is already defined in an earlier BlockRange, that earlier definition will be used. If the definition is different, this will cause the appearance of text using this style to change.
If the documents you are merging were styled independently, you will probably want them to retain their individual look. This can be accomplished by importing the styles (and renaming them so they don't collide).
To do this, set the StyleHandler to RENAME_RETAIN:
BlockRange blockRange1 = ...
BlockRange blockRange2 = ...
blockRange2.setStyleHandler(StyleHandler.RENAME_RETAIN);
Known limitation regarding Table of Contents: consider a style which will be renamed. A TOC field which refers to that style will not be updated to use the new name. This means entries in the table of contents will go missing.
If the document contains numbering, you'll also want to set:
blockRange2.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
(The default option is USE_EARLIER).
Since merging documents can take some time (depending on the number and complexity of the documents), the possibility exists (new in 3.1.0) of performing the merge in the background, and receiving notification when the job is complete.
See the MergeDocxProgress sample for an example of usage.
As per that example, you need to create a message bus, register your listeners, and tell Docx4jEvent to use that bus for notifications. This is done as follows:
// Creation of message bus
MBassador<Docx4jEvent> bus = new MBassador<Docx4jEvent>(BusConfiguration.Default());
// and registration of listeners
ListeningBean listener = new ListeningBean();
bus.subscribe(listener);
// tell Docx4jEvent to use your message bus for notifications
Docx4jEvent.setEventNotifier(bus);
The sample class contains an example ListeningBean. Note the @Handler annotation.
Docx4j’s approach to event monitoring relies on the MBassador library; see further https://github.com/bennidi/mbassador
For another example of monitoring events (docx load, save), please see https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/EventMonitoringDemo.java
The styles part of a docx contains an element called w:docDefaults. Example contents:
<w:docDefaults>
<w:rPrDefault>
<w:rPr>
<w:rFonts w:asciiTheme="minorHAnsi" w:eastAsiaTheme="minorEastAsia" w:hAnsiTheme="minorHAnsi" w:cstheme="minorBidi"/>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
<w:lang w:val="en-US" w:eastAsia="ko-KR" w:bidi="ar-SA"/>
</w:rPr>
</w:rPrDefault>
<w:pPrDefault>
<w:pPr>
<w:spacing w:after="200" w:line="276" w:lineRule="auto"/>
</w:pPr>
</w:pPrDefault>
</w:docDefaults>
These are the basic/root settings, on which the formatting/appearance is based. See further below for tips on seeing/manipulating w:docDefaults.
When documents are merged, there can only be one w:docDefaults element.
If one or more block ranges have StyleHandler.RENAME_RETAIN (that is, you want to retain the existing look of each individual document), or incremental processing is being used, we merge the properties in doc defaults into the styles (with the exception of paragraph spacing – see further below).
In a document created in Word, the settings part by default contains:
<w:compatSetting w:name="overrideTableStyleFontSizeAndJustification" .. w:val="1"/>
but this may vary by input document.
Where it is false, anything in a table where font size 11/12 or jc left came from the Normal style was ignored (in favour of whatever the table style specified).
In the output docx, this is always set, so paragraph styles do override table styles.
Where that wasn’t true in a particular input document, appropriate adjustments are made.
Consider the above example, where w:docDefaults contains a setting for w:spacing:
<w:pPrDefault>
<w:pPr>
<w:spacing w:after="200" w:line="276" w:lineRule="auto"/>
</w:pPr>
</w:pPrDefault>
This is a special case, because if this is merged into a style used in a table, it will affect table row heights: Word applies different layout rules inside a table cell, depending on whether this setting is in w:docDefaults or a paragraph style.
So w:spacing is not copied from w:docDefaults.
Only the value from the first BlockRange is used, and if there are differing values in subsequent input documents, that information is lost.
So for best results, you should ensure each input document uses the same w:spacing setting in its w:docDefaults (no setting for w:spacing is a good option).
Microsoft Word provides ways to edit your document defaults, but no easy way to be sure what the settings are (since the Word interface conflates the default paragraph style (eg Normal) and DocDefaults/pPrDefault!).
To see the actual settings, we recommend looking at the raw XML.
There are a few different ways to do this:
In Java
// Given WordprocessingMLPackage
org.docx4j.wml.Styles styles = (org.docx4j.wml.Styles)wmlPkg.getMainDocumentPart().getStyleDefinitionsPart().getJaxbElement();
System.out.println(org.docx4j.XmlUtils.marshaltoString(styles.getDocDefaults()));
or just:
System.out.println(wmlPkg.getMainDocumentPart().getStyleDefinitionsPart().getXML());
They’ll be at the top.
or use the Docx4j Helper Word Addin (v3.3)
Clicking that, you’ll see your w:docDefaults in an editor window:
If you edit the XML then click the apply button, the result will be a new docx containing your new settings.
or, unzip the docx, then open styles.xml
or use the webapp, to navigate to the styles part
or, if you have Visual Studio, use the Open XML Package Editor for Visual Studio: https://visualstudiogallery.msdn.microsoft.com/450a00e3-5a7d-4776-be2c-8aa8cec2a75b
With that you can drag your docx onto Visual Studio, then navigate the tree to the styles part. You can edit and save your changes.
With some of the above approaches, you can edit your w:docDefaults.
Alternatively, you can do this in Word:
· To set paragraph level doc default properties, right click then choose “Paragraph” from the context menu. In the dialog, the key is the "set as default" button.
· To set run level doc default properties, right click then choose “Font” from the context menu. Again, when you have things set as you wish, click the "set as default" button.
[1] Since docx4j 2.7.0, you can also use the ContentAccessor interface (which is supported by various objects):
public List<Object> getContent()
[2] http://www.eggheadcafe.com/microsoft/Word-Page-Layout/35979216/total-page-number-minus-number-of-pages-in-front-matter.aspx
http://wordribbon.tips.net/T010604_Field_Reference_to_Number_of_Prior_Pages.html