Optimizing XWiki

Once you've got an XWiki up and running (whether you imported a MediaWiki or not), you'll find you want to tweak the standard rollout a bit.

Speeding up XWiki

After working a while with XWiki, you may notice it getting slower. Our XWiki was kind of slow from the get-go and we pretty quickly figured out why: the slowdown was caused almost entirely by the pretty, DHTML list of all pages in the panel on the right side. It apparently takes quite some time to build this list (and ours included hundreds of pages from the import); it will have to go.

We edited the side panel to replace the culprit Panels.SpaceDocs with Panels.Spaces. This still gives us the ability to navigate directly to a space's WebHome, but doesn't waste time building up enormous lists of pages that we almost never used.1 Navigate to your main page and enjoy your lightning-fast XWiki.

Changing the default editor

Since the WYSIWYG editor trashed some of our converted pages (we never investigated why exactly) we switched the default editor to the XWiki editor.

This can be done for all users by setting the default editor on the Administration/Preferences/Editing page. Each user can override this behaviour on their own user page.

Changes to the default skin

Although the newest XWiki seems to have a skin other than Albatross as a default, we updated the current skin to fix the following things that annoyed us:

  • The default heading sizes are simply enormous
  • The comments/attachments boxes are way too bulky, showing up as huge blue splotches at the bottom of every page.
  • The default formatting for <pre> and <code> blocks takes up way too much space
  • Tables have enormous margins and are bright blue
  • Table-of-contents entries (displayed with the #toc macro) show up with bullets -- even when they're numbered.
  • The print-preview page has a thick, black border around it and it always prints the XWiki logo rather than the logo from the skin
  • The "page not found" marker is over-the-top and distracting
  • Definition lists have only default HTML formatting, which gives the definition block way too much left margin

Our fixes for these things are shown below. YMMV.

Setting Defaults

The first step is to create a custom skin (as described under "changing the logo for the Albatross skin"). Once you've done that, you can set your own style from the same form on which you set the logo property (it should be the second property in the form).

You can type any valid CSS into this box and it will be included in all of your XWiki pages. In fact, if this property is non-empty, it is the only skin-CSS included by XWiki. It sounds ominous; what does it mean? It means that you must copy the entire Albatross skin-CSS to your custom skin in order to at least start off with the base look-and-feel. The best way to do this is with the following standard CSS import2:

@import "http://yourwiki/xwiki/skins/albatross/style.css";

After this import statement, feel free to include any of the CSS snippets below.

Fixing Headings

Each heading has its own CSS class associated with it. You have to set each and every one; our settings are shown below. We used values that more or less matched our corporate identity (CI).

.heading-1 {
  margin: 12pt 0pt 3pt;
}
.heading-1-1 {
  margin: 12pt 0pt 3pt;
}
.heading-1-1-1 {
  margin: 12pt 0pt 3pt;
  font-weight: bold;
}
.heading-1-1-1-1 {
  margin: 12pt 0pt 3pt;
  font-weight: bold;
}
.heading-1-1-1-1-1 {
  margin: 12pt 0pt 3pt;
  font-weight: bold;
}

Fixing Code and Preformatted Blocks

As with most of the other updates, this fix involves removing a lot of padding and margins, as well as toning down the coloring. We use the "Consolas" font by default because we all have Office 2007 installed (you'll want other defaults, like "Monaco" if you're using OS X). Code blocks still have a light-grey background color, whereas pure preformatted blocks no longer have any color or border at all.

.code {
  border: 1px dotted #336699;
  font-family: "Consolas", "Courier New", monospace;
}

.code pre {
  padding: 0px;
  margin: 0px;
  border-width: 0px;
  background-color: #EEEEEE;
}

pre {
  background-color: white;
  border: 0px;
}

Fixing the Table-of-Contents

This is an easy one: simply remove the bullets.

.tocEntry {
  list-style-type: none;
}

Fixing Tables

Tone down colors; remove margins and padding; make the hurting stop. Note the use of the !important CSS directive to force the style to be used over another, more-specific style defined in another stylesheet.

.wiki-table th, .wiki-table td {
  padding: 2px;
}

.wiki-table th {
  padding: 4px 2px;
  font-size: 100%;
  font-weight: bold;
  background-color: #EEEEEE;
  border-bottom: 1px dotted #336699;
}

.wiki-table {
  margin: .5em 0em;
  border: 1px dotted #336699 !important;
  background-color: #EEEEEE;
}

Fixing Definition Lists

Make definition terms bold by default; fix up the margins for the definition body. Definition lists are really useful in wikis, so it's important that they look good.

dt {
  font-weight: bold;
}

dd {
  margin: .15em 1.5em .75em 1.5em;
}

Fixing Comments/Attachments

These areas benefit from the fixes for tables made above, but they still need a bit more massaging before their color and margins are subdued enough -- they are, after all, not the main elements on the page. The fixes involve setting the remaining backgrounds/background-colors to none or transparent.

.xwikiintratitle {
  padding: 8px 0px;
}

#xwikidata #attachmentscontent, #xwikidata #attw, #xwikidata #commentscontent, #xwikidata #commentscontent .xwikititlewrapper {
  background: none;
  margin: 1em 0em;
}

Fixing the "Page Not Found" Marker

This fix removes the background color and turns the question mark brick-red, which is much more subtle than the default.

/* The selector was lost from the original listing; it targets the
   "page not found" question mark in the skin's link markup. */
  background: transparent;
  color: #CC0000;
  font-weight: bold;

Fixing Print-Preview

And, finally, printing: here we remove the print header entirely3 and take out the borders on the two main containers. The border change will affect the non-printing view as well, but it has no visible effect.

/* The selectors were lost from the original listing. The first rule hides
   the print header; the other two remove the borders on the two main
   containers. */
  display: none;

  border: 0px;

  border: 0px;

  1. Another solution would be to build a custom panel that uses AJAX to retrieve the list of pages (paginated, of course) for the space you selected. It could even have a drop-down to search/limit the pages that are shown. Unfortunately, this panel doesn't exist yet and we didn't have time to build it.

  2. Another way to update the stylesheet is to put all of your styles into your own file on your XWiki server, then import that file instead.

  3. The logo is hard-coded and cannot be easily replaced, though you could probably do so by replacing some page, panel or code extension somewhere.

From MediaWiki to XWiki part II


In part I we talked about how to get the information out of our old wiki and transform it to the new format.

The harvested and converted content of our old MediaWiki now sits in a flat directory on our hard drive. But importing these hundreds of files by hand would mean a serious copy-and-paste marathon, which we're not willing to undertake. So we're getting some help from Groovy.


XWiki has an XML-RPC interface that allows us to sneak our pages in. Using Groovy to talk to this interface spares us from writing huge chunks of custom code. You can open a so-called XML-RPC proxy to talk directly to an XML-RPC API. Just create the proxy and use it like a COM object with late binding (and without the other hassles of COM objects):

serverProxy = new XMLRPCServerProxy("http://myserver/xwiki/xmlrpc/confluence")
token = serverProxy.confluence1.login(username, password)

The code above authenticates us on our XWiki running on "myserver". token is the authentication token that we'll have to pass to XWiki with every action.

Now we're going to import our exported MediaWiki pages, selecting them with a regular expression over the page title:

// enumerate files
new File(dirname).eachFileMatch( ~"${pattern}" ) { f ->
  try {
    spaceAndFile = "${space}." + f.getName();
    println "Importing ${spaceAndFile}...";
    page = serverProxy.confluence1.getPage(token, spaceAndFile)
    page["content"] = f.getText().replaceAll( "MySpacePlaceholder", space );
    serverProxy.confluence1.storePage(token, page);
  } catch( Exception e) {
    println "Cannot upload the page!:\n ${e}"
    throw e;
  }
}
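For illustration, here is the same loop sketched in Python, with a hypothetical store_page callback standing in for the XML-RPC getPage/storePage calls; only the placeholder substitution mirrors the real logic:

```python
import fnmatch
import os

def import_pages(dirname, pattern, space, store_page):
    """Upload every file matching `pattern`, rewriting the space placeholder."""
    for name in sorted(os.listdir(dirname)):
        if not fnmatch.fnmatch(name, pattern):
            continue
        with open(os.path.join(dirname, name), encoding='utf-8') as f:
            # Replace the placeholder with the target space, as in the Groovy version.
            content = f.read().replace('MySpacePlaceholder', space)
        # A real implementation would call the XML-RPC storePage here.
        store_page('%s.%s' % (space, name), content)
```

A caller would pass whatever upload function it uses as store_page, which also makes the substitution easy to test in isolation.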

The variables used in this piece of code are:

dirname

The directory our pages are located in.

pattern

Filename pattern. In part I we exported all of our pages, converting to the XWiki markup using the page title (URI) as the filename.

space

The name of the XWiki space to import the pages into.

Every occurrence of "MySpacePlaceholder" is replaced by the space name, and the source file is deleted after a successful import. If you don't care about spaces at all, specify the space as "Main" and the pattern as "*".

You can fetch the whole file (with command-line handling, etc.) here: import.groovy.


If you're happy with only one space, this solution will suit you fine. Go to the directory to which you've exported the wiki content and run the following line (the pattern is quoted so that your shell doesn't expand it):

groovy import.groovy . '*' Main

This will import all articles in the current directory and put them into the "Main" space.

If you'd like to sort the articles in advance (like we did) you'll have to wait for my next article, describing a more sophisticated replacement of the MySpacePlaceholders.

Tech Tip: What is a Firewall?

The term "firewall" is on everyone's lips these days whenever the conversation turns to security on the Internet. Since computers and fire don't really go together, the question arises where the term comes from. In building construction, a firewall serves to seal off a fire that has broken out from the other parts of the building. Applied to computers, this means that a firewall is supposed to protect a computer or an entire network. The protection is directed not against real fires, but against the digital dangers of the Internet.

To perform this task, a firewall is installed between the Internet and the PC or network to be protected. The firewall checks all data flowing in both directions between the Internet and the PC against certain rules and blocks any data that a set of rules identifies as dangerous. The rules are normally designed so that it is very difficult to access the protected machine from the Internet, but easy for the protected machine to access the Internet. Depending on the firewall, the user has more or fewer options for influencing these rules.

The frequently heard distinction between software and hardware firewalls might give the impression that some firewalls get by without software and others need no hardware. That is not the case. This somewhat unfortunate distinction separates firewalls that run on hardware used solely for firewall duties from firewalls that are installed as additional software on a workstation.

Because a firewall installed as additional software can more easily be impaired or switched off entirely by a user mistake, a software update or an operating-system error, hardware firewalls are generally more secure. Most of the better (ADSL) routers offer a simple, built-in firewall. With these products it is always important to change the default passwords that protect the firewall's settings, and to enable only exactly those functions that you really need. Some of these firewalls allow remote administration over the Internet in their default configuration, and if that is protected only by default passwords (e.g. admin and 1234), then even the firewall is of no use.

As with a real fire wall, absolute protection is not possible with a computer firewall either. But it is definitely worthwhile to protect yourself with a firewall, both at work and at home.

P.S.: A simple hardware firewall costs about CHF 200. Set against the possible consequences of connecting to the Internet without a firewall, this investment pays for itself very quickly.

Tech Tips: What is "Web 2.0"?

Unlike most other major version changes in software1, you don't need to upgrade anything, buy anything or even do anything in order to use version 2.0 of the web -- in fact, you've probably already used it!

As indicated in the article, What Is Web 2.0 (Design Patterns and Business Models for the Next Generation of Software), the term originally arose from a "conference brainstorming session between O'Reilly and MediaLive International". It indicates the shift in emphasis from a relatively strong separation between content producers and content consumers (or browsers) to a model of collaborative content, where both sites and users post content. Consider how useful Amazon was as a one-stop shopping center for books and now think of how much more useful it is because of wishlists and user reviews. A site like YouTube is almost purely collaborative in that YouTube has only user content -- video, comments and even the ratings all come from the community.

There is a common misconception that Web 2.0 indicates that a web site uses a higher level of technology or a more modern design. Though many 2.0 sites do have a very polished look and generally make heavy use of cutting-edge technology, this is not what makes them Web 2.0. What does is that they use these technologies to make it easier for users to create their own content, like comments, blogs or reviews. Many sites offer unified logins using OpenID or allow users to combine content using "mashups". Ajax technology -- where a browser manipulates only part of a page -- enhances usability considerably when a user adds a comment, a vote or a review.

Naturally, the term Web 2.0 is going to be misused by marketing departments, intent on sounding modern, but don't be fooled -- a colorful logo with a nonsense name doesn't get you, the user, anything. The next generation of the web is about Wikipedia rather than Britannica, where participation trumps publishing.

  1. See Tech Tips: Software Versions for more information.

Tech Tips: Software Versions

Most computer software has some sort of a version number. The version is composed of two to four numbers, separated by decimal points. More recently, some mainstream products have taken a page from the automobile industry by using the year the product was issued in the product name. Even these products, however, have an internal version number, usually available from an "About this application..." information dialog.

These numbers are called the major, minor, revision and build numbers. A change in any one number generally resets all of the numbers to its right back to zero. The following list indicates what a change in each number generally means:1


Major

Indicates a significant change in the application. This version number is changed only when the file format for the application has changed irreversibly (if it has a main file format). It is also used to indicate a significant change in how an application is either deployed or used.

Minor

Contains both major and minor bug-fixes; may also contain new functions that enhance or extend functionality already available, without changing data formats irreversibly or changing how an application is deployed or used.

Revision

Contains minor bug-fixes only, which have little to no chance of causing other failures. Bug-fixes with unverifiable side-effects on functionality should cause a minor version number change instead.

Build

An internal number used to pinpoint exactly which source code contributed to the version; used by developers to reproduce errors encountered in the field.
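These conventions make version strings directly comparable once they're split into numeric tuples. A minimal sketch (the helper below is hypothetical; real vendors deviate from these guidelines):

```python
def parse_version(version: str) -> tuple:
    """Split 'major.minor.revision.build' into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split('.'))

# Tuple comparison orders versions the way the numbering scheme intends:
assert parse_version('1.9.9') < parse_version('2.0.0')   # a major change trumps the rest
assert parse_version('2.0.10') > parse_version('2.0.9')  # numeric, not lexicographic
```

Comparing the raw strings would get '2.0.10' vs. '2.0.9' wrong, which is exactly why the numeric split matters.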

  1. These are certainly not laws and most probably not rules and are best described as guidelines of which many software vendors are not even aware. Your mileage may vary.

Tech Tip: What is "Web 2.0"?

Almost every time a version change is announced in the computer world, you're forced to upgrade your machine or at least buy new licenses. In the case of Web 2.0, you don't have to do anything at all to use it -- we have probably all used it already!

As described in the article What Is Web 2.0 (Design Patterns and Business Models for the Next Generation of Software), the term was coined during a meeting between O'Reilly and MediaLive International. With it they wanted to describe a new Internet based on a model of collaboration, rather than the fairly clear separation between producers and consumers of content that existed until then. Amazon was already very useful as a simple shopping center for books, but it became much better through user-created wishlists and reviews. A web site like YouTube consists almost entirely of user content -- the videos, comments and even the ratings all come from the community.

Web 2.0 does not mean that a web site uses better technology or a more modern design. It is true that most 2.0 web sites are very polished and may well use the latest technologies, but that is not what makes them Web 2.0. They may only be called Web 2.0 if these technologies serve the easy creation of content (e.g. comments, blogs or reviews) by the user. Many web sites offer a unified login through OpenID or make it possible to combine content from different sources through mashups. Ajax technology -- where a browser manipulates only part of a page -- considerably improves a page's usability.

Naturally, a term like Web 2.0 gets misused by marketing departments to convince users that they, too, are on board. But beware! A colorful logo and a nonsense name change nothing at all for the user. In the next generation of the web, participation comes before publishing.

Tech Tips: What is a Firewall?

Whenever there are discussions about security on the Internet, you can be sure to hear the word "firewall". Since fire and computers don't really go well together, it's not naive to ask how this term came into use. In architecture, a firewall is a structure that contains a fire, preventing it from burning down the whole building (akin to the fire-doors in large buildings, like schools or hotels). Translated to the world of computers, a firewall shields a computer -- or a whole network -- from other computers. Instead of protecting against fire, it protects against the digital dangers of the Internet.

In order to fulfill this duty, the firewall is inserted "between" a computer (or network) and the Internet. It monitors all data coming and going from the computer that it's protecting, applying certain rules to block all data that one or another rule has identified as dangerous. The rules are normally written so that it's very difficult to connect from the Internet to the protected computer, but easy for the protected computer to connect to the Internet. How much control the user has over these rules depends on the firewall.

The distinction between software and hardware firewalls is often made, leaving the impression that certain firewalls don't need software and others don't need hardware. This is not so. This somewhat unfortunate distinction separates firewall software that runs on dedicated hardware from that which is installed on a normal computer.

Hardware firewalls are generally considered more secure because there is far less of a chance that a software-update, a misconfiguration by the user or an error in the operating system would either partly or completely turn off the firewall than is the case with a software firewall. Most of the better (ADSL-) routers feature a simple, built-in firewall. With these products, however, it is extremely important to change all of the standard passwords and to restrict the functionality to only that which is absolutely necessary. This is the case because most of these products allow remote administration by default and if the password is not changed, then anyone who knows the standard password for that router model can get in and configure the firewall, in which case it won't help you at all.

As with a real-world firewall, there are no absolute guarantees with a computer firewall. However, no one should be on the Internet without one, whether on a private or office computer. A simple hardware firewall costs about CHF 200. Compared to what it would cost were your computer to be compromised, this is a good investment.

Tech Tips: "Phishing" and "Trojan Horses"

"Tech Tips" provides quick definitions and tips about some of the more common, but baffling, terms and topics swirling around the tech world.

This time around, we define a couple of terms often used when discussing computer security:


Phishing

This scam tricks people into providing personal information by using fake -- but very official-looking -- emails or web sites. These usually involve requests for bank information, reassuring the victim that this is a "completely standard account verification" or offering some other convincing reason.1 The mail asks that the user either reply directly to the email or browse to the bank's web site using a link in the email. The link, however, doesn't go to the bank's home page; it goes instead to a copy of the bank's web site. This copy, instead of providing the expected services, stores the information the user enters -- like password and account number -- so that the "phisher" can use it on the real site later. For more information, see phishing.

How to avoid: Don't click links in emails that look official. And, since banks and online shops will never request account or login information in this way, you should never, ever pass it on over the telephone or through an email or web site. Even if you think the email is legitimate, you should open your web browser and navigate to the site manually instead of clicking the link in the mail. This way, you ensure that you are on the right site instead of the phishing site. If you can't find any information about the offer or request mentioned in the email, then it was a phishing scam. In that case, most banks and stores provide a link from which you can report phishing scams.

Trojan Programs

A trojan2 is a program that secretly installs itself on your computer in order to do one or more of the following:

 * Record and send information from your computer -- like passwords, financial information and so on -- to the owner of the trojan software.
 * Hijack your computer to send out spam, attack other computers by flooding them with requests or to send itself to other potential victims.

For more information, see trojan horse.

How to avoid: As with phishing, be extremely careful about what you click or open online. You should never open attachments that don't come from verifiable sources (e.g. an expected mail from a friend or a document from a co-worker). Also be very careful about downloading applications from smaller sites; if you do, check around first to see whether others have already reported trojans attached to the application in question. In all cases, you should make sure that a virus scanner checks any and all programs before you open them.

  1. Phishing scams are often also referred to as "social engineering" scams because they convince people to give up their information willingly instead of hacking accounts or passwords electronically.

  2. These programs get their name from the legendary "Trojan Horse" ploy, in which some "deserting" Greek soldiers offered a wooden horse to the enemy Trojans as a way of appeasing the Gods and rendering their city impregnable. However, inside the horse were hand-picked Greek warriors, who snuck out of it at night and opened the gates for their fellows waiting outside. For more information, see Trojan Horse

Tech Tips: "Phishing" and "Trojans"

"Tech Tips" provides short definitions and tips for some of the frequently heard but often confusing terms from the Internet world.

This time we take aim at the following security terms:

Phishing

This is a new scam designed to trick people into revealing personal information. The scammers then use this information to enrich themselves at other people's expense. Usually the victim is asked, because of a "routine account verification" or some other reason,1 to send certain information back directly by email or to enter it on a particular web site.

The web site in question is not the original site but a deceptively genuine-looking copy. This visually identical copy does not deliver the expected services; instead, it stores every piece of information the user enters. The scammers can then use the information of everyone who fell for the deception on the real site. More information is available under phishing (source: Wikipedia).

How to protect yourself: Never follow a link in an email, no matter how official the message looks. And because banks and online shops never ask for information in this way, such information should never be given out over the telephone, by email or on a web site. If you're unsure whether the message might be important after all, type the familiar address of the web site into your browser manually instead of following the link. That way you make sure you're on the genuine site and don't end up on a copy. If no corresponding information or requests can be found on the genuine site, the request was a phishing attempt. To help fend off such attacks, most banks and online shops provide a contact link for reporting fraud attempts of this kind.

Trojan Software

A trojan 2 is a program that secretly installs itself on your computer in order to carry out all kinds of tasks.

 * It records information from your computer, e.g. passwords, financial data and the like, and sends it to the sender of the trojan software.
 * It can effectively take over your computer and use it to send spam, to paralyze other computers with mass requests, or to spread itself to further computers.

More information on the topic can be found under trojan horse (source: Wikipedia).

How to protect yourself: As with phishing, great restraint is called for when opening and clicking things on the Internet. Never open unexpected attachments from unknown senders (as opposed to, say, an expected document from a colleague on the club committee). Caution is also advisable when downloading applications from smaller web sites; it's best to ask around first (in online forums) whether anyone has had bad experiences with trojans there. In any case, your virus scanner should check every single program before it is opened.

  1. In connection with phishing, the term "social engineering" is often used as well, because the scammers convince people to hand over important personal information willingly instead of hacking the desired passwords electronically.

  2. These programs take their name from the legend of the Trojan Horse from Greek history. Because the Greeks' siege of Troy was unsuccessful, they resorted to a ruse: while withdrawing behind the nearest hills, they left a large wooden horse before the gates of the city. The Trojans brought the trophy into the city. Under cover of night, a group of Greek warriors climbed out of the wooden horse's belly and opened the city gates so that their comrades could enter, and Troy fell. More information is available under trojan horse (source: Wikipedia)

From MediaWiki to XWiki part I

As announced in our latest newsletter, we're moving our internal wiki from MediaWiki to XWiki, due primarily to MediaWiki's lack of fine-grained permission handling.

XWiki uses so-called "Spaces" to separate content on different topics in its wiki. A page belongs to one such space, but you're free to link between spaces. You can grant or deny access rights per page and per space, and these rights can apply to a single user or a whole group.

After our move to XWiki, we will have several public spaces for development, general information, etc. and some restricted spaces like finances.

The Plan

The move will take place in two phases:

  1. Export / Conversion to the new markup
  2. Import and assign the spaces

We've evaluated the following options to export/convert our pages to XWiki:

  • Move all pages by hand
  • Use one or more regular expressions to convert the output of SpecialPages:Export (a big XML document with ugly CDATA sections)
  • Transform the HTML pages to the XWiki markup using XSLT
  • Write a dialect plugin for HTML::WikiConverter

Moving all our pages by hand was, of course, out of the question. The regular-expression option was canned because it would have been a one-time solution and we'd have had to fetch all pages manually via MediaWiki.

Transforming the HTML page using XSLT would have been a viable solution but extending something existing (HTML::WikiConverter) was more appealing because we could give the community something useful back.


Let's take a high-level look at our solution. We've written two scripts, one for each phase:

wikifetch.pl

A Perl script that utilizes the HTML::WikiConverter Perl module to convert a single HTML page to the XWiki markup (using my XWiki dialect plugin, written to achieve this move).

import.groovy

A Groovy script that bulk-imports all pages into a given space. The pages written by wikifetch.pl are matched by a regular expression and stored to a given space.


HTML::WikiConverter lacked XWiki support but that was easily cured (committing it to CPAN was another issue). Encountering Perl for the first time wasn't as scary as I thought it would be. And after working with it for some time, you'll like the possibilities of compressing multiple lines of code into one small line. (that is one damned slippery slope, though. --ed.)

But HTML::WikiConverter was made for converting single pages. That's where wikifetch.pl comes into play.


This script takes a working-set of Wiki page-names from a file (pending.txt), then downloads & converts them to the XWiki markup. After that, it extracts all internal links and puts them onto the working-stack. The resulting XWiki pages are stored in an output directory, ready for the import.
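The working-stack described above can be sketched as follows (a hypothetical Python rendering; the real script is Perl, and links_of stands in for the download, conversion and link extraction):

```python
def crawl(start_pages, links_of):
    """Process pages breadth-first, queueing each discovered link exactly once."""
    pending = list(start_pages)      # corresponds to @pending_pages
    pending_set = set(start_pages)   # corresponds to %page_is_pending
    done = set()                     # corresponds to %done_pages
    order = []
    while pending:
        page = pending.pop(0)
        order.append(page)           # stands in for _process_wiki_page
        done.add(page)
        pending_set.discard(page)
        for link in links_of(page):  # links extracted from the converted page
            # Skip empty names and pages already done or already queued.
            if link and link not in done and link not in pending_set:
                pending.append(link)
                pending_set.add(link)
    return order
```

The pending set mirrors the queue so that membership checks stay cheap; without it, every discovered link would require a scan of the whole queue.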

In the following section, I'll talk about the details of the implementation. If you don't want to be bothered with that, just skip ahead to the utilization section.


First we have the usual Perl module initialization:

package main;

use warnings;
use strict;

use HTML::WikiConverter;
use HTML::WikiConverter::XWiki;
use Data::Dumper; 
use LWP::Simple;
use URI;

To identify which references link to other wiki pages, we'll need to know the wiki URI:

my $wiki_rel_uri = "/index.php/";
my $wiki_uri = 'http://wiki'.$wiki_rel_uri;

The next few variables will hold our working-stack. Variables prefixed with '%' are hashes (the associative containers you know from your ADT classes); the ones prefixed with '@' are arrays.

my %links = ();
my @pending_pages = ();
my %page_is_pending = ();
my %done_pages = ();

MediaWiki has tons of elements that we neither need nor want in our resulting XWiki markup. So we're defining a hash mapping attribute content to attribute name. The first line will cause the removal of all HTML tags that have a 'class' attribute with the content 'editsection' (e.g. <.. class="editsection" ../>).

my %tags_toRemove = ( 'editsection' => 'class',
                      'toc' => 'class',
                      'column-one' => 'id',
                      'jump-to-nav' => 'id',
                      'siteSub' => 'id',
                      'printfooter' => 'class',
                      'footer' => 'id' );

The following variable contains a regexp that matches all extensions we don't want to process (images & documents):

my $binformat_filters = '(\.jpg|\.png|\.zip|\.odt|\.gif)$';
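To see what the filter does, here is a quick standalone check on a few made-up hrefs (the page names are examples, not from an actual wiki):

```perl
# Standalone check of the extension filter above; the hrefs are
# hypothetical examples.
my $binformat_filters = '(\.jpg|\.png|\.zip|\.odt|\.gif)$';

for my $href ( 'Main_Page', 'Image_Logo.png', 'manual.odt' ) {
    print $href =~ /$binformat_filters/ ? "skip: $href\n" : "keep: $href\n";
}
# prints: keep: Main_Page, skip: Image_Logo.png, skip: manual.odt
```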

The next line is the first that actually executes something:

my $wc = new HTML::WikiConverter(
  dialect => 'XWiki',
  wiki_uri => $wiki_rel_uri,
  preprocess => \&_preprocess,
  space_identifier => 'MySpacePlaceholder' );

We create an instance of the WikiConverter with the dialect XWiki, then give it our URI (needed to determine whether a link is in fact a wiki-link). The next parameter is a reference to our _preprocess function, which will remove extra elements from the HTML tree that would clutter our output (like MediaWiki navigation elements). The space_identifier is an attribute introduced by HTML::WikiConverter::XWiki and defines the space-prefix prepended to all links emitted to the resulting file.

The next few lines, though in Perl, should be self-explanatory:

# read pending pages from my config-file
_read_config();

# create the output directory
mkdir( "output" );

We're slowly approaching the main processing loop of our Perl script:

01. while( scalar( @pending_pages ) > 0 ) {
02.   %links = ();
03.   my $page = shift( @pending_pages );
04.   _process_wiki_page( $page );
06.   # accounting
07.   $done_pages{ $page } = 1;  
08.   delete( $page_is_pending{ $page } );
10.   # check for new pages
11.   map { print "New page '$_'\n"; 
12. 	    push( @pending_pages, $_ );
13. 	    $page_is_pending{ "$_" } = 1; 
14.       } grep {   # not already in progress or done, and non-empty
15.                   $_ if (not ((exists $done_pages{ "$_" }) or (exists $page_is_pending{ "$_" }))) and ($_ !~ '^$')
16.               } keys %links;
17.   my $numDone = scalar(keys %done_pages);
18.   my $numTotal = $numDone + scalar(@pending_pages);
19.   print "Progress: $numDone / $numTotal\n";
20. }

I won't go into details of the above; those of you who are Perl-literate should be able to read it.

We get a page from our pending_pages array (line 3) and send it to our main processing sub (everything is a sub in Perl, that's what I've been told). After processing we mark the page as done (line 7) and remove it from the pending hash. The reason for having both a pending hash and a pending array is that we don't have to search the whole array to check whether a single page is already queued. That's what hashes are for.
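The hash-for-membership trick can be shown in isolation; here's a toy sketch (the page names are made up) of why the script keeps both structures:

```perl
# Toy illustration of the pending-hash idea: membership in a hash is a
# single lookup, while finding a page in the array would need a linear
# scan. Page names here are hypothetical.
my @pending_pages   = ( 'Main_Page', 'Sandbox' );
my %page_is_pending = map { $_ => 1 } @pending_pages;

print "already queued\n" if exists $page_is_pending{ 'Main_Page' };
print "not queued\n" unless exists $page_is_pending{ 'Help_Page' };
```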

Lines 11 to 16 are actually written in the tongue of Mordor; the sound of these words should not be uttered here. After calling _process_wiki_page, which in due course calls _preprocess, all links found in the currently processed page are stored in the links hash. We iterate over this hash and push all pages that are neither done nor pending onto the end of our processing array.

It's now time to generate some statistics for the user. Lines 17-19 do that and print it to the command-line (scalar( xy ) returns an integer representing the element count).

Now that we're done with the above code snippet, we'll dive into our subroutines. The first one reads all pending pages (one per line) from a file called pending.txt. Nothing fancy about it.

sub _read_config {
  print "Reading config...\n";
  @pending_pages = ();
  open FILE, "<pending.txt" or die "Could not open pending.txt";
  while( <FILE> ) {
    chomp;                        # strip the trailing newline from the page name
    push( @pending_pages, $_ );
    $page_is_pending{ $_ } = 1;
  }
  close FILE;
  print "Pending pages:\n";
  print join "\n", @pending_pages;
  print "\nDone reading config\n";
}

In _process_wiki_page, we create the output-file for our XWiki markup and start the actual processing:

sub _process_wiki_page {
  my ( $page_name_orig ) = @_;
  # 'or' instead of '||' here: with '||' the die would bind to the
  # filename expression and never trigger on a failed open
  open FILE, ">output/"."$page_name_orig" or die "Could not create file output/$page_name_orig";
  my $page_name = "$wiki_uri"."$page_name_orig";

  print "Fetching/processing: $page_name\n";
  my $wiki_text = $wc->html2wiki( uri => $page_name );
  print FILE $wiki_text;
  close FILE;

  # check page_translations for the space to put the file into... mkdir on that name and save the file there for uploading...
  print "Processed...\n";
}

Last but not least, we have the _preprocess function. This is called just after HTML::WikiConverter has parsed the input file. The argument is an HTML::Tree object.

sub _preprocess {
  my( $tb ) = @_;

The next lines remove all unwanted MediaWiki nodes (as mentioned above, using the tags_toRemove hash):

# delete all tags below our root node, identified by %tags_toRemove
  # (e.g. remove all elements with the class attribute set to 'editsection')
  map { $_->delete; } map { $tb->look_down( $tags_toRemove{ $_ }, $_ ) } keys %tags_toRemove;
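That one-liner is dense, so here's an expanded, self-contained equivalent. The HTML snippet is made up for illustration; only the %tags_toRemove idea is from the script:

```perl
use HTML::TreeBuilder;

# Expanded version of the nested-map cleanup; the HTML below is a
# hypothetical example, not real MediaWiki output.
my %tags_toRemove = ( 'editsection' => 'class', 'footer' => 'id' );

my $tb = HTML::TreeBuilder->new_from_content(
    '<div><span class="editsection">[edit]</span><p>real content</p></div>' );

for my $value ( keys %tags_toRemove ) {
    my $attr = $tags_toRemove{ $value };          # e.g. 'class'
    for my $node ( $tb->look_down( $attr, $value ) ) {
        $node->delete;                            # drop the whole subtree
    }
}

print $tb->as_text, "\n";   # only 'real content' remains
```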

After the tree has been cleansed, we go after the links (<a/>-tags). These have to be non-empty, must not point to a special page or a file with a binary extension, and should link into our wiki.

# search for <a> tags beginning with the wiki URL and record these
  # keys (minus the URL part) in our link hash
  map {
        $_ =~ s/#(.*)//;       # strip in-page anchors
        $links{ $_ } = 1;
      } grep {   # non-empty, no special pages, no binary files, local to the wiki; strip the local part
                defined( $_ ) and $_ !~ '^$' and $_ !~ '(Special|Image|Help):' and $_ !~ $binformat_filters and $_ =~ /^$wiki_rel_uri/ and $_ =~ s/$wiki_rel_uri//
           } map {
                   $_->attr( 'href' )
                 } $tb->look_down( _tag => 'a' );

What's left is to escape some special characters (this will eventually be moved to HTML::WikiConverter::XWiki):

foreach my $node ( $tb->descendants ) {
    if( !$node->look_up( _tag => 'pre' ) ) {
      my $txt = $node->attr('text') || '';
      $txt =~ s/\\/\\\\\\/g;    # escape backslashes
      $txt =~ s/\[/\\[/g;       # escape opening brackets
      $txt =~ s/\]/\\]/g;       # escape closing brackets
      $node->attr( 'text', $txt );
    }
  }
}
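A quick standalone check of these escape rules on a made-up line of text shows the effect on square brackets (which XWiki would otherwise treat as link markup):

```perl
# Hypothetical input line, run through the same substitutions as above.
my $txt = 'see [PageName]';
$txt =~ s/\\/\\\\\\/g;
$txt =~ s/\[/\\[/g;
$txt =~ s/\]/\\]/g;
print "$txt\n";   # prints: see \[PageName\]
```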

...and we're done. Phew.


To start converting your existing MediaWiki, execute the following steps:

  1. Download wikifetch.pl.
  2. Install the required CPAN-modules with perl -MCPAN -e 'install HTML::WikiConverter::XWiki'
  3. Edit the base_uri of your MediaWiki inside wikifetch.pl
  4. Add Main_Page to pending.txt
  5. Execute perl wikifetch.pl

Now you should have a folder named output containing your wiki-content. You can either add these pages to XWiki by hand ... or wait for my next article to import the pages automatically.


The result of wikifetch.pl is an output-directory consisting of files in the XWiki markup. In the next article we'll learn how to get those files into our XWiki.