[ dspace-Patches-2234659 ] Add support for DjVu-documents


Patches item #2234659, was opened at 2008-11-07 17:06
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Serhij Dubyk (dubyk)
Assigned to: Nobody/Anonymous (nobody)
Summary: Add support for DjVu-documents

Initial Comment:
Hello All

This patch is based on

For DSpace 1.5.0+, do the following before compilation:

1) Install the djvutxt utility (package djvulibre). On Debian:
   apt-get install djvulibre-bin

2) Edit [dspace-source]/dspace/config/dspace.cfg. In the text block "### Media Filter / Format Filter plugins",
add DjVu support in 3 places:

   filter.plugins = ... \
                DjVu Text Extractor

   plugin.named.org.dspace.app.mediafilter.FormatFilter = ... \
  org.dspace.app.mediafilter.DjVuFilter =  DjVu Text Extractor

   filter.org.dspace.app.mediafilter.DjVuFilter.inputFormats = DjVu

3) Edit [dspace-source]/dspace/config/registries/bitstream-formats.xml
and add an entry for the DjVu format:


4) Create the file [dspace-source]/dspace-api/src/main/java/org/dspace/app/mediafilter/DjVuFilter.java
with the following content:

/*
 * Version: 0.1
 * DSpace version: 1.4.2 beta
 * Author: Ivan Penev
 * e-mail: inpenev at gmail.com
 */

package org.dspace.app.mediafilter;

import java.io.InputStream;
import java.io.FileInputStream;
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.io.BufferedOutputStream;
import java.io.FileReader;
import java.io.BufferedReader;
import java.io.File;

/**
 * This class provides a media filter for processing files of type DjVu.
 * <p>The current implementation uses a program called
 * <code>djvutxt</code>, which extracts the text layer from a previously
 * OCR-ed DjVu file and saves it into a UTF-8 text document. The program
 * is distributed with the <code>djvulibre</code> package, which is freely
 * available under the GPL license for both Unix and Windows operating
 * systems. Hence, for the media filter to work, <code>djvutxt</code> must
 * be a valid command in the working environment.</p>
 */
public class DjVuFilter extends MediaFilter
{
    /**
     * Get a filename for a newly created filtered bitstream.
     *
     * @param sourceName
     *            name of source bitstream
     * @return filename generated by the filter - for example, document.djvu
     *         becomes document.djvu.txt
     */
    public String getFilteredName(String sourceName)
    {
        return sourceName + ".txt";
    }

    /**
     * Get the name of the bundle in which this filter will store its
     * generated bitstreams.
     *
     * @return "TEXT"
     */
    public String getBundleName()
    {
        return "TEXT";
    }

    /**
     * Get name of the bitstream format returned by this filter.
     *
     * @return "Text"
     */
    public String getFormatString()
    {
        return "Text";
    }

    /**
     * Get a string describing the newly-generated bitstream.
     *
     * @return "Extracted text"
     */
    public String getDescription()
    {
        return "Extracted text";
    }

    /**
     * Get a bitstream filled with the extracted text from a DjVu bitstream.
     * <p>The bitstream supplied as a parameter is written to a DjVu
     * file on the file system (in the working directory), and the system
     * command <code>djvutxt</code> is called on the latter to produce a
     * UTF-8 text file containing the extracted text. The file is then copied
     * to a bitstream. Finally, the auxiliary files are removed from the file
     * system, and the generated bitstream is returned as a result.</p>
     * <p>WARNING! Write access to the working directory is needed for
     * this method to operate! No exception handling provided!</p>
     *
     * @param source
     *            input stream
     * @return result of filter's transformation, written out to a bitstream
     */
    public InputStream getDestinationStream(InputStream source) throws Exception
    {
        /* Some convenience initializations. */
        final String cmd = "djvutxt";
        final String fileName = "aux";
        final String djvuFileName = fileName + ".djvu";
        final String txtFileName = fileName + ".txt";

        /* Store input bitstream to auxiliary DjVu file. */
        File djvuFile = streamToFile(source, djvuFileName);

        /* Invoke external command djvutxt with appropriate arguments
           to do the actual job... */
        final String[] cmdArray = {cmd, djvuFileName, txtFileName};
        Process p = Runtime.getRuntime().exec(cmdArray);

        /* ...and wait for it to terminate. */
        p.waitFor();

        /* Copy extracted text from file to an independent bitstream,
           and optionally print the text to standard output. */
        File txtFile = new File(txtFileName);
        InputStream dest = fileToStream(txtFile, MediaFilterManager.isVerbose);

        /* Then remove auxiliary files... */
        djvuFile.delete();
        txtFile.delete();

        /* ...and return resulting bitstream. */
        return dest;
    }

    /**
     * Write given input stream to a file on the file system.
     * <p>WARNING! No exception handling!</p>
     *
     * @param inStream input stream
     * @param fileName name of the file to be generated
     * @return <code>File</code> object associated with the generated file
     * @throws Exception
     */
    private File streamToFile(InputStream inStream, String fileName)
            throws Exception
    {
        /* Data will be read from input stream in chunks of size e.g. 4KB. */
        final int chunkSize = 4096;
        byte[] byteArray = new byte[chunkSize];

        /* Open the stream for buffered reading. */
        InputStream bufInStream = new BufferedInputStream(inStream);

        /* Name the target file (an existing file of the same name will be
           overwritten) to store the supplied bitstream... */
        File file = new File(fileName);

        /* ...and associate a buffered output stream with it. */
        OutputStream bufOutStream = new BufferedOutputStream(new
                FileOutputStream(file));

        /* Copy data from input stream to newly generated file. */
        int readBytes = -1;
        while ((readBytes = bufInStream.read(byteArray, 0, chunkSize)) != -1)
        {
            bufOutStream.write(byteArray, 0, readBytes);
        }

        /* Stop transactions to the file system... */
        bufOutStream.close();
        bufInStream.close();

        /* ...and return result. */
        return file;
    }

    /**
     * Produce input stream from a given file on the file system.
     * <p>WARNING! No exception handling!</p>
     *
     * @param file <code>File</code> object associated with the given file
     * @param verbose whether to also print the file contents to standard output
     * @return input stream containing the data read from file
     * @throws Exception
     */
    private InputStream fileToStream(File file, boolean verbose) throws Exception
    {
        /* Open the stream for reading. */
        InputStream inStream = new FileInputStream(file);

        /* Allocate necessary memory for data buffer. */
        byte[] byteArray = new byte[(int)file.length()];

        /* Load file contents into buffer; a single read() call may return
           fewer bytes than requested, so repeat until the buffer is full. */
        int offset = 0;
        int readBytes = -1;
        while (offset < byteArray.length
                && (readBytes = inStream.read(byteArray, offset,
                        byteArray.length - offset)) != -1)
        {
            offset += readBytes;
        }

        /* And immediately close transactions with the file system. */
        inStream.close();

        /* If required to send the retrieved data to standard output... */
        if (verbose)
        {
            /* Open the file again, but this time handle it as a character stream... */
            BufferedReader bufReader = new BufferedReader(new FileReader(file));

            /* ...then print its contents line by line to the standard output... */
            String lineOfText = null;
            while ((lineOfText = bufReader.readLine()) != null)
            {
                System.out.println(lineOfText);
            }

            /* ...and close connection to the file. */
            bufReader.close();
        }

        /* Finally, generate and return input stream containing desired data. */
        return new ByteArrayInputStream(byteArray);
    }
}
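
The filter above writes its auxiliary files under the fixed name "aux" in the working directory, so two filter runs at the same time would overwrite each other's files, and AUX is a reserved device name on Windows. Below is a minimal sketch of one possible alternative, assuming Java 7 or later. It uses uniquely named temporary files and checks the exit status of djvutxt. The class DjVuTextExtractor and all of its methods are hypothetical names used only for illustration; they are not part of DSpace or of this patch.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

/* Hypothetical helper, not part of DSpace or of the patch. */
class DjVuTextExtractor
{
    /* True if an executable file called 'name' exists in one of the
       directories of a PATH-style string. Assumption: the exact file name
       is given (on Windows that would be "djvutxt.exe"). */
    public static boolean isOnPath(String name, String pathValue)
    {
        if (pathValue == null)
        {
            return false;
        }
        for (String dir : pathValue.split(File.pathSeparator))
        {
            if (dir.isEmpty())
            {
                continue;
            }
            File candidate = new File(dir, name);
            if (candidate.isFile() && candidate.canExecute())
            {
                return true;
            }
        }
        return false;
    }

    /* Same argument order as in the filter: djvutxt <input> <output>. */
    public static List<String> buildCommand(Path djvu, Path txt)
    {
        return Arrays.asList("djvutxt", djvu.toString(), txt.toString());
    }

    /* Copy the stream into a uniquely named file in the system temp directory. */
    public static Path copyToTemp(InputStream in, String suffix) throws IOException
    {
        Path tmp = Files.createTempFile("djvufilter", suffix);
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /* Run djvutxt on the stream and return the extracted UTF-8 text. */
    public static String extractText(InputStream source)
            throws IOException, InterruptedException
    {
        Path djvu = copyToTemp(source, ".djvu");
        Path txt = Files.createTempFile("djvufilter", ".txt");
        try
        {
            Process p = new ProcessBuilder(buildCommand(djvu, txt))
                    .inheritIO()   /* let djvutxt print its errors to our console */
                    .start();
            int exit = p.waitFor();
            if (exit != 0)
            {
                throw new IOException("djvutxt exited with status " + exit);
            }
            return new String(Files.readAllBytes(txt), StandardCharsets.UTF_8);
        }
        finally
        {
            /* Remove the auxiliary files even if djvutxt failed. */
            Files.deleteIfExists(djvu);
            Files.deleteIfExists(txt);
        }
    }
}
```

A caller could first test DjVuTextExtractor.isOnPath("djvutxt", System.getenv("PATH")) and skip the filter with a clear message when the utility is not installed.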

5) Compile (or recompile):
   cd [dspace-source]/dspace/dspace-1.5.0-src-release/dspace/
   mvn package

6) Install. If you are recompiling an existing installation instead, edit the working bitstream-formats.xml and dspace.cfg as above, then replace dspace-api-1.5.0.jar in the folders webapps/jspui/WEB-INF/lib/, lib/, webapps/lni/WEB-INF/lib/, webapps/oai/WEB-INF/lib/ and webapps/xmlui/WEB-INF/lib/ with the compiled [dspace-source]/dspace-api/target/dspace-api-1.5.0.jar

7) Don't forget to restart Tomcat and run

With best regards
 Serhij Dubyk


Dspace-devel mailing list