seeddms-code/doc/README.Converters.md
2025-10-23 16:05:19 +02:00

Commands for converting documents

SeedDMS has a sophisticated file conversion mechanism which can convert any format into any other format, provided there is either a command (on the command line) or a SeedDMS extension with PHP code doing the conversion. Such an extension may of course use an external service (e.g. Tika) for the conversion. Several extensions already exist for this purpose, and SeedDMS provides some built-in conversions as well. Traditionally, conversion was only used internally by SeedDMS (and this is still the main purpose), but this may not be the only use case.

This file only contains commands for converting different document types into:

  • text (for fulltext search)
  • png (for preview images)
  • pdf (for pdf documents)

Most of the required commands can easily be installed on a Linux server, which is the preferred platform anyway. Other operating systems may work as well, but your mileage may vary.

The conversion commands can be configured in the settings of SeedDMS.

A conversion does not necessarily output an exact equivalent of the input file, but a suitable representation of it. For example, converting an mp3 file into text may output the metadata or even the lyrics of the song. Converting it into a preview image may result in a picture of the album cover or a graphical representation of the spectrum.

Please note that whenever a command writes anything to stderr, this is considered a failure of the command. Most command line programs have a parameter (e.g. -q) to suppress such output.
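Because any output on stderr counts as a failure, it can help to check a candidate command by hand before entering it in the settings. The following is only a minimal sketch: the helper name writes_no_stderr is made up for this example and is not part of SeedDMS. It discards stdout and reports whether anything arrived on stderr.

```shell
#!/bin/sh
# Hypothetical helper (not part of SeedDMS): run a command and succeed
# only if it wrote nothing to stderr.
writes_no_stderr() {
    # 2>&1 sends stderr into the command substitution, >/dev/null then
    # discards stdout, so only the stderr output ends up in $err.
    err=$("$@" 2>&1 >/dev/null)
    [ -z "$err" ]
}

# Example: one command that is silent on stderr, and one that is not.
writes_no_stderr echo "hello" && echo "echo: nothing on stderr"
writes_no_stderr sh -c 'echo warning >&2' || echo "sh: wrote to stderr"
```

If the second case shows up for a real converter, look for a quiet option of that program, such as the -q parameter mentioned above.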

If you run php-fpm you may encounter problems with charsets based on UTF-8. Programs like catdoc read LANG from the environment to set the correct encoding of the output. php-fpm often clears the environment, and programs like catdoc will then no longer output any UTF-8 characters. In such a case you may want to set clear_env=no in php-fpm's configuration. On Debian this is done in the file /etc/php/<php version>/fpm/pool.d/www.conf. Search for clear_env.
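To see whether the setting is really active, the pool file can be searched for a line that is not commented out. This is a small sketch under the assumption that comments in the pool file start with ';'. The function name clear_env_disabled is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given php-fpm pool file contains an
# active line "clear_env = no" (a line starting with ';' is a comment and
# does not match).
clear_env_disabled() {
    grep -Eq '^[[:space:]]*clear_env[[:space:]]*=[[:space:]]*no[[:space:]]*$' "$1"
}

# Usage (replace <php version> with the version installed on your system):
#   clear_env_disabled "/etc/php/<php version>/fpm/pool.d/www.conf" \
#       && echo "clear_env = no is active"
```

After changing the file, php-fpm has to be restarted or reloaded before the new setting takes effect.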

The following sections will list possible conversion commands for extracting text, creating an image, and converting to pdf.
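The commands in the following sections contain placeholders. Judging from how they are used in this file, %s and %f stand for the input file, %o for the output file and %w for the width of the preview image. This reading is an assumption derived from the examples. To try a command on the command line, the placeholders have to be filled in first. The helper below (fill_template, a made-up name) is one possible sketch for doing that.

```shell
#!/bin/sh
# Hypothetical helper (not part of SeedDMS): print a command template with
# its placeholders filled in, so it can be reviewed and run by hand.
# Assumed meaning, inferred from the commands in this file:
#   %s and %f = input file, %o = output file, %w = width in pixels.
# The file names must not contain '|', '&' or backslashes, because they
# are inserted into a sed replacement.
fill_template() {
    template=$1 infile=$2 outfile=$3 width=$4
    printf '%s\n' "$template" | sed \
        -e "s|%s|$infile|g" \
        -e "s|%f|$infile|g" \
        -e "s|%o|$outfile|g" \
        -e "s|%w|$width|g"
}

# Prints: convert -resize 200x 'photo.jpg' 'png:preview.png'
fill_template "convert -resize %wx '%f' 'png:%o'" photo.jpg preview.png 200
```

The printed line can then be pasted into a shell to check that the command works and stays silent on stderr.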

text/plain, text/csv, application/csv

cat '%s'

Unless you run a very old version of SeedDMS, you will never need this command for converting text files. SeedDMS has this trivial converter built in.

application/pdf

pdftotext -q -nopgbrk %s - | sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g'

If pdftotext takes too long on large documents, you may want to pass the parameter -l to specify the last page to be converted. -q suppresses errors and warnings sent to stderr.
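The sed filter appended to the pdftotext command is easy to misread. Going by its two expressions, it drops single characters standing alone between spaces and then deletes all digits and dots. The sketch below wraps the same expressions in a function (strip_noise is a name made up for this example) and applies them to an invented line of text.

```shell
#!/bin/sh
# The same two sed expressions as in the pdftotext command above, wrapped
# in a function so they can be tried on sample text.
strip_noise() {
    # 1st expression: replace a single letter, digit or dot that stands
    #                 alone between two spaces by one space
    # 2nd expression: delete every remaining digit and dot
    sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g'
}

# The lone "3" and "a", the year and the final dot are removed.
printf 'Page 3 of a 2024 report.\n' | strip_noise
```

A side effect of this filter is that numbers in a document are not part of the extracted text.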

mutool draw -F txt -q -N -o - %s

application/vnd.openxmlformats-officedocument.wordprocessingml.document

docx2txt '%s' -

application/msword

catdoc %s

application/vnd.oasis.opendocument.text

odt2txt %s

application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

xlsx2csv -d tab %s

application/vnd.ms-excel

xls2csv -d tab %s

text/html

html2text %s

Many office formats

Many office formats can be converted with unoconv, though in the past it turned out to crash sometimes or to take a long time.

unoconv -d document -f txt --stdout '%s'

Apache Tika is another option for creating plain text from various document types. Just use curl to send the document to your tika server and get the plain text in return.

curl -s -T '%s' http://localhost:9998/tika --header 'Accept: text/plain'

Of course, this requires installing Apache Tika first, for example by using its docker image.

Finally, there is a SeedDMS extension unoserver, which is based on a project also called unoserver. The project is available as a docker image, making it quite easy to set up. Read the documentation of the extension for more information.

Conversion to pdf for pdf preview

text/plain, text/csv, application/csv, application/vnd.oasis.opendocument.text, application/msword, application/vnd.wordperfect, text/rtf

unoconv -d document -f pdf --stdout -v '%f' > '%o'

image/png, image/jpg, image/jpeg

convert -density 300 '%f' 'pdf:%o'

Actually, convert can be used for many other image formats. There is also a SeedDMS extension called convert_image which embeds the image into a pdf file.

image/svg+xml

cairosvg -f pdf -o '%o' '%f'

application/vnd.ms-powerpoint, application/vnd.openxmlformats-officedocument.presentationml.presentation, application/vnd.oasis.opendocument.presentation

unoconv -d presentation -f pdf --stdout -v '%f' > '%o'

application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.oasis.opendocument.spreadsheet

unoconv -d spreadsheet -f pdf --stdout -v '%f' > '%o'

message/rfc822

java -jar emailconverter-2.5.3-all.jar '%f' -o '%o'

The emailconverter can be obtained from https://github.com/nickrussler/email-to-pdf-converter. It requires wkhtmltopdf, which is part of Debian.

text/plain

iconv -c -f utf-8 -t latin1 '%f' | a2ps -1 -q -a1 -R -B -o - - | ps2pdf - -

The parameter -q is important because a2ps sends some statistical data to stderr, which makes SeedDMS believe the command has failed.

application/x-xopp

xournalpp -p "%o" "%f"

Converting from application/x-xopp to pdf only works if the xopp file does not use a pdf document as a background, because this pdf is not stored in the xopp file.

Many office formats

As already mentioned above, unoconv has some disadvantages. It is recommended to use the unoserver SeedDMS extension already described above.

Conversion to png for preview images

If you have problems running convert on PDF documents, read the page https://askubuntu.com/questions/1081895/trouble-with-batch-conversion-of-png-to-pdf-using-convert. It basically instructs you to comment out the line

<policy domain="coder" rights="none" pattern="PDF" />

in /etc/ImageMagick-6/policy.xml
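Whether the rule is still active can be checked without opening the file in an editor. The following sketch uses a made-up function name, pdf_policy_blocked. It assumes that the rule sits on a single line and that a commented-out rule carries the '<!--' marker on that same line. Comments spanning several lines are not recognised by this simple check.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the policy file still contains an active
# rule that denies the PDF coder. A line containing '<!--' is treated as
# commented out.
pdf_policy_blocked() {
    awk '/pattern="PDF"/ && /rights="none"/ && !/<!--/ { found = 1 }
         END { exit !found }' "$1"
}

# Usage:
#   pdf_policy_blocked /etc/ImageMagick-6/policy.xml \
#       && echo "convert is not allowed to read PDF files"
```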

convert determines the format of the converted image from the extension of the output filename. SeedDMS usually sets a proper extension when running the command, but it is nevertheless good practice to set the output format explicitly by prefixing the output filename with 'png:'. This is of course always needed if the output goes to stdout.

image/jpg, image/jpeg, image/png

convert -resize %wx '%f' 'png:%o'

image/svg+xml

cairosvg -f png --output-width %w -o '%o' '%f'

text/plain

convert -density 100 -resize %wx 'text:%f[0]' 'png:%o'

application/pdf

gs -dBATCH -dNOPAUSE -sDEVICE=png16m -dPDFFitPage -r72x72 -sOutputFile=- -dFirstPage=1 -dLastPage=1 -q '%f' | convert -resize %wx png:- '%o'

convert -density 100 -resize %wx '%f[0]' 'png:%o'

mutool draw -F png -w %w -q -N -o '%o' '%f' 1

pdftocairo '%f' -png -singlefile -scale-to-x %w -scale-to-y -1 - > '%o'

pdftocairo needs to write to stdout, because it appends .png to any output file name passed to it.

application/postscript

convert -density 100 -resize %wx '%f[0]' 'png:%o'

text/plain

iconv -c -f utf-8 -t latin1 '%f' | a2ps -1 -q -a1 -R -B -o - - | gs -dBATCH -dNOPAUSE -sDEVICE=png16m -dFirstPage=1 -dLastPage=1 -dPDFFitPage -r72x72 -sOutputFile=- -q - | convert -resize %wx png:- 'png:%o'

On Linux systems you will have to set the desired paper size for a2ps in /etc/papersize, e.g. a4 or letter. Unfortunately, a2ps cannot process utf-8 encoded files. That's why the input needs to be recoded with iconv or recode.

application/msword, application/vnd.oasis.opendocument.spreadsheet, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.wordprocessingml.document, text/rtf, application/vnd.ms-powerpoint, text/csv, application/csv, application/vnd.wordperfect

unoconv -d document -e PageRange=1 -f pdf --stdout -v '%f' | gs -dBATCH -dNOPAUSE -sDEVICE=pngalpha -dPDFFitPage -r72x72 -sOutputFile=- -dFirstPage=1 -dLastPage=1 -q - | convert -resize %wx png:- 'png:%o'

If you are looking for an easier solution, you should consider installing the unoserver SeedDMS extension, which was already described above.

video/webm, video/mp4

This takes the 12th frame of the video and converts it into a png. It requires ffmpeg to be installed.

convert -resize %wx "%f[12]" "png:%o"

You may as well use ffmpeg directly

ffmpeg -i "%f" -ss 00:00:02 -frames:v 1 -loglevel quiet -vf scale=%w:-1 -f apng "%o"

audio/mpeg

sox "%f" -n spectrogram -x 600 -Y 550 -r -l -o - | convert -resize %wx png:- "png:%o"

application/x-xopp

xournalpp -i "%o" --export-png-width=%w "%f"

Converting from application/x-xopp to png only works if the xopp file does not use a pdf document as a background, because this pdf is not stored in the xopp file.