more formatting fixes, additional information

Uwe Steinmann 2025-10-23 15:44:21 +02:00
parent ed36b88a09
commit fc8d01df6b


@ -1,20 +1,37 @@
# Commands for converting documents
SeedDMS has a very sophisticated file conversion process which can
convert any format into any other format, provided there is either
a command (on the command line) or a SeedDMS extension with PHP code
doing the conversion. Such an extension may of course use an external
service (e.g. Tika) for the conversion. There are already several
extensions for this purpose, and SeedDMS provides some built-in
conversions as well. Traditionally, conversion was only used
internally by SeedDMS (and this is still the main purpose), but
this may not be the only use case.
This file only contains commands for converting different document
types into
* text (for fulltext search)
* png (for preview images)
* pdf (for pdf documents)
Most of the required commands can easily be installed on a Linux
server, which is the preferred platform anyway. Other operating
systems may work as well, but your mileage may vary.
The conversion commands can be configured in the settings of SeedDMS.
A conversion may not necessarily output an exact equivalent of
the input file, but outputs a suitable representation, e.g.
converting an mp3 file into text may output the metadata or even the
lyrics of the song. Converting it into a preview image may result
in a picture of the album cover, or a graphical representation
of the spectrum.
Please note that whenever a command outputs anything to stderr,
this will be considered a failure of the command. Most command line
programs have a parameter (e.g. `-q`) to suppress such output.
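To check in advance whether a command would trip over the stderr rule described above, a small wrapper can be used on the command line. This is only a sketch: `check_stderr` is a hypothetical helper and not part of SeedDMS, and it merely mimics the rule that any output on stderr counts as a failure.

```shell
# Hypothetical helper, not part of SeedDMS: run a command and treat
# any output on stderr as a failure, mimicking the rule described above.
check_stderr() {
    errfile=$(mktemp) || return 2
    # stdout passes through unchanged, stderr is collected in a temp file
    "$@" 2>"$errfile"
    status=$?
    # a non-empty stderr file turns even a successful exit into a failure
    if [ -s "$errfile" ]; then
        status=1
    fi
    rm -f "$errfile"
    return "$status"
}

# Example: this command exits with 0 but writes a warning to stderr
if check_stderr sh -c 'echo "some text"; echo "warning" >&2'; then
    echo "would be accepted"
else
    echo "would be treated as failed"
fi
```

Replacing the `sh -c '...'` part with a real conversion command and a sample file shows whether an option like `-q` is needed.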
If you run php-fpm you may encounter problems with charsets based on
@ -25,21 +42,28 @@ UTF-8 chars. In such a case you may want to set `clear_env=no` in
php-fpm's configuration. On Debian this is done in the file
`/etc/php/<php version>/fpm/pool.d/www.conf`. Search for `clear_env`.
The following sections will list possible conversion commands for
extracting text, creating an image, and converting to pdf.
## Conversion to text for fulltext search
### text/plain, text/csv, application/csv
`cat '%s'`
Unless you run a very old version of SeedDMS, you will never need
this command for converting text files. SeedDMS has this trivial
converter built in.
### application/pdf
`pdftotext -q -nopgbrk %s - | sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g'`
If pdftotext takes too long on large documents, then you may want to
pass the parameter `-l` to specify the last page to be converted. `-q`
suppresses errors and warnings sent to stderr.
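The `sed` filter appended to the `pdftotext` command above can be tried on its own without a PDF file. The sketch below wraps the same two expressions in a function; the name `filter_fulltext` and the sample line are made up for illustration. The first expression drops single characters standing alone between two spaces, the second removes all digits and dots.

```shell
# Hypothetical wrapper around the two sed expressions used in the
# pdftotext command above; reads text on stdin, writes it to stdout.
filter_fulltext() {
    sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g'
}

# Made-up sample line: isolated single characters and all digits vanish
printf '%s\n' 'page 3 of 12 a summary' | filter_fulltext
```

Presumably the purpose is to keep page numbers and stray characters out of the fulltext index; note that the filter also strips digits which are part of regular words.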
`mutool draw -F txt -q -N -o - %s`
### application/vnd.openxmlformats-officedocument.wordprocessingml.document
@ -65,6 +89,8 @@ send to stderr
`html2text %s`
### Many office formats
Many office formats can be converted with `unoconv`, though in the past
this turned out to sometimes crash or take a long time.
@ -79,6 +105,13 @@ plain text in return.
Of course, this requires installing Apache Tika first when using the
docker image.
Finally, there is a SeedDMS extension
[unoserver](https://codeberg.org/SeedDMS/unoserver) which is based
on a project also called
[unoserver](https://github.com/unoconv/unoserver) and which is
available as a docker image, making it quite easy to set up. Read the
documentation of the extension for more information.
## Conversion to pdf for pdf preview
### text/plain, text/csv, application/csv, application/vnd.oasis.opendocument.text, application/msword, application/vnd.wordperfect, text/rtf
@ -89,7 +122,10 @@ image.
`convert -density 300 '%f' 'pdf:%o'`
Actually `convert` can be used for many other image formats. There is
also a SeedDMS extension called
[convert_image](https://codeberg.org/SeedDMS/convert_image) which
embeds the image into a pdf file.
### image/svg+xml
@ -125,15 +161,23 @@ Converting from application/x-xopp to pdf only works if the xopp file
does not use a pdf document as a background, because this pdf is not
stored in the xopp file.
### Many office formats
As already mentioned above, `unoconv` has some disadvantages. It is
recommended to use the `unoserver` SeedDMS extension already described
above.
## Conversion to png for preview images
If you have problems running convert on PDF documents then read the page
https://askubuntu.com/questions/1081895/trouble-with-batch-conversion-of-png-to-pdf-using-convert
It basically instructs you to comment out the line
```
<policy domain="coder" rights="none" pattern="PDF" />
```
in `/etc/ImageMagick-6/policy.xml`
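
A quick way to see whether such a line is present is to search the policy file. The following is a minimal sketch; `show_pdf_policy` is a hypothetical helper name, and the default path is the one mentioned above, which may differ on other distributions or ImageMagick versions.

```shell
# Hypothetical helper: list every policy line that mentions the PDF
# coder, together with its line number. It does not detect whether a
# line sits inside an XML comment, so inspect the printed lines by eye.
show_pdf_policy() {
    grep -n 'pattern="PDF"' "${1:-/etc/ImageMagick-6/policy.xml}"
}
```

Calling `show_pdf_policy` without an argument checks the default file; a different path can be passed as the first argument.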
`convert` determines the format of the converted image from the extension of
the output filename. SeedDMS usually sets a proper extension when running
@ -180,7 +224,11 @@ why the input needs to be recoded with iconv or recode.
### application/msword, application/vnd.oasis.opendocument.spreadsheet, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.wordprocessingml.document, text/rtf, application/vnd.ms-powerpoint, text/csv, application/csv, application/vnd.wordperfect
`unoconv -d document -e PageRange=1 -f pdf --stdout -v '%f' | gs -dBATCH -dNOPAUSE -sDEVICE=pngalpha -dPDFFitPage -r72x72 -sOutputFile=- -dFirstPage=1 -dLastPage=1 -q - | convert -resize %wx png:- 'png:%o'`
If you are looking for an easier solution, you should consider
installing the `unoserver` SeedDMS extension which was already described
above.
### video/webm, video/mp4