mirror of
https://git.code.sf.net/p/seeddms/code
synced 2025-11-27 10:00:41 +00:00
more formating fixes, additional information
parent
ed36b88a09
commit
fc8d01df6b
|
|
@ -1,20 +1,37 @@
|
||||||
# Commands for converting documents
|
# Commands for converting documents
|
||||||
|
|
||||||
This file contains commands for converting different document types
|
SeedDMS has a very sophisticated file conversion process which could
|
||||||
into
|
be used to convert any format into any other format, if there is either
|
||||||
|
a command (on the command line) or a SeedDMS extension with php code
|
||||||
|
doing the conversion. This could of course use an external service
|
||||||
|
(e.g. Tika) for doing the conversion. There are already several
|
||||||
|
extensions for this purpose and SeedDMS provides some built-in
|
||||||
|
conversions as well. Traditionally, conversion was just used
|
||||||
|
internally by SeedDMS (and this is still the main purpose), but
|
||||||
|
this may not be the only use case.
|
||||||
|
|
||||||
|
This file only contains commands for converting different document
|
||||||
|
types into
|
||||||
|
|
||||||
* text (for fulltext search)
|
* text (for fulltext search)
|
||||||
* png (for preview images)
|
* png (for preview images)
|
||||||
* pdf (for pdf documents)
|
* pdf (for pdf documents)
|
||||||
|
|
||||||
Such conversions may not necessarily output an excact equivalent of
|
Most of the required commands can easily be installed on a Linux
|
||||||
|
server, which is the preferred platform anyway. Other operating
|
||||||
|
systems may work as well, but your mileage may vary.
|
||||||
|
|
||||||
|
The conversion commands can be configured in the settings of SeedDMS.
|
||||||
|
|
||||||
|
A conversion may not necessarily output an exact equivalent of
|
||||||
the input file, but outputs a suitable representation, e.g.
|
the input file, but outputs a suitable representation, e.g.
|
||||||
converting an mp3 file into text may output the metadata or even the
|
converting an mp3 file into text may output the metadata or even the
|
||||||
lyrics of the song. Converting it into a preview image may result
|
lyrics of the song. Converting it into a preview image may result
|
||||||
in a picture of the album cover.
|
in a picture of the album cover, or a graphical representation
|
||||||
|
of the spectrum.
|
||||||
|
|
||||||
Please note, that when ever a command outputs anything to stderr,
|
Please note that whenever a command outputs anything to stderr,
|
||||||
this will considered as a failure of the command. Most command line
|
this will be considered as a failure of the command. Most command line
|
||||||
programs have a parameter (e.g. `-q`) to suppress such output.
|
programs have a parameter (e.g. `-q`) to suppress such output.
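
The stderr rule described above can be sketched in shell. This only illustrates the described behaviour and is not how SeedDMS itself is implemented; the helper name `run_strict` is hypothetical.

```shell
# Hypothetical helper, not part of SeedDMS: run a command and report
# failure if it exits non-zero OR writes anything to stderr.
run_strict() {
    errfile=$(mktemp) || return 1
    "$@" 2>"$errfile"
    status=$?
    # Any byte on stderr turns an otherwise successful run into a failure.
    if [ -s "$errfile" ] && [ "$status" -eq 0 ]; then
        status=1
    fi
    rm -f "$errfile"
    return "$status"
}

# A command that succeeds but warns on stderr is rejected.
if run_strict sh -c 'echo text; echo warning >&2' >/dev/null; then
    echo "accepted"
else
    echo "rejected"
fi
# prints "rejected"
```

Wrapping a candidate conversion command this way before configuring it is one way to check whether it needs a quiet flag such as `-q`.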
|
||||||
|
|
||||||
If you run php-fpm you may encounter problems with charsets based on
|
If you run php-fpm you may encounter problems with charsets based on
|
||||||
|
|
@ -25,21 +42,28 @@ UTF-8 chars. In such a case you may want to set `clear_env=no` in
|
||||||
php-fpm's configuration. On Debian this is done in the file
|
php-fpm's configuration. On Debian this is done in the file
|
||||||
`/etc/php/<php version>/fpm/pool.d/www.conf`. Search for `clear_env`.
|
`/etc/php/<php version>/fpm/pool.d/www.conf`. Search for `clear_env`.
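
As a rough sketch of that change: assuming the pool file contains a commented-out line such as `;clear_env = no` (an assumption worth verifying on your system), the helper below uncomments it. The function name `enable_clear_env_no` is made up for illustration, `sed -i` assumes GNU sed, and the demo works on a temporary file rather than the real configuration.

```shell
# Hypothetical helper: activate `clear_env = no` in a php-fpm pool file.
# Assumes the setting is present, possibly commented out with ';'.
enable_clear_env_no() {
    sed -i 's/^[;[:space:]]*clear_env[[:space:]]*=.*/clear_env = no/' "$1"
}

# Demo on a throwaway file instead of the real www.conf.
conf=$(mktemp)
printf '%s\n' '[www]' ';clear_env = no' >"$conf"
enable_clear_env_no "$conf"
grep -n '^clear_env' "$conf"   # prints: 2:clear_env = no
rm -f "$conf"
```

php-fpm has to be reloaded or restarted before the changed setting takes effect.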
|
||||||
|
|
||||||
|
The following sections will list possible conversion commands for
|
||||||
|
extracting text, creating an image, and converting to pdf.
|
||||||
|
|
||||||
## Conversion to text for fulltext search
|
## Conversion to text for fulltext search
|
||||||
|
|
||||||
### text/plain, text/csv, application/csv
|
### text/plain, text/csv, application/csv
|
||||||
|
|
||||||
`cat '%s'`
|
`cat '%s'`
|
||||||
|
|
||||||
|
Unless you run a very old version of SeedDMS, you will never need
|
||||||
|
this command for converting text files. SeedDMS has this trivial
|
||||||
|
converter built in.
|
||||||
|
|
||||||
### application/pdf
|
### application/pdf
|
||||||
|
|
||||||
`pdftotext -q -nopgbrk %s - | sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g'`
|
`pdftotext -q -nopgbrk %s - | sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g'`
|
||||||
|
|
||||||
If pdftotext takes too long on large document you may want to pass parameter
|
If pdftotext takes too long on large documents, you may want to
|
||||||
`-l` to specify the last page to be converted. `-q` is for suppressing error/warnings
|
pass parameter `-l` to specify the last page to be converted. `-q` is
|
||||||
send to stderr
|
for suppressing errors/warnings sent to stderr.
|
||||||
|
|
||||||
`mutool draw -F txt -q -N -o - %s `
|
`mutool draw -F txt -q -N -o - %s`
|
||||||
|
|
||||||
### application/vnd.openxmlformats-officedocument.wordprocessingml.document
|
### application/vnd.openxmlformats-officedocument.wordprocessingml.document
|
||||||
|
|
||||||
|
|
@ -65,6 +89,8 @@ send to stderr
|
||||||
|
|
||||||
`html2text %s`
|
`html2text %s`
|
||||||
|
|
||||||
|
### Many office formats
|
||||||
|
|
||||||
Many office formats can be converted with `unoconv`, though this turned
|
Many office formats can be converted with `unoconv`, though this turned
|
||||||
out in the past to sometimes crash or take a long time.
|
out in the past to sometimes crash or take a long time.
|
||||||
|
|
||||||
|
|
@ -79,6 +105,13 @@ plain text in return.
|
||||||
Of course this requires to first install Apache Tika when using the docker
|
Of course this requires to first install Apache Tika when using the docker
|
||||||
image.
|
image.
|
||||||
|
|
||||||
|
Finally, there is a SeedDMS extension
|
||||||
|
[unoserver](https://codeberg.org/SeedDMS/unoserver) which is based
|
||||||
|
on a project also called
|
||||||
|
[unoserver](https://github.com/unoconv/unoserver) and which is
|
||||||
|
available as a docker image, making it quite easy to set up. Read the
|
||||||
|
documentation of the extension for more information.
|
||||||
|
|
||||||
## Conversion to pdf for pdf preview
|
## Conversion to pdf for pdf preview
|
||||||
|
|
||||||
### text/plain, text/csv, application/csv, application/vnd.oasis.opendocument.text, application/msword, application/vnd.wordperfect, text/rtf
|
### text/plain, text/csv, application/csv, application/vnd.oasis.opendocument.text, application/msword, application/vnd.wordperfect, text/rtf
|
||||||
|
|
@ -89,7 +122,10 @@ image.
|
||||||
|
|
||||||
`convert -density 300 '%f' 'pdf:%o'`
|
`convert -density 300 '%f' 'pdf:%o'`
|
||||||
|
|
||||||
Actually `convert` can be used for many other image formats.
|
Actually `convert` can be used for many other image formats. There is
|
||||||
|
also a SeedDMS extension called
|
||||||
|
[convert_image](https://codeberg.org/SeedDMS/convert_image) which
|
||||||
|
embeds the image into a pdf file.
|
||||||
|
|
||||||
### image/svg+xml
|
### image/svg+xml
|
||||||
|
|
||||||
|
|
@ -125,15 +161,23 @@ Converting from application/x-xopp to pdf only works if the xopp file
|
||||||
does not use a pdf document as a background, because this pdf is not
|
does not use a pdf document as a background, because this pdf is not
|
||||||
stored in the xopp file.
|
stored in the xopp file.
|
||||||
|
|
||||||
|
### Many office formats
|
||||||
|
|
||||||
|
As already mentioned above, `unoconv` has some disadvantages. It is
|
||||||
|
recommended to use the `unoserver` SeedDMS extension already described
|
||||||
|
above.
|
||||||
|
|
||||||
## Conversion to png for preview images
|
## Conversion to png for preview images
|
||||||
|
|
||||||
If you have problems running convert on PDF documents then read the page
|
If you have problems running convert on PDF documents then read the page
|
||||||
https://askubuntu.com/questions/1081895/trouble-with-batch-conversion-of-png-to-pdf-using-convert
|
https://askubuntu.com/questions/1081895/trouble-with-batch-conversion-of-png-to-pdf-using-convert
|
||||||
It basically instructs you to comment out the line
|
It basically instructs you to comment out the line
|
||||||
|
|
||||||
|
```
|
||||||
<policy domain="coder" rights="none" pattern="PDF" />
|
<policy domain="coder" rights="none" pattern="PDF" />
|
||||||
|
```
|
||||||
|
|
||||||
in /etc/ImageMagick-6/policy.xml
|
in `/etc/ImageMagick-6/policy.xml`
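
A minimal sketch of that edit, assuming the policy line appears in the file exactly as quoted above. The function name `comment_out_pdf_policy` is hypothetical, `sed -i` assumes GNU sed, and the demo runs on a temporary file rather than the real `policy.xml`.

```shell
# Hypothetical helper: wrap the PDF coder policy line in an XML comment.
# Lines that are already commented out are left untouched.
comment_out_pdf_policy() {
    sed -i 's|^\([[:space:]]*\)\(<policy domain="coder" rights="none" pattern="PDF" */>\)|\1<!-- \2 -->|' "$1"
}

# Demo on a throwaway file instead of /etc/ImageMagick-6/policy.xml.
policy=$(mktemp)
printf '%s\n' '<policymap>' \
    '  <policy domain="coder" rights="none" pattern="PDF" />' \
    '</policymap>' >"$policy"
comment_out_pdf_policy "$policy"
cat "$policy"
rm -f "$policy"
```

Editing the real file requires root privileges, and it is worth keeping a backup copy first.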
|
||||||
|
|
||||||
`convert` determines the format of the converted image from the extension of
|
`convert` determines the format of the converted image from the extension of
|
||||||
the output filename. SeedDMS usually sets a proper extension when running
|
the output filename. SeedDMS usually sets a proper extension when running
|
||||||
|
|
@ -180,7 +224,11 @@ why the input needs to be recoded with iconv or recode.
|
||||||
|
|
||||||
### application/msword, application/vnd.oasis.opendocument.spreadsheet, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.wordprocessingml.document, text/rtf, application/vnd.ms-powerpoint, text/csv, application/csv, application/vnd.wordperfect
|
### application/msword, application/vnd.oasis.opendocument.spreadsheet, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.wordprocessingml.document, text/rtf, application/vnd.ms-powerpoint, text/csv, application/csv, application/vnd.wordperfect
|
||||||
|
|
||||||
`unoconv -d document -e PageRange=1 -f pdf --stdout -v '%f' | gs -dBATCH -dNOPAUSE -sDEVICE=pngalpha -dPDFFitPage -r72x72 -sOutputFile=- -dFirstPage=1 -dLastPage=1 -q - | convert -resize %wx png:- 'png:%o'`
|
`unoconv -d document -e PageRange=1 -f pdf --stdout -v '%f' | gs -dBATCH -dNOPAUSE -sDEVICE=pngalpha -dPDFFitPage -r72x72 -sOutputFile=- -dFirstPage=1 -dLastPage=1 -q - | convert -resize %wx png:- 'png:%o'`
|
||||||
|
|
||||||
|
If you are looking for an easier solution, you should consider
|
||||||
|
installing the `unoserver` SeedDMS extension which was already described
|
||||||
|
above.
|
||||||
|
|
||||||
### video/webm, video/mp4
|
### video/webm, video/mp4
|
||||||
|
|
||||||
|
|
|
||||||