diff --git a/doc/README.Converters.md b/doc/README.Converters.md index f056f6770..3326d2526 100644 --- a/doc/README.Converters.md +++ b/doc/README.Converters.md @@ -1,20 +1,37 @@ # Commands for converting documents -This file contains commands for converting different document types -into +SeedDMS has a very sophisticated file conversion process which could +be used to convert any format into any other format, if there is either +a command (on the command line) or a SeedDMS extension with php code +doing the conversion. This could of course use an external service +(e.g. Tika) for doing the conversion. There are already several +extensions for this purpose and SeedDMS provides some built-in +conversions as well. Traditionally, conversion was just used +internally by SeedDMS (and this is still the main purpose), but +this may not be the only use case. + +This file only contains commands for converting different document +types into * text (for fulltext search) * png (for preview images) * pdf (for pdf documents) -Such conversions may not necessarily output an excact equivalent of +Most of the required commands can easily be installed on a Linux +server, which is the preferred platform anyway. Other operating +systems may work as well, but your mileage may vary. + +The conversion commands can be configured in the settings of SeedDMS. + +A conversion may not necessarily output an exact equivalent of the input file, but outputs a suitable representation, e.g. converting an mp3 file into text may output the metadata or even the lyrics of the song. Converting it into a preview image may result -in a picture of the album cover. +in a picture of the album cover, or a graphical representation +of the spectrum. -Please note, that when ever a command outputs anything to stderr, -this will considered as a failure of the command. Most command line +Please note that whenever a command outputs anything to stderr, +this will be considered as a failure of the command. 
Most command line programs have a parameter (.e.g. `-q`) to suppress such an output. If you run php-fpm you may encounter problems with charsets based on @@ -25,21 +42,28 @@ UTF-8 chars. In such a case you may want to set `clear_env=no` in php-fpm's configuration. On Debian this is done in the file `/etc/php//fpm/pool.d/www.conf`. Search for `clear_env`. +The following sections will list possible conversion commands for +extracting text, creating an image, and converting to pdf. + ## Conversion to text for fulltext search ### text/plain, text/csv, application/csv `cat '%s'` +Unless you run a very old version of SeedDMS, you will never need +this command for converting text files. SeedDMS has this trivial +converter built in. + ### application/pdf `pdftotext -q -nopgbrk %s - | sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g'` -If pdftotext takes too long on large document you may want to pass parameter -`-l` to specify the last page to be converted. `-q` is for suppressing error/warnings -send to stderr +If pdftotext takes too long on large documents, you may want to +pass the parameter `-l` to specify the last page to be converted. `-q` is +for suppressing errors/warnings sent to stderr -`mutool draw -F txt -q -N -o - %s ` +`mutool draw -F txt -q -N -o - %s` ### application/vnd.openxmlformats-officedocument.wordprocessingml.document @@ -65,6 +89,8 @@ send to stderr `html2text %s` +### Many office formats + Many office formats can be converted with `unoconv`, though this turned out in the past to sometimes crash or taking a long time. @@ -79,6 +105,13 @@ plain text in return. Of course this requires to first install Apache Tika when using the docker image. +Finally, there is a SeedDMS extension +[unoserver](https://codeberg.org/SeedDMS/unoserver) which is based +on a project also called +[unoserver](https://github.com/unoconv/unoserver) and which is +available as a docker image, making it quite easy to set up. 
Read the +documentation of the extension for more information. + ## Conversion to pdf for pdf preview ### text/plain, text/csv, application/csv, application/vnd.oasis.opendocument.text application/msword, application/vnd.wordperfect, text/rtf @@ -89,7 +122,10 @@ image. `convert -density 300 '%f' 'pdf:%o'` -Actually `convert` can be used for many other image formats. +Actually `convert` can be used for many other image formats. There is +also a SeedDMS extension called +[convert_image](https://codeberg.org/SeedDMS/convert_image) which +embeds the image into a pdf file. ### image/svg+xml @@ -125,15 +161,23 @@ Converting from application/x-xopp to pdf only works if the xopp file does not use a pdf document as a background, because this pdf is not stored in the xopp file. +### Many office formats + +As already mentioned above, `unoconv` has some disadvantages. It is +recommended to use the `unoserver` SeedDMS extension already described +above. + ## Conversion to png for preview images If you have problems running convert on PDF documents then read the page https://askubuntu.com/questions/1081895/trouble-with-batch-conversion-of-png-to-pdf-using-convert It basically instructs you to comment out the line +``` +``` -in /etc/ImageMagick-6/policy.xml +in `/etc/ImageMagick-6/policy.xml` `convert` determines the format of the converted image from the extension of the output filename. SeedDMS usually sets a propper extension when running @@ -180,7 +224,11 @@ why the input needs to be recoded with iconv or recode. 
### application/msword, application/vnd.oasis.opendocument.spreadsheet, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.wordprocessingml.document, text/rtf, application/vnd.ms-powerpoint, text/csv, application/csv, application/vnd.wordperfect, - `unoconv -d document -e PageRange=1 -f pdf --stdout -v '%f' | gs -dBATCH -dNOPAUSE -sDEVICE=pngalpha -dPDFFitPage -r72x72 -sOutputFile=- -dFirstPage=1 -dLastPage=1 -q - | convert -resize %wx png:- 'png:%o'` +`unoconv -d document -e PageRange=1 -f pdf --stdout -v '%f' | gs -dBATCH -dNOPAUSE -sDEVICE=pngalpha -dPDFFitPage -r72x72 -sOutputFile=- -dFirstPage=1 -dLastPage=1 -q - | convert -resize %wx png:- 'png:%o'` + +If you are looking for an easier solution, you should consider +installing the `unoserver` SeedDMS extension, which was already described +above. ### video/webm, video/mp4