pkgsrc-Changes-HG archive
[pkgsrc/trunk]: pkgsrc/graphics/tesseract tesseract: updated to 4.0.0
details: https://anonhg.NetBSD.org/pkgsrc/rev/0af26b935f60
branches: trunk
changeset: 314636:0af26b935f60
user: adam <adam%pkgsrc.org@localhost>
date: Sat Nov 03 09:13:07 2018 +0000
description:
tesseract: updated to 4.0.0
V4.0.0:
New OCR engine
- Added a new OCR engine that uses a neural network system based on LSTMs, with major accuracy gains.
- This includes new training tools for the LSTM OCR engine. A new model can be trained from scratch or by fine-tuning an existing model.
- Added trained data that includes LSTM models for 123 languages.
- Added optional accelerated code paths for the LSTM recognizer:
* Using OpenMP
* Using SIMD: AVX2 / AVX / SSE4.1
- Added a new parameter lstm_choice_mode that allows including alternative symbol choices in the hOCR output.
- The new LSTM engine still does not support all features from the old legacy engine (see missing features).
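As a rough illustration of the items above, the following sketch builds a tesseract command line that selects the LSTM engine and sets lstm_choice_mode for hOCR output. The helper name lstm_hocr_command, the file names, and the default value 1 for lstm_choice_mode are assumptions made for this example; the changelog does not say what each value means, so check the documentation of the installed version. The sketch only assembles the argument list and does not run tesseract.

```python
import shlex


def lstm_hocr_command(image, output_base, lang="eng", choice_mode=1):
    """Build (but do not run) a tesseract 4 command line for LSTM + hOCR.

    The function name and defaults are hypothetical; only the command-line
    options themselves come from tesseract.
    """
    return [
        "tesseract", image, output_base,
        "-l", lang,
        "--oem", "1",                             # OCR engine mode 1: LSTM only
        "-c", f"lstm_choice_mode={choice_mode}",  # parameter named in the changelog
        "hocr",                                   # config file selecting hOCR output
    ]


if __name__ == "__main__":
    # Print a shell-quoted version of the command for inspection.
    print(shlex.join(lstm_hocr_command("page.png", "page")))
```

The resulting list can be handed to subprocess.run() on a system where the package and the matching traineddata are installed.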
Other OCR engines
- The pattern matching OCR engine that was the primary OCR engine in previous versions is still available in this version.
- Removed the 'Cube' OCR engine from the codebase. It was used for Hindi and Arabic. The new LSTM engine performs much better, so the Cube engine is no longer needed.
Updated build system
- Tesseract now uses semantic versioning.
- Tesseract now requires Leptonica 1.74.0 or a higher version.
- Building Tesseract from source code requires a compiler with good C++11 support. A list of officially supported compilers is maintained upstream.
- Added unit tests to the main repo. The unit tests require Git submodules and the code for training.
- Added an option to compile Tesseract without the code of the legacy OCR engine.
- Updated the minimum required autoconf version to 2.63.
- Updated the minimum required versions of the training tools' dependencies: ICU 52.1, Pango 1.22.0.
- Reorganized Tesseract's source tree. Most sources are now below the src directory.
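The minimum versions listed above can be checked with a small comparison helper. This is a minimal sketch: the function names, the dictionary keys, and the zero-padding rule are assumptions for illustration, and not every listed dependency follows semantic versioning itself. It also shows how the old version string 3.05.02 and the new 4.0.0 compare once parsed into integers.

```python
def parse_version(text, width=3):
    """Split a dotted version into a tuple of ints, zero-padded to `width` fields."""
    parts = [int(piece) for piece in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


# Minimum versions taken from the changelog; the keys are illustrative labels.
MINIMUMS = {
    "leptonica": "1.74.0",
    "autoconf": "2.63",
    "icu": "52.1",
    "pango": "1.22.0",
}


def meets_minimum(name, installed):
    """Return True if `installed` is at least the changelog minimum for `name`."""
    return parse_version(installed) >= parse_version(MINIMUMS[name])


if __name__ == "__main__":
    print(parse_version("3.05.02") < parse_version("4.0.0"))
    print(meets_minimum("leptonica", "1.73.0"))
```

Padding to a fixed width keeps a short string such as 1.74 from comparing lower than 1.74.0.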
Bug fixes and enhancements
- Fixed many issues that triggered compiler warnings.
- Fixed many issues reported by Coverity Scan or LGTM.
- Fixes to trainingdata rendering.
- Fixed damage to binary images when processing PDFs.
- Don't trigger a deliberate segmentation fault for fatal errors in release code.
- Fixed some issues in OpenCL code. OpenCL now works for the legacy Tesseract OCR engine, but does not improve the performance. It is not implemented for the LSTM OCR engine.
- Improved multi-page TIFF handling.
- Improvements to PDF rendering.
- Added version information and improved help texts to the training tools.
- Added a faster version of log2().
- Documented in the tesseract man page the option to use an input text file that contains a list of images.
- Made 'osd' the default traineddata when psm 0 is requested (currently this feature is only implemented in the command line interface, but not in the API).
- Removed tessedit_pageseg_mode 1 from hocr, pdf, and tsv config files. The user should explicitly use --psm 1 if that is desired.
- The list of available languages and scripts is now sorted alphabetically.
- Changed the default of parameter unlv_tilde_crunching to false, because the previous default caused issues with unlv output in Tesseract 4.
- Removed obsolete code.
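Three of the command-line changes above (the 'osd' default for psm 0, the explicit --psm 1, and the image list file) can be sketched as follows. The helper names and file names are assumptions for this example, and the code only builds argument lists and list-file text; it does not invoke tesseract.

```python
def image_list_text(images):
    """Return the contents of an input list file: one image path per line."""
    return "".join(f"{name}\n" for name in images)


def batch_pdf_command(list_file, output_base, psm=1):
    """Command for a list file of images, with the page segmentation mode given explicitly.

    The pdf config no longer sets tessedit_pageseg_mode 1, so --psm 1 is passed here.
    """
    return ["tesseract", list_file, output_base, "--psm", str(psm), "pdf"]


def osd_command(image):
    """Command for orientation and script detection only (psm 0).

    No -l option is given because, per the changelog, the command line
    interface now defaults to the 'osd' traineddata for this mode.
    """
    return ["tesseract", image, "stdout", "--psm", "0"]


if __name__ == "__main__":
    print(image_list_text(["scan-001.tif", "scan-002.tif"]), end="")
    print(" ".join(batch_pdf_command("pages.txt", "book")))
    print(" ".join(osd_command("scan-001.tif")))
```

Writing the text from image_list_text() to a file such as pages.txt gives the list file that batch_pdf_command() refers to.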
diffstat:
graphics/tesseract/Makefile | 8 +-
graphics/tesseract/PLIST | 105 +++++-----------
graphics/tesseract/distinfo | 21 +-
graphics/tesseract/patches/patch-tessdata_Makefile.am | 10 +-
graphics/tesseract/patches/patch-viewer_scrollview.cpp | 14 --
5 files changed, 52 insertions(+), 106 deletions(-)
diffs (truncated from 359 to 300 lines):
diff -r 0c6bc62b8a10 -r 0af26b935f60 graphics/tesseract/Makefile
--- a/graphics/tesseract/Makefile Sat Nov 03 08:07:31 2018 +0000
+++ b/graphics/tesseract/Makefile Sat Nov 03 09:13:07 2018 +0000
@@ -1,7 +1,6 @@
-# $NetBSD: Makefile,v 1.39 2018/07/20 03:34:16 ryoon Exp $
+# $NetBSD: Makefile,v 1.40 2018/11/03 09:13:07 adam Exp $
-DISTNAME= tesseract-3.05.02
-PKGREVISION= 1
+DISTNAME= tesseract-4.0.0
CATEGORIES= graphics
MASTER_SITES= ${MASTER_SITE_GITHUB:=tesseract-ocr/}
DISTFILES= ${DEFAULT_DISTFILES}
@@ -11,7 +10,7 @@
COMMENT= Open Source OCR Engine
LICENSE= apache-2.0
-LANGVER= 3.04.00
+LANGVER= 4.0.0
DISTFILES+= tessdata-${LANGVER}${EXTRACT_SUFX}
SITES.tessdata-${LANGVER}.tar.gz= -${MASTER_SITES:Q}tessdata/archive/${LANGVER}.tar.gz
@@ -22,7 +21,6 @@
CONFIGURE_ENV+= LIBLEPT_HEADERSDIR=${BUILDLINK_PREFIX.leptonica}/include
INSTALL_TARGET= install training-install
-INSTALLATION_DIRS= libexec share/doc/tesseract share/tesseract
post-extract:
${MV} ${WRKDIR}/tessdata-${LANGVER}/* ${WRKSRC}/tessdata
diff -r 0c6bc62b8a10 -r 0af26b935f60 graphics/tesseract/PLIST
--- a/graphics/tesseract/PLIST Sat Nov 03 08:07:31 2018 +0000
+++ b/graphics/tesseract/PLIST Sat Nov 03 09:13:07 2018 +0000
@@ -1,65 +1,47 @@
-@comment $NetBSD: PLIST,v 1.9 2017/02/21 17:51:18 fhajny Exp $
+@comment $NetBSD: PLIST,v 1.10 2018/11/03 09:13:07 adam Exp $
bin/ambiguous_words
bin/classifier_tester
bin/cntraining
+bin/combine_lang_model
bin/combine_tessdata
bin/dawg2wordlist
+bin/language-specific.sh
+bin/lstmeval
+bin/lstmtraining
+bin/merge_unicharsets
bin/mftraining
bin/set_unicharset_properties
bin/shapeclustering
bin/tesseract
+bin/tesstrain.sh
+bin/tesstrain_utils.sh
bin/text2image
bin/unicharset_extractor
bin/wordlist2dawg
include/tesseract/apitypes.h
include/tesseract/baseapi.h
-include/tesseract/basedir.h
include/tesseract/capi.h
-include/tesseract/errcode.h
-include/tesseract/fileerr.h
include/tesseract/genericvector.h
include/tesseract/helpers.h
include/tesseract/host.h
include/tesseract/ltrresultiterator.h
-include/tesseract/memry.h
-include/tesseract/ndminx.h
include/tesseract/ocrclass.h
include/tesseract/osdetect.h
include/tesseract/pageiterator.h
-include/tesseract/params.h
include/tesseract/platform.h
include/tesseract/publictypes.h
include/tesseract/renderer.h
include/tesseract/resultiterator.h
include/tesseract/serialis.h
include/tesseract/strngs.h
+include/tesseract/tess_version.h
include/tesseract/tesscallback.h
include/tesseract/thresholder.h
include/tesseract/unichar.h
-include/tesseract/unicharmap.h
-include/tesseract/unicharset.h
lib/libtesseract.la
lib/pkgconfig/tesseract.pc
-man/man1/ambiguous_words.1
-man/man1/cntraining.1
-man/man1/combine_tessdata.1
-man/man1/dawg2wordlist.1
-man/man1/mftraining.1
-man/man1/shapeclustering.1
-man/man1/tesseract.1
-man/man1/unicharset_extractor.1
-man/man1/wordlist2dawg.1
-man/man5/unicharambigs.5
-man/man5/unicharset.5
share/tessdata/afr.traineddata
share/tessdata/amh.traineddata
-share/tessdata/ara.cube.bigrams
-share/tessdata/ara.cube.fold
-share/tessdata/ara.cube.lm
-share/tessdata/ara.cube.nn
-share/tessdata/ara.cube.params
-share/tessdata/ara.cube.size
-share/tessdata/ara.cube.word-freq
share/tessdata/ara.traineddata
share/tessdata/asm.traineddata
share/tessdata/aze.traineddata
@@ -68,12 +50,15 @@
share/tessdata/ben.traineddata
share/tessdata/bod.traineddata
share/tessdata/bos.traineddata
+share/tessdata/bre.traineddata
share/tessdata/bul.traineddata
share/tessdata/cat.traineddata
share/tessdata/ceb.traineddata
share/tessdata/ces.traineddata
share/tessdata/chi_sim.traineddata
+share/tessdata/chi_sim_vert.traineddata
share/tessdata/chi_tra.traineddata
+share/tessdata/chi_tra_vert.traineddata
share/tessdata/chr.traineddata
share/tessdata/configs/ambigs.train
share/tessdata/configs/api_config
@@ -86,6 +71,8 @@
share/tessdata/configs/kannada
share/tessdata/configs/linebox
share/tessdata/configs/logfile
+share/tessdata/configs/lstm.train
+share/tessdata/configs/lstmdebug
share/tessdata/configs/makebox
share/tessdata/configs/pdf
share/tessdata/configs/quiet
@@ -94,21 +81,15 @@
share/tessdata/configs/tsv
share/tessdata/configs/txt
share/tessdata/configs/unlv
+share/tessdata/cos.traineddata
share/tessdata/cym.traineddata
share/tessdata/dan.traineddata
share/tessdata/dan_frak.traineddata
share/tessdata/deu.traineddata
share/tessdata/deu_frak.traineddata
+share/tessdata/div.traineddata
share/tessdata/dzo.traineddata
share/tessdata/ell.traineddata
-share/tessdata/eng.cube.bigrams
-share/tessdata/eng.cube.fold
-share/tessdata/eng.cube.lm
-share/tessdata/eng.cube.nn
-share/tessdata/eng.cube.params
-share/tessdata/eng.cube.size
-share/tessdata/eng.cube.word-freq
-share/tessdata/eng.tesseract_cube.nn
share/tessdata/eng.traineddata
share/tessdata/eng.user-patterns
share/tessdata/eng.user-words
@@ -117,50 +98,33 @@
share/tessdata/equ.traineddata
share/tessdata/est.traineddata
share/tessdata/eus.traineddata
+share/tessdata/fao.traineddata
share/tessdata/fas.traineddata
+share/tessdata/fil.traineddata
share/tessdata/fin.traineddata
-share/tessdata/fra.cube.bigrams
-share/tessdata/fra.cube.fold
-share/tessdata/fra.cube.lm
-share/tessdata/fra.cube.nn
-share/tessdata/fra.cube.params
-share/tessdata/fra.cube.size
-share/tessdata/fra.cube.word-freq
-share/tessdata/fra.tesseract_cube.nn
share/tessdata/fra.traineddata
share/tessdata/frk.traineddata
share/tessdata/frm.traineddata
+share/tessdata/fry.traineddata
+share/tessdata/gla.traineddata
share/tessdata/gle.traineddata
share/tessdata/glg.traineddata
share/tessdata/grc.traineddata
share/tessdata/guj.traineddata
share/tessdata/hat.traineddata
share/tessdata/heb.traineddata
-share/tessdata/hin.cube.bigrams
-share/tessdata/hin.cube.fold
-share/tessdata/hin.cube.lm
-share/tessdata/hin.cube.nn
-share/tessdata/hin.cube.params
-share/tessdata/hin.cube.word-freq
-share/tessdata/hin.tesseract_cube.nn
share/tessdata/hin.traineddata
share/tessdata/hrv.traineddata
share/tessdata/hun.traineddata
+share/tessdata/hye.traineddata
share/tessdata/iku.traineddata
share/tessdata/ind.traineddata
share/tessdata/isl.traineddata
-share/tessdata/ita.cube.bigrams
-share/tessdata/ita.cube.fold
-share/tessdata/ita.cube.lm
-share/tessdata/ita.cube.nn
-share/tessdata/ita.cube.params
-share/tessdata/ita.cube.size
-share/tessdata/ita.cube.word-freq
-share/tessdata/ita.tesseract_cube.nn
share/tessdata/ita.traineddata
share/tessdata/ita_old.traineddata
share/tessdata/jav.traineddata
share/tessdata/jpn.traineddata
+share/tessdata/jpn_vert.traineddata
share/tessdata/kan.traineddata
share/tessdata/kat.traineddata
share/tessdata/kat_old.traineddata
@@ -168,20 +132,26 @@
share/tessdata/khm.traineddata
share/tessdata/kir.traineddata
share/tessdata/kor.traineddata
+share/tessdata/kor_vert.traineddata
share/tessdata/kur.traineddata
+share/tessdata/kur_ara.traineddata
share/tessdata/lao.traineddata
share/tessdata/lat.traineddata
share/tessdata/lav.traineddata
share/tessdata/lit.traineddata
+share/tessdata/ltz.traineddata
share/tessdata/mal.traineddata
share/tessdata/mar.traineddata
share/tessdata/mkd.traineddata
share/tessdata/mlt.traineddata
+share/tessdata/mon.traineddata
+share/tessdata/mri.traineddata
share/tessdata/msa.traineddata
share/tessdata/mya.traineddata
share/tessdata/nep.traineddata
share/tessdata/nld.traineddata
share/tessdata/nor.traineddata
+share/tessdata/oci.traineddata
share/tessdata/ori.traineddata
share/tessdata/osd.traineddata
share/tessdata/pan.traineddata
@@ -189,35 +159,26 @@
share/tessdata/pol.traineddata
share/tessdata/por.traineddata
share/tessdata/pus.traineddata
+share/tessdata/que.traineddata
share/tessdata/ron.traineddata
-share/tessdata/rus.cube.fold
-share/tessdata/rus.cube.lm
-share/tessdata/rus.cube.nn
-share/tessdata/rus.cube.params
-share/tessdata/rus.cube.size
-share/tessdata/rus.cube.word-freq
share/tessdata/rus.traineddata
share/tessdata/san.traineddata
share/tessdata/sin.traineddata
share/tessdata/slk.traineddata
share/tessdata/slk_frak.traineddata
share/tessdata/slv.traineddata
-share/tessdata/spa.cube.bigrams
-share/tessdata/spa.cube.fold
-share/tessdata/spa.cube.lm
-share/tessdata/spa.cube.nn
-share/tessdata/spa.cube.params
-share/tessdata/spa.cube.size
-share/tessdata/spa.cube.word-freq
+share/tessdata/snd.traineddata
share/tessdata/spa.traineddata
share/tessdata/spa_old.traineddata
share/tessdata/sqi.traineddata
share/tessdata/srp.traineddata
share/tessdata/srp_latn.traineddata
+share/tessdata/sun.traineddata
share/tessdata/swa.traineddata
share/tessdata/swe.traineddata
share/tessdata/syr.traineddata
share/tessdata/tam.traineddata
+share/tessdata/tat.traineddata
share/tessdata/tel.traineddata
share/tessdata/tessconfigs/batch
share/tessdata/tessconfigs/batch.nochop
@@ -229,6 +190,7 @@
share/tessdata/tgl.traineddata
share/tessdata/tha.traineddata
share/tessdata/tir.traineddata
+share/tessdata/ton.traineddata
share/tessdata/tur.traineddata
share/tessdata/uig.traineddata
share/tessdata/ukr.traineddata
@@ -237,3 +199,4 @@
share/tessdata/uzb_cyrl.traineddata
share/tessdata/vie.traineddata
share/tessdata/yid.traineddata
+share/tessdata/yor.traineddata
diff -r 0c6bc62b8a10 -r 0af26b935f60 graphics/tesseract/distinfo
--- a/graphics/tesseract/distinfo Sat Nov 03 08:07:31 2018 +0000
+++ b/graphics/tesseract/distinfo Sat Nov 03 09:13:07 2018 +0000
@@ -1,12 +1,11 @@
-$NetBSD: distinfo,v 1.18 2018/06/22 09:50:16 adam Exp $
+$NetBSD: distinfo,v 1.19 2018/11/03 09:13:07 adam Exp $