[BUG] Incorrect path for loading tesseract traineddata #1492

@ibrahim-akrab

CCExtractor version: 0.94

Necessary information

  • Is this a regression (i.e. did it work before)? {NO}
  • What platform did you use? {Linux}
  • What were the used arguments? {}

Video links

channel5-2018-02-12.ts from the TV Samples page

Additional information

ccextractor tries to load the Tesseract traineddata from the wrong location and then blames TESSDATA_PREFIX. Here is the output it produces:

Opening file: /home/ibrahim/Downloads/channel5-2018-02-12.ts
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
Error opening data file /usr/share/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Failed TessBaseAPIInit4 -1

I checked the logic in ocr.c and found that probe_tessdata_location works correctly. To verify this, I traced the syscalls it makes for each candidate tessdata location by running strace -e trace=openat ./ccextractor ~/Downloads/channel5-2018-02-12.ts. The result is as follows:

Opening file: /home/ibrahim/Downloads/channel5-2018-02-12.ts
openat(AT_FDCWD, "/home/ibrahim/Downloads/channel5-2018-02-12.ts", O_RDONLY) = 3
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
openat(AT_FDCWD, "./tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
openat(AT_FDCWD, "/usr/share/eng.traineddata", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/eng.traineddata", O_RDONLY) = -1 ENOENT (No such file or directory)
Error opening data file /usr/share/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Failed TessBaseAPIInit4 -1
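
The directory probing visible in the trace can be sketched roughly as below. This is a minimal illustration, not the actual probe_tessdata_location code from ocr.c: the function name probe_parent and its parameters are hypothetical, and it assumes the probe tests "<parent>tessdata/" with opendir (which would be consistent with the O_DIRECTORY flags in the openat calls) and reports the parent directory rather than the tessdata directory itself.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/* Hypothetical sketch, not ccextractor's real implementation.
 * Returns the first entry of `parents` for which "<parent><subdir>"
 * is an existing directory, or NULL if none matches.
 * Each parent is expected to end with '/'. */
static const char *probe_parent(const char *const *parents, size_t count,
                                const char *subdir)
{
    char candidate[4096];

    assert(parents != NULL && subdir != NULL);
    for (size_t i = 0; i < count; i++) {
        int n = snprintf(candidate, sizeof candidate, "%s%s",
                         parents[i], subdir);
        if (n < 0 || (size_t)n >= sizeof candidate)
            continue; /* path would be truncated, skip this candidate */

        DIR *dir = opendir(candidate);
        if (dir != NULL) {
            closedir(dir);
            return parents[i]; /* note: the parent, not the tessdata dir */
        }
    }
    return NULL;
}
```

With the parents "./" and "/usr/share/" and the subdirectory "tessdata/", such a probe would open the same two directories as in the trace and, on this setup, stop at the second one.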

The probe checks the candidate paths correctly and stops once it finds /usr/share/tessdata/. Tesseract then looks for eng.traineddata one directory above that, in /usr/share/, so I suspect the problem is in the TessBaseAPIInit4 call.
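
To illustrate why the data path handed to Tesseract matters, here is a minimal sketch of the path join that the error message suggests. It is an assumption drawn from the log output, not Tesseract's actual code: both helper names are hypothetical. It also assumes that Tesseract 4 and later expect the data path to name the tessdata directory itself rather than its parent, which should be verified against the Tesseract documentation for the installed version.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: mimics the join implied by the error message,
 * i.e. "<datapath><lang>.traineddata". Returns 0 on success, -1 if the
 * result does not fit in `out`. */
static int traineddata_path(const char *datapath, const char *lang,
                            char *out, size_t out_size)
{
    assert(datapath != NULL && lang != NULL && out != NULL);
    int n = snprintf(out, out_size, "%s%s.traineddata", datapath, lang);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Hypothetical helper: turns a probed parent directory such as
 * "/usr/share/" into the tessdata directory itself, adding a '/'
 * separator only when the parent does not already end with one. */
static int tessdata_dir_from_parent(const char *parent,
                                    char *out, size_t out_size)
{
    assert(parent != NULL && out != NULL);
    size_t len = strlen(parent);
    const char *sep = (len > 0 && parent[len - 1] == '/') ? "" : "/";
    int n = snprintf(out, out_size, "%s%stessdata/", parent, sep);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Joining "/usr/share/" with "eng" this way yields /usr/share/eng.traineddata, the path from the error message, while joining the result of tessdata_dir_from_parent("/usr/share/", ...) with "eng" yields /usr/share/tessdata/eng.traineddata. If the assumption above holds, passing the tessdata directory itself as the data path might avoid the failure.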

For reference, here is the complete output of ccextractor --version on my setup:

        Version: 0.94
        Git commit: b1cbfcea9b9c687143bf0d80bc179b563e99d025
        Compilation date: 2023-03-10
        CEA-708 decoder: Rust
        File SHA256: 03bf3b76ff69b73e18166558675278cae9b91f52acce532b80a480c6920b87f4
Libraries used by CCExtractor
        Tesseract Version: 5.3.0
        Leptonica Version: leptonica-1.82.0
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi
