- HTML, HTM
- ATOM, RSS
- Markdown
- EPUB
- XML, XSL
- DOC, DOCX
- ODT, OTT (experimental)
- RTF
- XLS, XLSX, XLSB, XLSM, XLTX
- CSV
- ODS, OTS
- PPTX, POTX
- ODP, OTP
- ODG, OTG
- PNG, JPG, GIF
- application/javascript
- All text/* mime-types.
In almost all cases above, what textract cares about is the mime type. So .html and .htm, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for 5 other file types.
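The mime-type-first behaviour described above can be pictured with a small sketch. The tables and function names below are hypothetical and cover only a handful of extensions; they are not textract's internal data or API.

```javascript
// Hypothetical extension-to-mime table (illustration only, not textract's own data).
const MIME_BY_EXTENSION = {
  '.html': 'text/html',
  '.htm': 'text/html',
  '.xls': 'application/vnd.ms-excel',
  '.xlt': 'application/vnd.ms-excel',
  '.csv': 'text/csv',
  '.js': 'application/javascript',
};

// Hypothetical set of non-text mime types that some extractor has registered for.
const SUPPORTED_MIME_TYPES = new Set([
  'application/vnd.ms-excel',
  'application/javascript',
]);

// Return the lowercased extension of a file name, including the dot.
function extensionOf(fileName) {
  const dot = fileName.lastIndexOf('.');
  return dot === -1 ? '' : fileName.slice(dot).toLowerCase();
}

// Decide extractability from the mime type alone, never from the extension itself.
function canExtract(fileName) {
  const mimeType = MIME_BY_EXTENSION[extensionOf(fileName)];
  if (!mimeType) return false;
  // Every text/* mime type is treated as extractable.
  return mimeType.startsWith('text/') || SUPPORTED_MIME_TYPES.has(mimeType);
}
```

In this sketch, `canExtract('page.htm')` and `canExtract('page.html')` take the same path because both extensions resolve to `text/html`.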
Does textract not extract from files of the type you need? Add an issue or submit a pull request. In many cases textract is already capable; it is just not paying attention to the mime type you may be interested in.
npm install textract
Note: if any of the requirements below are missing, textract will still run and extract every file type it is capable of handling. Not having these items installed does not prevent you from using textract; it only prevents you from extracting those specific file types.
- PDF extraction requires pdftotext be installed.
- DOC extraction requires antiword be installed, unless on OSX, in which case textutil (installed by default) is used.
- RTF extraction requires unrtf be installed, unless on OSX, in which case textutil (installed by default) is used.
- PNG, JPG and GIF require tesseract to be available. Images need to be pretty clear, high DPI and made almost entirely of just text for tesseract to be able to accurately extract the text.
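The requirements above can be summarized as a small lookup. This is only a sketch of the rules as stated here; `requiredTool` is a hypothetical helper, not part of textract, and it assumes the `platform` argument follows Node's `process.platform` values, where `'darwin'` means OSX.

```javascript
// Hypothetical helper: name the external tool the requirements list calls for.
// `platform` is expected to be a process.platform value ('darwin' means OSX).
function requiredTool(extension, platform) {
  const ext = extension.toLowerCase().replace(/^\./, '');
  const onOsx = platform === 'darwin';
  switch (ext) {
    case 'pdf':
      return 'pdftotext';
    case 'doc':
      return onOsx ? 'textutil' : 'antiword';
    case 'rtf':
      return onOsx ? 'textutil' : 'unrtf';
    case 'png':
    case 'jpg':
    case 'gif':
      return 'tesseract';
    default:
      return null; // no external tool is listed for this type
  }
}
```

Calling `requiredTool('doc', process.platform)` before extraction gives a quick hint about which binary to look for on the current machine.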
Configuration can be passed into textract. The following configuration options are available:
- preserveLineBreaks: When using the command line this is set to true to preserve stdout readability. When using the library via node this is set to false. Pass this in as true and textract will not strip any line breaks.
- preserveOnlyMultipleLineBreaks: Some extractors, like PDF, insert line breaks at the end of every line, even in the middle of a sentence. If this option (default false) is set to true, then any instances of a single line break are removed but multiple line breaks are preserved. Check your output with this option, though: it doesn't preserve paragraphs unless there are multiple breaks.
- exec: Some extractors (doc) use node's exec functionality. This setting allows for providing config to exec execution. One reason you might want to provide this config is if you are dealing with very large files. You might want to increase the exec maxBuffer setting.
- [ext].exec: Each extractor can take specific exec config. Keep in mind many extractors are responsible for extracting multiple types, so, for instance, the odt extractor is what you would configure for odt and odg/odt etc. Check the extractors to see which you want to specifically configure. At the bottom of each is a list of types for which the extractor is responsible.
- tesseract.lang: A pass-through to tesseract allowing for setting of language for extraction. ex: { tesseract: { lang: "chi_sim" } }
- tesseract.cmd: tesseract.lang allows a quick means to provide the most popular tesseract option, but if you need to configure more options, you can simply pass cmd. cmd is the string that matches the command-line options you want to pass to tesseract. For instance, to provide language and psm, you would pass { tesseract: { cmd: "-l chi_sim --psm 10" } }
- pdftotextOptions: This is a proxy options object to the library textract uses for pdf extraction: pdf-text-extract. Options include ownerPassword, userPassword if you are extracting text from password protected PDFs. IMPORTANT: textract modifies the pdf-text-extract layout default so that, instead of layout: layout, it uses layout: raw. It is not suggested you modify this without understanding what trouble that might get you in. See this GH issue for why textract overrides that library's default.
- typeOverride: Used with fromUrl, if set, rather than using the content-type from the URL request, will use the provided typeOverride.
- includeAltText: When extracting HTML, whether or not to include alt text with the extracted text. By default this is false.
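To make the effect of preserveOnlyMultipleLineBreaks concrete, here is a minimal sketch of the idea. It is not textract's actual implementation: `collapseSingleLineBreaks` is a hypothetical function, and replacing a lone break with a space (rather than deleting it outright) is an assumption made so that words on adjacent lines do not run together.

```javascript
// Hypothetical sketch: drop single line breaks, keep runs of two or more.
// Assumption: a lone break becomes a space so adjacent words stay separated.
function collapseSingleLineBreaks(text) {
  // Match a newline that has a non-newline character before it
  // and is not followed by another newline.
  return text.replace(/([^\n])\n(?!\n)/g, '$1 ');
}

const extracted = 'This sentence was\nwrapped by the PDF.\n\nA new paragraph.';
const cleaned = collapseSingleLineBreaks(extracted);
// cleaned === 'This sentence was wrapped by the PDF.\n\nA new paragraph.'
```

Paragraphs survive only where the extractor emitted at least two consecutive breaks, which is why the option's description suggests checking the output.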
To use this configuration at the command line, prefix each option with --.
Ex: textract image.png --tesseract.lang=deu
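The relationship between a nested config object and its command-line form can be sketched as follows. `toCliFlags` is a hypothetical helper, not something textract ships, and the `--name=value` shape for every option (including booleans) is an assumption extrapolated from the single example above.

```javascript
// Hypothetical helper: flatten a nested config object into dotted CLI flags.
// Assumption: every option is written as --name=value, as in the example above.
function toCliFlags(config, prefix = '') {
  const flags = [];
  for (const [key, value] of Object.entries(config)) {
    const name = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === 'object') {
      // Nested objects contribute a dotted prefix, e.g. tesseract.lang.
      flags.push(...toCliFlags(value, name));
    } else {
      flags.push(`--${name}=${value}`);
    }
  }
  return flags;
}

const flags = toCliFlags({ tesseract: { lang: 'deu' }, preserveLineBreaks: true });
// flags is ['--tesseract.lang=deu', '--preserveLineBreaks=true']
```

Values that contain spaces, such as a tesseract.cmd string, would still need quoting by the shell; the sketch does not handle that.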
import { extractFromBuffer, extractFromFile } from 'textract';
extractFromBuffer(contentBuffer, mimeType, options?);
// or
extractFromFile("/path/to/file.docx", mimeType?, options?);
brew install tesseract tesseract-lang
NOTE! The Word processing results are inconsistent between OSX and Linux (different utils are used), so the tests themselves are relaxed to accommodate both cases.