This program extracts the text content of many types of documents. It is based on the technology in Microsoft Index Server, which uses components called IFilters to index the text in files.
Using The Program
The program is a command-line utility that takes only two parameters: the file name of the document you want to extract text from, and the file name of the new file that should hold the extracted text.
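As a rough illustration of the two-parameter call described above, the following Python sketch builds the command line and runs the utility. The executable name TextExtractor.exe is a placeholder and not the real file name; replace it with the executable from the download. The helper names build_command and extract_text are likewise made up for this example.

```python
import subprocess
from typing import List

# Placeholder name (an assumption): use the actual executable from the download.
EXTRACTOR_EXE = "TextExtractor.exe"


def build_command(source: str, target: str, exe: str = EXTRACTOR_EXE) -> List[str]:
    """Return the argument list: executable, input document, output text file."""
    if not source or not target:
        raise ValueError("both an input and an output file name are required")
    return [exe, source, target]


def extract_text(source: str, target: str, exe: str = EXTRACTOR_EXE) -> int:
    """Run the extractor and return its exit code."""
    completed = subprocess.run(build_command(source, target, exe), check=False)
    return completed.returncode
```

Calling extract_text("report.docx", "report.txt") would then run the utility with those two file names, provided the executable can be found in the current folder or on the PATH.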
Before you can run the program, the following must be installed on your system:
- Microsoft .NET Framework 4.0.
The program consists of just a couple of executable files and doesn't require any installation. Simply unzip the downloaded files and copy them to the folder of your choice.
Extract Text From PDF Documents
The PDF filter DLL needed to extract text from PDF files was included with Adobe Reader 7.0.5 through 9.x. Starting with Adobe Reader 10, also known as Adobe Reader X, this DLL is no longer part of the Adobe Reader installation.
You can still extract text from PDF files if you run Adobe Reader X or another brand of PDF reader. Adobe has a separate download that will install the filter you need. Please follow the link below to get the IFilter from Adobe.
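The version rule above can be written as a small check. This is only a sketch: the function names are made up for illustration, and it assumes that versions older than 7.0.5 did not ship the filter, which the text implies but does not state outright.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '9.5.2' into (9, 5, 2)."""
    return tuple(int(part) for part in version.split("."))


def reader_bundles_pdf_filter(version: str) -> bool:
    """True if this Adobe Reader version falls in the 7.0.5 to 9.x range."""
    # Assumption: releases before 7.0.5 did not include the filter DLL.
    return (7, 0, 5) <= parse_version(version) < (10,)
```

If the check returns False, for example for "10.0", the separate IFilter download from Adobe is needed.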
Extract Text From Office Documents
Microsoft offers a filter pack that enables you to extract text from the following file formats: .docx, .docm, .pptx, .pptm, .xlsx, .xlsm, .xlsb, .zip, .one, .vdx, .vsd, .vss, .vst, .vsx, and .vtx.
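To decide up front whether a file is one of the formats listed above, a simple extension check can help. The sketch below is illustrative only: the names are made up for this example, and it looks only at the file extension. It does not verify that the filter pack is actually installed.

```python
from pathlib import Path

# Extensions listed above as covered by the Microsoft filter pack.
FILTER_PACK_EXTENSIONS = frozenset({
    ".docx", ".docm", ".pptx", ".pptm", ".xlsx", ".xlsm", ".xlsb",
    ".zip", ".one", ".vdx", ".vsd", ".vss", ".vst", ".vsx", ".vtx",
})


def covered_by_filter_pack(file_name: str) -> bool:
    """True if the file's extension is in the list above (case-insensitive)."""
    return Path(file_name).suffix.lower() in FILTER_PACK_EXTENSIONS
```

A file such as Budget.XLSX is reported as covered, while a PDF file is not, because PDF files need the separate Adobe filter.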
- Support for both 32 and 64 bit filters.
- Now uses Microsoft .NET 4.0 instead of 2.0.
- Improved error handling.
- Always runs in x86 mode to support more filters on 64 bit machines.
- First release.