Advanced PDF Search (Full Version)






BeTheBall -> Advanced PDF Search (7/27/2005 9:44:03)

I need a script or other tool that will search through a .pdf and return a list of all web addresses and phone numbers occurring within the document. Has anyone heard of such a thing?

A member of our staff has the assignment of verifying the accuracy of all phone numbers and web addresses in our various form instructions and publications. Currently, she has to go through these items page by page and some are a couple hundred pages long. Anything I can come up with to make this process less burdensome will earn me a gold star.
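As a rough illustration of the kind of script being asked for, here is a minimal Python sketch. It assumes the text has already been pulled out of the PDF into a plain string (which only works if the PDF has a real text layer), and the two patterns are assumptions: a loose URL match and North American style phone numbers. The name find_contacts is made up for this example. It is meant to over-collect so a person can review the list, not to be a definitive extractor.

```python
import re

# Assumed pattern: anything starting with http://, https:// or www.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>()\[\]]+", re.IGNORECASE)

# Assumed pattern: North American style numbers such as
# (555) 123-4567, 555-123-4567 or 555.123.4567
PHONE_RE = re.compile(
    r"(?<!\d)(?:\(\d{3}\)\s*|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
)


def find_contacts(text):
    """Return (urls, phones) found in text, de-duplicated, in order of first appearance."""
    urls = []
    for match in URL_RE.findall(text):
        # Drop punctuation that belongs to the sentence, not the address.
        url = match.rstrip(".,;:'\"")
        if url not in urls:
            urls.append(url)

    phones = []
    for match in PHONE_RE.findall(text):
        if match not in phones:
            phones.append(match)

    return urls, phones
```

Calling find_contacts(page_text) on each page's text would give two lists to hand to whoever is doing the verification.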




mar0364 -> RE: Advanced PDF Search (7/27/2005 10:01:32)

Are the documents OCR readable?




BeTheBall -> RE: Advanced PDF Search (7/27/2005 11:00:05)

quote:

ORIGINAL: mar0364

Are the documents OCR readable?


I have no idea. Do you have a solution if they are?




rdouglass -> RE: Advanced PDF Search (7/27/2005 11:43:41)

Some older versions of Acrobat have a feature called "capture" in them that basically OCR's the doc and builds an index in it. The index is searchable with the Acrobat Reader. Unfortunately, the current versions don't have it, and Adobe sells that capability as an additional piece of software.

I have recently looked into this issue 'cause I have around 42,000 PDF doc's I need to make searchable. Current search tools like MS Index Server will only search the metadata of the document and AFAIK that has to be 'typed' in at document creation time.

I was looking into the Google "Mini" search appliance but that too requires metadata and we're not too keen on editing 42K doc's by hand. [:o]

If you do find a solution other than Adobe Capture, I would be very interested.

EDIT: If you don't know if they are or not, they probably are not. [:(]




caz -> RE: Advanced PDF Search (7/27/2005 13:01:18)

The Atomz search engine will search within pdf documents; the free version that I used to use on one site certainly did, but I think that there is a limit of 500 pages for the equivalent to the free one. I am not too sure about that because we no longer use it.




rdouglass -> RE: Advanced PDF Search (7/27/2005 13:07:11)

quote:

The Atomz search engine will search within pdf documents;


I think that's just like the Google Mini: it searches only the metadata *unless* the doc is OCR'd. Unless Atomz has an OCR engine in it, it won't be able to search the contents of a doc that hasn't been OCR'd.




mar0364 -> RE: Advanced PDF Search (7/27/2005 13:15:35)

Yes, if you have Adobe Standard and they are OCR readable, you can search them. Douglas is correct; the other option is to do a paper capture. I use Adobe Standard 6 and you can paper capture an OCR-readable document. However, if the document was scanned as an image and saved as a PDF, I don't know if there is anything that will search that.

What version of Adobe are you using?




caz -> RE: Advanced PDF Search (7/27/2005 14:48:34)

Have you tried this plugin for Google Desktop: ScanSoft OmniPage Search?
It would appear to do what is needed to OCR, build a document index and then search. It would also appear that Paper Capture was last in Acrobat 4 as standard, but it is available now in Acrobat 7, as an extra I think, and combined with other features.




rdouglass -> RE: Advanced PDF Search (7/27/2005 15:02:27)

That ScanSoft Omnipage Search looks interesting but it seems to work only at the individual level and with Google Desktop. To me, that seems pretty restrictive yet it does seem to be an option.

I do still own Acrobat 4 and yes, the capture feature does work like that, and then you can search it using MS Index Server. However, it does require a lot of manual intervention.




caz -> RE: Advanced PDF Search (7/28/2005 6:57:12)

My thought was that you could run the Omnipage search on the pdfs in bulk to get a set of results, on the urls at least, and then run that set through a url checker? (The script bit is beyond me though - I would even use FP2003 link verifier [;)])

As for the phone numbers, I think that's another thing; I don't know of a phone number verifier, apart from a human dialing the number[:)]
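For the url checker half of that idea, something along these lines might do as a starting point. It is only a sketch using Python's standard urllib; the names check_url and check_urls are made up for this example, and treating a bare www. address as plain http:// is an assumption. It only tells you whether the address answers, not whether the page behind it is still the right one.

```python
import urllib.error
import urllib.request


def check_url(url, opener=urllib.request.urlopen, timeout=10):
    """Return a short status string for one URL: the HTTP code, or the error."""
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url  # assumption: bare www. addresses are plain HTTP
    try:
        with opener(url, timeout=timeout) as response:
            return str(response.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code such as 404.
        return str(err.code)
    except (urllib.error.URLError, OSError, ValueError) as err:
        # No answer at all: bad host name, timeout, malformed address, etc.
        return "error: " + str(err)


def check_urls(urls, opener=urllib.request.urlopen):
    """Check each URL in turn and return a {url: status} dictionary."""
    return {url: check_url(url, opener=opener) for url in urls}
```

Anything that comes back as 404 or "error: ..." would go on the list for a human to look at. The opener parameter is there only so the function can be tried out without touching the network.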




rdouglass -> RE: Advanced PDF Search (7/28/2005 9:16:29)

quote:

My thought was that you could run the Omnipage search on the pdfs in bulk to get a set of results, on the urls at least, and then run that set through a url checker?


Hey, that might be an idea and possibly could be 'duct taped' together if it was run at the server. Hmmm...




caz -> RE: Advanced PDF Search (7/28/2005 16:14:13)

Please let me know if that works [:D]




Charles W Davis -> RE: Advanced PDF Search (8/5/2005 20:37:11)

BeTheBall,

A site search provided by master.com will search most PDFs within a web site.

I said "most". On this web site all PDFs were created from MS Publisher using Adobe Acrobat Pro 6.0. http://www.myscacc.org/search.htm
Search for gabe (one of our contributing authors). It will return several instances.

However, a PDF created from a Quark document for a high resolution glossy magazine will not return any hits.




BeTheBall -> RE: Advanced PDF Search (8/7/2005 19:06:22)

Thanks Charles. I'll have a look.



