Unofficial empeg BBS

#291088 - 03/12/2006 22:37 Document Indexer - Any ideas?
Mach
old hand

Registered: 15/07/2002
Posts: 828
Loc: Texas, USA
I've got 100 (or so) 50-page MS Word documents. They're actually project proposals for software implementations. I'd like to index them in some smart way so that when our sales guys have to respond to a future proposal request, they can easily search for how someone else responded in a previous proposal without knowing which document the response is in.

The challenge is that the documents are not the same in structure, questions, or actual response. They are similar enough that you could say page 2, paragraph 3 of document 1 is similar in intent to page 32, paragraphs 5 and 6 of document 5.

First, is there a technology/name for what I'm trying to describe? Short of tagging each paragraph with keywords, is there a way to do this, either manually or automatically? I'd really like to find a search tool that could be trained to say questions of type A are connected to responses of type B.

I've done some googling but I'm not exactly sure what to call what I'm looking for. Any ideas?
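
For anyone wondering what "paragraph X is similar in intent to paragraph Y" might look like in code, here is a minimal sketch using TF-IDF weighting and cosine similarity. It only measures word overlap, not true intent, and it assumes the paragraphs have already been extracted from the Word files as plain text. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_similar`) are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(paragraphs):
    """Return one sparse {term: weight} dict per paragraph."""
    token_lists = [tokenize(p) for p in paragraphs]
    n = len(token_lists)
    # Document frequency: in how many paragraphs does each term appear?
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF so a term found in every paragraph still gets weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(paragraphs, query_index):
    """Rank all other paragraphs by similarity to paragraphs[query_index]."""
    vectors = tfidf_vectors(paragraphs)
    scores = [
        (cosine(vectors[query_index], v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    # Hypothetical sample paragraphs, not taken from any real proposal.
    sample = [
        "We provide 24/7 support with a dedicated help desk.",
        "Our help desk offers support around the clock.",
        "Pricing is based on the number of licensed users.",
    ]
    for score, index in rank_similar(sample, 0):
        print(f"{score:.2f}  {sample[index]}")
```

With real proposals you would probably feed in every paragraph from every document and keep track of which file and page each one came from.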

#291089 - 04/12/2006 00:10 Re: Document Indexer - Any ideas? [Re: Mach]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
My first thought is to convert them all to PDF and then create a PDF catalog.
_________________________
Bitt Faulk

#291090 - 04/12/2006 01:19 Re: Document Indexer - Any ideas? [Re: wfaulk]
Mach
old hand

Registered: 15/07/2002
Posts: 828
Loc: Texas, USA
Thanks Bitt! I need to upgrade my version of Acrobat but that should get me close.

#291091 - 04/12/2006 01:30 Re: Document Indexer - Any ideas? [Re: Mach]
cushman
veteran

Registered: 21/01/2002
Posts: 1380
Loc: Erie, CO
Indexing Word documents for plain search or natural language search is fairly easy; there are many packages, commercial and open, that will do this. I think what you are asking for is a way to identify concepts and relationships between segments of a document, say paragraphs that have 50+ words. I have only personally seen one software package that will do this, IBM's Content Discovery Server, formally iPhrase. The idea is that you can try to define "buckets" of concepts, and this software can sort incoming segments of text into those buckets. The downside is that it won't readily identify concepts outside of your defined set.
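
To make the "buckets" idea concrete, here is a toy sketch. It is not how the IBM product works internally; it just assigns a segment to whichever hand-defined bucket shares the most keywords with it. The bucket names and keyword sets are invented for illustration.

```python
import re

# Hypothetical buckets; a real set would come from reading the proposals.
BUCKETS = {
    "support": {"support", "helpdesk", "sla", "maintenance"},
    "pricing": {"price", "pricing", "license", "cost", "fee"},
    "security": {"security", "encryption", "authentication", "audit"},
}


def assign_bucket(segment, buckets=BUCKETS):
    """Return the bucket with the most keyword hits, or None if nothing matches."""
    words = set(re.findall(r"[a-z]+", segment.lower()))
    best_name, best_hits = None, 0
    for name, keywords in buckets.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name


if __name__ == "__main__":
    print(assign_bucket("Our pricing includes a per-seat license fee."))
    print(assign_bucket("The team enjoys hiking."))
```

The second call returns `None`, which is the same downside in miniature: text about a concept nobody defined a bucket for goes unclassified.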

IBM has also donated an open source method for indexing concepts, hosted at the Apache UIMA site. I haven't looked at this too closely, but it probably wouldn't be easy or fast to set up in a production system and get something useful out of it.

Your best bet is probably a search engine that will index Word documents and perform some basic natural language search techniques, like word variants, spell checking and stemming. When your sales guys want other documents to look at, search through the previous ones with the same keywords and present snippets of the documents in the results so they can quickly find what they are looking for. Generally a commercial natural language search engine will perform better than any free one.
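
As a rough illustration of keyword search with stemming and result snippets, here is a minimal sketch. The suffix-stripping `stem` function is deliberately crude (a real engine would use something like a Porter stemmer), the text is assumed to be plain ASCII already extracted from the Word files, and the file names are hypothetical.

```python
import re
from collections import defaultdict


def stem(word):
    """Very crude stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[stem(word)].add(name)
    return index


def search(index, docs, query, width=40):
    """Return (name, snippet) for documents containing every query stem."""
    stems = [stem(w) for w in re.findall(r"[a-z]+", query.lower())]
    if not stems:
        return []
    matches = set.intersection(*(index.get(s, set()) for s in stems))
    results = []
    for name in sorted(matches):
        text = docs[name]
        # Build the snippet around the first word matching the first query stem.
        for m in re.finditer(r"[a-z]+", text.lower()):
            if stem(m.group()) == stems[0]:
                start = max(0, m.start() - width)
                end = min(len(text), m.end() + width)
                results.append((name, text[start:end]))
                break
    return results


if __name__ == "__main__":
    docs = {
        "a.doc": "We are migrating legacy databases to the new platform.",
        "b.doc": "Training is provided on site.",
    }
    index = build_index(docs)
    for name, snippet in search(index, docs, "migrated database"):
        print(name, "->", snippet)
```

Note that the query "migrated database" still finds "migrating ... databases" because both sides are reduced to the same stems before matching.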
_________________________
Mark Cushman

#291092 - 04/12/2006 02:10 Re: Document Indexer - Any ideas? [Re: Mach]
altman
carpal tunnel

Registered: 19/05/1999
Posts: 3457
Loc: Palo Alto, CA
Even without a recent Acrobat, ISTR you can do multi-file searches on a folder of PDFs for words/phrases using the free Reader. Use the search button and you can select "all PDF documents in folder x".

You then get hyperlinked search results that open the right file to the right spot.

Hugo

#291093 - 04/12/2006 02:11 Re: Document Indexer - Any ideas? [Re: cushman]
Mach
old hand

Registered: 15/07/2002
Posts: 828
Loc: Texas, USA
Thanks Mark. I'll corner our web team tomorrow and see what's possible.

#291094 - 04/12/2006 02:13 Re: Document Indexer - Any ideas? [Re: Mach]
MarkH
member

Registered: 06/04/2000
Posts: 158
This also sounds like something that would be used by other professional firms, especially lawyers. Maybe try searching for "legal document indexing archive" or some such?

Regards

Mark

#291095 - 04/12/2006 07:26 Re: Document Indexer - Any ideas? [Re: Mach]
Anonymous
Unregistered


The correct course of action is to write up a project proposal in MS Word format describing a new software implementation that will categorize and index all previous project proposals. Once this project is given the green light, you'll be able to easily find relevant project proposals in a timely manner. Oh, and go ahead and get those TPS reports on my desk by this afternoon. Thanks.

#291096 - 04/12/2006 13:08 Re: Document Indexer - Any ideas? [Re: cushman]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
Quote:
IBM's Content Discovery Server, formally iPhrase

formally: following established rule
formerly: in the past
_________________________
Bitt Faulk

#291097 - 04/12/2006 14:11 Re: Document Indexer - Any ideas? [Re: wfaulk]
cushman
veteran

Registered: 21/01/2002
Posts: 1380
Loc: Erie, CO
Quote:
formally: following established rule
formerly: in the past


I did mean formerly, and I am usually good about using the correct word. Whoops.

Their quiet good about indexing data, to.
_________________________
Mark Cushman

#291098 - 04/12/2006 23:20 Re: Document Indexer - Any ideas? [Re: cushman]
Anonymous
Unregistered


Quote:
Their quiet good about indexing data, to.


too: also, in addition to, or indicating excessiveness
to: all other situations

quiet: little or no sound
quite: to a great extent

You did that on purpose, didn't you?

#291099 - 04/12/2006 23:30 Re: Document Indexer - Any ideas? [Re: ]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
Did you miss their/they're on purpose?
_________________________
Bitt Faulk

#291100 - 04/12/2006 23:32 Re: Document Indexer - Any ideas? [Re: wfaulk]
Anonymous
Unregistered


Quote:
Did you miss their/they're on purpose?


LOL, no.
