SWISH-E
I’ve been looking for a new search engine for ATPM, and right now the leading candidate is SWISH-E. Here are some of the reasons I like it:
- It can index PDF documents if you have Xpdf installed. This is especially important for ATPM because our older content is not available in HTML format.
- It can build its index using a Web spider rather than by scanning the local file system. That way, it can also index the dynamic content of the pages.
- I can tell it not to index certain parts of the pages, e.g. the table of contents in the navigation bar. (There's a sketch of this after the list.)
- I can tell it not to index URLs that match particular patterns. For instance, the printing versions of the pages have URLs that end with “?print” and should not be indexed.
- It has good documentation and examples.
- It can be installed without logging in as root or writing to any directories outside $HOME.
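For the mid-page exclusions, the mechanism (if I'm reading the HTML indexing docs right) is a pair of comment commands wrapped around the part to skip. A sketch, with a made-up class name:

<!-- SwishCommand noindex -->
<div class="navbar">
  ... table of contents links ...
</div>
<!-- SwishCommand index -->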
There are many free search engines, but ones that have this combination of features are rare (judging from my quick search). Installation took a while, as I had to first install libxml2 and Xpdf, and then wrestle with why SWISH-E couldn’t find pdftotext or pdfinfo even though they were in the path. Once installed, it seems to work well. By tomorrow it should be done indexing, and then I can try some real tests.
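For reference, the no-root install was just the usual configure/make dance pointed at a home-directory prefix, something like this (glossing over the libxml2 and Xpdf builds, which work the same way):

./configure --prefix=$HOME/swish-e
make
make install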
Dear Michael,
I am also struggling with Swish-e and PDF indexing.
I am not a Linux programmer, though I can poke around on the net and configure things in Linux.
My problem with Swish-e is that I don't know how to get started. I have read the docs and looked through the prog-bin directory for examples. Could you please provide sample conf files for PDF indexing, along with whatever small tidbits of advice you can offer, so that I can index my site as well?
Please help.
Regards
dharmesh
Here's my spider.conf file:
# Spider configuration for indexing www.atpm.com, adapted from the
# example that ships with the swish-e documentation.

@servers = (
    {
        base_url   => 'http://www.atpm.com',
        same_hosts => [ qw!atpm.com www.atpm.com! ],
        email      => 'swish-e@atpm.com',
        delay_min  => .0001,

        # Call-back functions to fine-tune the spider.
        # Skip the printer-friendly pages, whose URLs end in "?print".
        test_url => sub {
            my $uri = shift;
            return 0 if defined $uri->query && $uri->query eq 'print';
            return 1;    # otherwise, OK to index
        },

        # Only fetch text/html, text/plain, or application/pdf
        test_response => sub {
            my ( $uri, $server, $response ) = @_;
            return $response->content_type =~
                m[(?:text/html|text/plain|application/pdf)];
        },

        filter_content => [ \&pdf ],
    },
);

use pdf2html;    # pdf converter module included with the swish-e examples

# Convert PDF responses to HTML so swish-e can index them;
# other content types pass through untouched.
sub pdf {
    my ( $uri, $server, $response, $content_ref ) = @_;
    return 1 unless $response->content_type eq 'application/pdf';

    # for logging counts
    $server->{counts}{'PDF transformed'}++;

    $$content_ref = ${ pdf2html( $content_ref, 'title' ) };
    $$content_ref =~ tr/ / /s;    # squash runs of spaces
    return 1;
}

1;
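To actually build the index, swish-e gets invoked with the "prog" access method and a small driver config that points at spider.pl and this spider.conf. Roughly (file names here are illustrative):

# swish.conf
IndexDir            spider.pl
SwishProgParameters spider.conf
IndexFile           ./index.swish-e

Then:

swish-e -c swish.conf -S prog                # crawl and build the index
swish-e -w 'some query' -f index.swish-e     # try a search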
Don't bother with swish-e if you're using Windows. I've been trying to get it running for the last two days and am giving up now. It's easy to get working from the command line, but impossible to get running from a web site. I finally got it running: no results, and no help in the docs or on the site. delete *.*
Where can I find an easy-to-follow tutorial on swish-e? I just want to use it to spider my site and give me my metadata only. From what I've seen, it can do this; I just haven't a clue how to go about it.