Sunday, January 19, 2003

SWISH-E

I’ve been looking for a new search engine for ATPM, and right now the leading candidate is SWISH-E. Here are some of the reasons I like it:

There are many free search engines, but ones that have this combination of features are rare (judging from my quick search). Installation took a while, as I had to first install libxml2 and Xpdf, and then wrestle with why SWISH-E couldn’t find pdftotext or pdfinfo even though they were in the path. Once installed, it seems to work well. By tomorrow it should be done indexing, and then I can try some real tests.
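For anyone who hits a similar lookup problem, one thing worth checking is whether a Perl process can see the two Xpdf tools on its own PATH. The following is a minimal sketch under stated assumptions, not part of SWISH-E: `find_tool` is a hypothetical helper, and the check assumes a Unix-like system where the executables have no file extension. It does not explain why the lookup failed in this case; it only reports what the current environment resolves.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the first executable named $tool found in
# @dirs (defaults to the directories in the current PATH), or undef.
sub find_tool {
    my ( $tool, @dirs ) = @_;
    @dirs = File::Spec->path unless @dirs;
    for my $dir (@dirs) {
        my $candidate = File::Spec->catfile( $dir, $tool );
        return $candidate if -f $candidate && -x _;
    }
    return;
}

# Report on the two Xpdf tools mentioned above.
for my $tool (qw(pdftotext pdfinfo)) {
    my $found = find_tool($tool);
    print "$tool: ", ( defined $found ? $found : 'not found on PATH' ), "\n";
}
```

Running the script from the same environment that launches the indexer (a cron job or a web server, for example) shows the PATH that process actually inherits, which can differ from an interactive shell.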

5 Comments

Dear Michael,

I am also struggling with Swish-e and PDF indexing.

I am not a Linux programmer, though I can look around on the net a bit and configure things in Linux.

My problem with Swish-e is that I don't know how to get started. I have read the docs and looked through the prog-bin dir for examples. Could you please provide sample conf files for PDF indexing, along with any small tidbits of info you can offer, so that I can index my site as well?

Please help.

Regards

dharmesh

Here's my spider.conf file:

# Example spider configuration file to index the
# split version of the swish-e documentation

@servers = (
      {
        base_url    => 'http://www.atpm.com',
        same_hosts  => [ qw!atpm.com http://www.atpm.com/ www.atpm.com/! ],
        email       => 'swish-e@atpm.com',
        delay_min   => .0001,

        # Define call-back functions to fine-tune the spider
        test_url    => sub {
          my $uri = shift;
          return 1;  # otherwise, ok to search
        },

        # Only index text/html, text/plain, or application/pdf
        test_response   => sub {
            my ( $uri, $server, $response ) = @_;
            return $response->content_type =~ m[(?:text/html|text/plain|application/pdf)];
        },

        # Convert PDF responses to HTML before indexing
        filter_content  => [ \&pdf ],
      },
  );

use pdf2html;  # included example pdf converter module

sub pdf {
   my ( $uri, $server, $response, $content_ref ) = @_;

   # Leave anything that is not a PDF untouched
   return 1 unless $response->content_type eq 'application/pdf';

   # for logging counts
   $server->{counts}{'PDF transformed'}++;

   # Replace the PDF bytes with the converted HTML
   $$content_ref = ${pdf2html( $content_ref, 'title' )};

   # Squeeze runs of spaces down to a single space
   $$content_ref =~ tr/ / /s;

   return 1;
}

1;
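The test_url callback in the file above accepts every URL. As a hedged illustration of how such a callback can narrow the crawl, the fragment below skips common image files. The extension list is an assumption chosen for the example and is not part of the original configuration. spider.pl hands the callback a URI object, so its path method is available.

```perl
        # Hypothetical replacement for the test_url entry in @servers.
        test_url    => sub {
            my $uri = shift;    # URI object supplied by spider.pl

            # Assumed skip list: do not fetch common image files
            return 0 if $uri->path =~ /\.(?:gif|jpe?g|png)$/i;

            return 1;  # otherwise, ok to search
        },
```

Returning a false value tells the spider not to request that URL at all, which saves the fetch; test_response, by contrast, runs only after the response headers have arrived.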

Don't bother with Swish-e if you're using Windows. I've been trying to get it running for the last two days and am giving up now. It's easy to get working from the command line, but impossible to get running from a web site. I finally got it running: no results. No help in the docs or on the site. delete *.*

Where can I find an easy-to-follow tutorial on Swish-e? I just want to use it to spider my site and give me my metadata only. From what I've seen it can do that, but I haven't a clue how to go about it.
