Ganesh H S

Thoughts on open source technologies, search engine optimization, website security

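Installing wget on Mac OS

To build wget from source on Mac OS, make sure a C compiler toolchain is installed. The wget documentation lists the following packages:

  • 'gcc'
  • 'glibc-devel' (or 'libc6-dev')
  • 'make'

These are Linux package names; on Mac OS the compiler, the C library headers and make are provided by Apple's developer tools (Xcode).

Once the prerequisites are installed, follow the steps below to install wget on Mac OS: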

  1. Open the wget source directory at http://ftp.gnu.org/gnu/wget/
  2. Download wget-1.10.1.tar.gz or any later version of wget.
  3. Extract the downloaded file: $ tar -xzf wget-1.10.1.tar.gz
  4. $ cd wget-1.10.1
  5. $ ./configure
  6. $ make
  7. $ sudo make install

Reference Links -
GNU wget documentation

submit site to google yahoo dmoz msn

You had a plan for a business, you needed a website, and now the website is done. What next?

How do you tell search engines that your website exists and ask them to index it?

When I started working on Search Engine Optimization (SEO) for 3 ecommerce sites in 2006, this was the first question I had in mind.

Following are the ways of getting your website indexed by search engines:

set preferred domain

I always thought the following links were the same:
http://ganeshhs.com/search-engine-optimization-seo/noindex-nofollow
http://www.ganeshhs.com/search-engine-optimization-seo/noindex-nofollow

Both links lead to the same page; they differ only in the www. prefix.
But search engines treat them as two different URLs. When we link to our own pages, we sometimes leave out www. and sometimes include it.

So what is the impact?

  1. Search engines may keep both versions of each URL. Visitors arriving from search results, and the links they create, are then split between the two versions, which can hurt the page rank and traffic credited to either one.
  2. The two versions look like different documents to crawlers, which causes unnecessary extra crawling of our website.

How do we instruct search engines to treat both URLs as the same? Google Webmaster Tools has an option to set the preferred domain.

So what is the advantage of setting a preferred domain? If I set my preferred domain to ganeshhs.com, then the next time Google crawls my website and finds a link starting with www.ganeshhs.com, it will follow it as ganeshhs.com, and when Google displays my pages in search results it will show the links as ganeshhs.com.

It also helps with links from external sites. A few people have started linking to my website. Suppose their link is http://www.ganeshhs.com/category/search-engine-optimization-seo whereas my actual article URL is http://ganeshhs.com/category/search-engine-optimization-seo; when Google reaches our website through that link, it will keep the version of the domain that we preferred.
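
A complementary approach, which the post itself does not describe, is to redirect the www. variant to the preferred host on the server with a permanent (301) redirect. The following is only a minimal sketch in plain PHP: the function name preferredUrl and the way it is wired into a request are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns the preferred (non-www) form of a URL,
// or null when the URL does not use the "www." variant of the preferred host.
function preferredUrl($url, $preferredHost)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // not an absolute URL, nothing to normalise
    }

    // Only rewrite the "www." variant of the preferred host.
    if (strtolower($parts['host']) !== 'www.' . $preferredHost) {
        return null;
    }

    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $path   = isset($parts['path']) ? $parts['path'] : '/';
    $query  = isset($parts['query']) ? '?' . $parts['query'] : '';

    return $scheme . '://' . $preferredHost . $path . $query;
}

// Usage sketch for a web request (assumed context, shown as comments only):
// $target = preferredUrl('http://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'], 'ganeshhs.com');
// if ($target !== null) { header('Location: ' . $target, true, 301); exit; }
```

With such a redirect in place, visitors and crawlers that follow a www. link would end up on the preferred URL regardless of the search engine involved.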

noindex nofollow

The robots meta tag, <meta name="robots" content="noindex, nofollow">, tells robots not to index the content of a page and/or not to scan it for links to follow. It is useful on pages that we do not want indexed and whose links we do not want followed.

In some cases we need to link to external sites. What is the impact of this?

  1. Part of the page rank is shared with the external website:
    When we link to other websites, part of our website's page rank is passed to
    those external sites, and we may end up sending the search engine crawlers to other sites.
  2. Search engine crawlers are led away to the external website:
    A crawler enters our website to crawl more pages, which helps us get more pages indexed in search engines. By keeping external links, we give the crawler a way to leave our website and crawl the external websites instead.

We still have to keep external links, so how do we prevent the above scenario?

  • If google.com is an external link, we could use <a href="http://www.google.com" rel="nofollow">. When the crawler comes across this link, the rel attribute tells it not to follow it. Note that noindex is not a valid rel value; it belongs in the robots meta tag of a page.
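
Adding the attribute by hand is easy to forget, so links can be generated through a small helper. This is a rough sketch in plain PHP; renderLink and its parameters are hypothetical names introduced for illustration, and it treats the www. and non-www. forms of our own host as the same site.

```php
<?php
// Hypothetical helper: builds an <a> tag and adds rel="nofollow"
// when the link points to a host other than our own.
function renderLink($url, $text, $ownHost)
{
    // Returns null for relative URLs, false for badly malformed ones.
    $host = parse_url($url, PHP_URL_HOST);

    // Treat "www.example.com" and "example.com" as the same site.
    $normalise = function ($h) {
        $h = strtolower((string) $h);
        return strpos($h, 'www.') === 0 ? substr($h, 4) : $h;
    };

    $isExternal = $host !== null && $host !== false
        && $normalise($host) !== $normalise($ownHost);

    $rel = $isExternal ? ' rel="nofollow"' : '';

    return '<a href="' . htmlspecialchars($url, ENT_QUOTES) . '"' . $rel . '>'
        . htmlspecialchars($text, ENT_QUOTES) . '</a>';
}

echo renderLink('http://www.google.com', 'Google', 'ganeshhs.com'), "\n";
echo renderLink('http://ganeshhs.com/about', 'About', 'ganeshhs.com'), "\n";
```

The first call produces a link carrying rel="nofollow", while the second, which points to our own host, is left without it.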

    ganeshhs.com google page rank

    My blog site ganeshhs.com now has a Google Page Rank of 2/10.

    When I started my first project with Zend Framework in May 2007, there were very few articles and tutorials, and search engines were my first source of information. I realised it would be a great idea to get my own articles listed in search engines, so my first focus was search engine optimization.

    Looking at my website analytics, I noticed that my recent posts on Zend Lucene Search had a higher number of unique visits, which raised my website's daily traffic to an average of 100 visits. I also started getting backlinks from other websites (such as http://www.phpimpact.com/), which contributed to this page rank.

    Above all, keywords relevant to the context of the website or article help the articles get indexed by search engines. The following table lists some of the keywords I targeted in my blog articles and their positions in Google and Yahoo!:

    Keyword              Google Position   Yahoo! Position
    Zend Lucene Search   Page 1            Page 1
    Zend Auth            Page 1            Page 2
    Zend Registry        Page 1            -
    Zend Debug           Page 1            -
    Zend Exception       Page 1            -
    Zend Config          Page 1            -
    Zend Loader          Page 1            -

    Zend Lucene Search - part4 - Search Results Highlighting

    Zend_Search_Lucene_Search_Query::highlightMatches() method allows the developer to highlight HTML document terms in the context of a search query.

    In the previous article, Zend Lucene Search - part3 - retrieving the indexed data, I talked about retrieving the search results. Highlighting the searched keyword in the results is an important feature that most search engines provide. In this article I will write about highlighting the search results retrieved using Zend Lucene Search.

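    <?php
    require_once 'Zend/Search/Lucene.php';

    $queryStr = "php";

    // Parse the user input into a query object so that highlightMatches() can be called on it
    $query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);

    $index = Zend_Search_Lucene::open("/var/www/lucene-data/blog-index");

    $results = $index->find($query);

    echo "Index contains ".$index->count()." documents.\n\n";

    // Initialise the result array so that print_r() also works when there are no hits
    $data = array();

    if($index->count())
    {
        $count = 0;

        foreach ($results as $result)
        {
            $data[$count]["article_url"]               = $result->url;
            $data[$count]["article_title"]             = $query->highlightMatches($result->title);
            // Only stored fields can be read back from a hit. 'contents' is UnStored in part 1,
            // so it has to be indexed as a stored type (e.g. Text) for this line to work.
            $data[$count]["article_description"]       = $query->highlightMatches($result->contents);
            $data[$count]["article_created_date_time"] = $result->postedDateTime;
            $data[$count]["article_id"]                = $result->articleId;
            $count++;
        }
    }

    print_r($data);

    ?>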

    This program is the same as the one in Zend Lucene Search - part3 - retrieving the indexed data. The only differences are that the search string is parsed into a query object and that highlightMatches is called on the returned search results.
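
To make the effect of highlighting concrete without a Lucene index at hand, here is a rough stand-alone sketch in plain PHP. It is not how Zend_Search_Lucene implements highlightMatches; highlightTerms is a hypothetical function that simply wraps whole-word matches in <b> tags, and it is only suitable for plain text.

```php
<?php
// Illustrative stand-in for term highlighting; this is NOT the
// Zend_Search_Lucene implementation, only a plain-text approximation.
function highlightTerms($text, array $terms)
{
    foreach ($terms as $term) {
        // \b keeps "php" from matching inside "phpunit"; the "i" flag ignores case.
        $pattern = '/\b(' . preg_quote($term, '/') . ')\b/i';
        $text = preg_replace($pattern, '<b>$1</b>', $text);
    }

    return $text;
}

echo highlightTerms('Porting PHP to Javascript : php js', array('php')), "\n";
```

Because each term is replaced in turn, a later term could in principle match inside markup added for an earlier one, which is one reason to rely on the library's own highlighter for real HTML.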

    Related articles:
    Zend Lucene Search - part1 - creating index
    Zend Lucene Search - part2 - Real time indexing
    Zend Lucene Search - part3 - retrieving the indexed data

    Zend Lucene Search - part3 - retrieving the indexed data

    Once the index is created, we are ready to use Zend Lucene Search to search the website. In the following example, php is the search keyword used to fetch the relevant results from the already indexed data.

    <?php

    require_once 'Zend/Search/Lucene.php';

    $query = "php";

    $index = Zend_Search_Lucene::open("/var/www/lucene-data/blog-index");

    $results = $index->find($query);

    echo "Index contains ".$index->count()." documents.\n\n";

    // Initialise the result array so that print_r() also works when there are no hits
    $data = array();

    if($index->count())
    {
        $count = 0;

        foreach ($results as $result)
        {
            // $query is a plain string here, so the stored field values are used as they are
            $data[$count]["article_url"]               = $result->url;
            $data[$count]["article_title"]             = $result->title;
            // Only stored fields can be read back from a hit. 'contents' is UnStored in part 1,
            // so it has to be indexed as a stored type (e.g. Text) for this line to work.
            $data[$count]["article_description"]       = $result->contents;
            $data[$count]["article_created_date_time"] = $result->postedDateTime;
            $data[$count]["article_id"]                = $result->articleId;
            $count++;
        }
    }

    print_r($data);

    ?>

    To retrieve the indexed data, the first thing we need to do is open the index at the path where it was created.

    $index = Zend_Search_Lucene::open("/var/www/lucene-data/blog-index");

    Suppose the user's search input is:

    $query = "php";

    We then use the find method of Zend_Search_Lucene:

    $results = $index->find($query);

    The count method of the index returns the total number of documents in the index, not the number of hits; the number of hits for this search is count($results).

    echo "Index contains ".$index->count()." documents.\n\n";

    To limit the number of search results, we call setResultSetLimit before running find:

    $index->setResultSetLimit(10);

    Related articles:
    Zend Lucene Search - part1 - creating index
    Zend Lucene Search - part2 - Real time indexing
    Zend Lucene Search - part4 - Search Results Highlighting

    Zend Lucene Search - part2 - Real time indexing

    Part 1 created the index from existing data. A better idea is to index each item when it is created. To index data in real time, we open the index that was created earlier instead of creating a new one; everything else remains the same as discussed in Zend Lucene Search - part1.

    <?php
    require_once 'Zend/Search/Lucene.php';

    // Open the existing index instead of creating a new one
    $index = Zend_Search_Lucene::open("/var/www/lucene-data/blog-index");

    $doc = new Zend_Search_Lucene_Document();

    $doc->addField(Zend_Search_Lucene_Field::Keyword('url', "http://ganeshhs.com/url-3"));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('articleId', 3));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('postedDateTime', "2007-12-29 01:40:00"));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', "Porting PHP to Javascript : php js"));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
        "During graduation got interested in web technology, to kick start i started reading html, javascript."));
    $doc->addField(Zend_Search_Lucene_Field::Text('category', "Javascript"));

    // Add the new document to the index and write the changes
    $index->addDocument($doc);
    $index->commit();
    $index->optimize();

    Related articles:
    Zend Lucene Search - part1 - creating index
    Zend Lucene Search - part3 - retrieving the indexed data
    Zend Lucene Search - part4 - Search Results Highlighting

    Zend Lucene Search - part1 - creating index

    In this article I will discuss creating an index using Zend Lucene Search.

    Conventionally, most site search is database driven.

    Let's consider my blog site. If someone comes to my site and searches for a keyword, I have to look into the articles table and the comments table to return results. Executing SQL queries against 2 tables is acceptable, but in an e-commerce application we may have to search across a lot of categories and products, and since such database queries are costly, they consume more resources. Another important point is that we cannot easily return the most relevant results first; in general, we cannot rank the search results.

    Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is used by many Web 2.0 websites. Zend_Search_Lucene was derived from the Apache Lucene project.

    <?php
    // Index the blog articles

    require_once 'Zend/Search/Lucene.php';

    $articlesData = array(
        0 => array(
            "url"            => "http://ganeshhs.com/url-1",
            "title"          => "Google suggest : pick right search keyword",
            "contents"       => "Picking the right keywords for the websites is the success of search engine marketing. When i started search engine optimization, i used to use overture keyword selector tool and check the search counts what other users have searched. ",
            "category"       => "Google",
            "postedDateTime" => "2007-12-26 12:20:00",
            "articleId"      => 1),
        1 => array(
            "url"            => "http://ganeshhs.com/url-2",
            "title"          => "zend framework tutorial | part 9 Zend Auth",
            "contents"       => "Zend Auth is easy to set up and provides a system that secures our site with an easy to use  authentication mechanism. Zend Auth(Zend_Auth) provides an API for authentication. ",
            "category"       => "zend-framework",
            "postedDateTime" => "2007-12-26 12:20:00",
            "articleId"      => 2));

    if(is_array($articlesData) && count($articlesData))
    {
        // Create a new index at the given path
        $index = Zend_Search_Lucene::create('/var/www/lucene-data/blog-index');

        foreach($articlesData as $articleData)
        {
            // One Lucene document per blog article
            $doc = new Zend_Search_Lucene_Document();

            $doc->addField(Zend_Search_Lucene_Field::Keyword('url', $articleData["url"]));
            $doc->addField(Zend_Search_Lucene_Field::UnIndexed('articleId', $articleData["articleId"]));
            $doc->addField(Zend_Search_Lucene_Field::UnIndexed('postedDateTime', $articleData["postedDateTime"]));
            $doc->addField(Zend_Search_Lucene_Field::Text('title', $articleData["title"]));
            $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $articleData["contents"]));
            $doc->addField(Zend_Search_Lucene_Field::Text('category', $articleData["category"]));

            echo "\nAdding: ". $articleData["title"] ."\n";

            $index->addDocument($doc);
        }

        $index->commit();
        $index->optimize();
    }

    ?>

    $index = Zend_Search_Lucene::create('/var/www/lucene-data/blog-index');
    This specifies the path of the Zend Lucene index where the documents will be stored.

    In each iteration we create a document:

    $doc = new Zend_Search_Lucene_Document();

    Once the document is created, we need to add the fields and their contents to it:

    Since the URL is unique to the article, we index it as a Keyword field.

    We may need the article id and the creation date and time for display, but they won't be used for searching, so we store them as UnIndexed fields.

    The title is stored as a Text field.

    The content/description is indexed but not stored in the index. The description takes up more space and storing it would create a larger index on disk, so when we need to search the data but not redisplay it, the UnStored field type is preferred.

    $doc->addField(Zend_Search_Lucene_Field::Keyword('url', $articleData["url"]));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('articleId', $articleData["articleId"]));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('postedDateTime', $articleData["postedDateTime"]));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $articleData["title"]));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $articleData["contents"]));
    $doc->addField(Zend_Search_Lucene_Field::Text('category', $articleData["category"]));

    Once the document is created and fields are added we need to add the document to the index -

    $index->addDocument($doc);

    After all the iterations we can commit the index-

    $index->commit();

    Following command is used to optimize the index -

    $index->optimize();

    Understanding Field Types -

  • Keyword fields are stored and indexed, meaning that they can be searched as well as displayed in search results. They are not split up into separate words by tokenization. Enumerated database fields usually translate well to Keyword fields in Zend_Search_Lucene.
  • UnIndexed fields are not searchable, but they are returned with search hits. Database timestamps, primary keys, file system paths, and other external identifiers are good candidates for UnIndexed fields.
  • Binary fields are not tokenized or indexed, but are stored for retrieval with search hits. They can be used to store any data encoded as a binary string, such as an image icon.
  • Text fields are stored, indexed, and tokenized. Text fields are appropriate for storing information like subjects and titles that need to be searchable as well as returned with search results.
  • UnStored fields are tokenized and indexed, but not stored in the index. Large amounts of text are best indexed using this type of field. Storing data creates a larger index on disk, so if you need to search but not redisplay the data, use an UnStored field. UnStored fields are practical when using a Zend_Search_Lucene index in combination with a relational database. You can index large data fields with UnStored fields for searching, and retrieve them from your relational database by using a separate field as an identifier.

    Field Type   Stored   Indexed   Tokenized   Binary
    Keyword      yes      yes       no          no
    UnIndexed    yes      no        no          no
    Binary       yes      no        no          yes
    Text         yes      yes       yes         no
    UnStored     no       yes       yes         no
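
The table above can also be read in the other direction: starting from what a field needs to do, look up the matching type. The sketch below encodes the stored, indexed and tokenized columns as a plain PHP array. chooseFieldType is a hypothetical helper and not part of Zend_Search_Lucene, and the Binary type is left out because it differs from UnIndexed only in the binary column.

```php
<?php
// The stored / indexed / tokenized flags from the field type table.
$fieldTypes = array(
    'Keyword'   => array('stored' => true,  'indexed' => true,  'tokenized' => false),
    'UnIndexed' => array('stored' => true,  'indexed' => false, 'tokenized' => false),
    'Text'      => array('stored' => true,  'indexed' => true,  'tokenized' => true),
    'UnStored'  => array('stored' => false, 'indexed' => true,  'tokenized' => true),
);

// Hypothetical helper: returns the name of the field type whose flags
// match the requirements, or null when no listed type matches.
function chooseFieldType(array $fieldTypes, $stored, $indexed, $tokenized)
{
    $wanted = array('stored' => $stored, 'indexed' => $indexed, 'tokenized' => $tokenized);

    foreach ($fieldTypes as $name => $flags) {
        if ($flags === $wanted) {
            return $name;
        }
    }

    return null;
}

// The article URL: displayed and searchable, but kept as a single term.
echo chooseFieldType($fieldTypes, true, true, false), "\n";

// The article contents: searchable word by word, but not kept in the index.
echo chooseFieldType($fieldTypes, false, true, true), "\n";
```

The two calls select Keyword and UnStored, the same types used for the url and contents fields in the indexing code.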

    Related articles:
    Zend Lucene Search - part2 - Real time indexing
    Zend Lucene Search - part3 - retrieving the indexed data
    Zend Lucene Search - part4 - Search Results Highlighting

    Google experiment

    Google and Yahoo! always come up with innovative ideas. Yahoo!'s YSlow is one such idea, which helps many web developers improve the performance of their websites.

    I recently came across an article about the Google experiment. It is an experiment any user can join, and it is aimed at improving the search experience.

    It allows us to join any one of the following experiments at a time:

    1. Alternate views for search results.
    2. Keyword suggestions.
    3. Keyboard shortcuts.
    4. Left-hand search navigation.
    5. Right-hand contextual search navigation.

    The keyword suggestions experiment is the same as Google Suggest, so I picked it as my first experiment.
