Sessions
Sessions and search engines
Sessions are necessary in many situations, especially for:
- Tracking a visitor's path through the website
- Storing information (such as the contents of a shopping cart)
The information related to the session is saved in a file on the server. A unique file and ID are generated for each user. The ID is a string of at least 32 characters that links the user to the file containing the session information.
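As a minimal sketch (the 'basket' key is only an example), a session is opened with session_start(), after which PHP exposes the generated ID and the $_SESSION array:

<?php
// Open (or resume) the session; PHP generates the ID on the first call
session_start();
// The unique identifier linking this visitor to the session file on the server
echo session_id();
// Anything stored here is written to that session file
$_SESSION['basket'] = array('item' => 42);
?>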
The ID can be stored in a cookie (on the user's computer); this is the most common situation. However, if the user does not accept cookies, it is possible to transmit this ID through the URL. You have probably already seen URLs with a sequence such as: ?PHPSESSID=a9e8dc705da1560e9b6d4c1a65ae3245.
Session issues for SEO
It is good to know that search engine robots don't accept cookies. Consequently, numerous websites work around this problem by using URLs to transmit the session ID, and this is where the problems begin.
Robots index this kind of URL. You can easily check it yourself: enter a query such as inurl:PHPSESSID in the search engine of your choice.
In general (it depends on the server configuration), session files are kept for 30 minutes. Thus, when the bot comes back, a new session is opened with a new URL to index, even though the content is identical to the page previously crawled. Bots can therefore end up indexing the same web page a hundred times.
As you can imagine, after a while, search engines detect three main problems:
- Non-persistent URLs
- Duplicate content
- An endless number of pages to index
A solution to make sessions disappear?
Saying that sessions should not be used is not a conceivable solution. Indeed, sessions can be very convenient for developers, as we saw at the beginning of this article. So how do you remove session IDs from URLs? Several solutions are possible:
- Open the session only when it is necessary. Some websites open a session from the very first page visited, whereas they only really use it once the user is logged in (see the sketch after this list).
- Pass the ID through cookies rather than through the URL.
- Ignore users who do not accept cookies and forbid the passage of IDs in URLs.
- Detect the bots and start a session only for human visitors.
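As a sketch of the first point (login.php and article.php are hypothetical page names), the call to session_start() can be limited to the pages that actually need it instead of living in a global header included everywhere:

<?php
// login.php - hypothetical page that actually needs session data
session_start();
$_SESSION['user'] = 'john';
?>

<?php
// article.php - hypothetical public page: no session_start(),
// so no cookie is set and no PHPSESSID is appended to its URLs
echo 'Public content, crawlable without a session ID';
?>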
Forbid session IDs in the URL with .htaccess
SetEnv SESSION_USE_TRANS_SID 0
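Whether this SetEnv form is honored depends on your host. When PHP runs as an Apache module and the host allows PHP flags in .htaccess (an assumption about your server configuration), an equivalent would be:

php_flag session.use_trans_sid off
php_flag session.use_only_cookies on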
Forbid session IDs in the URL through php.ini
session.use_trans_sid = 0
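If you also want to be sure that the ID is never read from the URL (assuming you can edit php.ini), you can add the companion directive:

session.use_only_cookies = 1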
As for IIS servers, I have already seen a network administrator fix the problem through a simple option in the admin panel.
Forbid the passage of session IDs in the URL through PHP
<?php
/* Disable recognition of the session ID in the URL */
ini_set('session.use_trans_sid', '0');
/* Allow the use of cookies */
ini_set('session.use_cookies', '1');
/* Only accept the session ID from cookies, never from the URL */
ini_set('session.use_only_cookies', '1');
/* Do not add the session ID to the generated HTML code */
ini_set('url_rewriter.tags', '');
/* Everything is under control, we can start the session */
session_start();
?>
It would be possible to make it shorter, but I have delivered the code here the paranoid way 🙂
Detect the robots in PHP
There are many methods to detect robots. For example, it is possible to check the user agent, the host and the IP of each visitor and then decide whether or not to start a session. Here is a proposal that checks the user agent only:
<?php
function checkUaRobot() {
    // Complete this list with all the UAs you require
    $_UA = array("GoogleBot", "Slurp", "MsnBot");
    foreach ($_UA as $ua) {
        // We compare the visitor's user agent to our list
        // (stripos replaces the removed eregi() and is case-insensitive)
        if (stripos($_SERVER["HTTP_USER_AGENT"], $ua) !== false) {
            return true;
        }
    }
    // The UA is not in our list, so it is a human visitor
    return false;
}

// We call the function and only start a session for human visitors
if (!checkUaRobot()) {
    session_start();
}
?>
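The article also mentions checking the host and the IP. As a hedged sketch (isGooglebotHost and the domain pattern are my own illustration, not part of the original code), a reverse DNS lookup can confirm whether an IP really belongs to Google's crawlers:

<?php
// Hypothetical complement: verify the host behind an alleged Googlebot IP
function isGooglebotHost($ip) {
    $host = gethostbyaddr($ip);
    if ($host === false) {
        return false;
    }
    // Genuine Google crawlers resolve to googlebot.com or google.com hosts
    return (bool) preg_match('/\.(googlebot|google)\.com$/i', $host);
}

// Example usage with the visitor's IP
if (isGooglebotHost($_SERVER['REMOTE_ADDR'])) {
    // Treat this visitor as a Google robot: no session needed
}
?>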
A last 'solution' would be to let Google deal with it. It would clean things up by itself one day or another by deleting the duplicate pages... However, this is not recommended.