PHP - Making a search engine

From Global Programming Syntax

A sample script

<form method="post">Scan site: <input type="text" name="site" value="http://" style="width:300px">
<input value="Scan" type="submit"></form>
<?php
set_time_limit (0);
if (isset($_POST['site']) && !empty($_POST['site'])) {
/* Formats Allowed */
$formats=array('html'=>true,'htm'=>true,'xhtml'=>true,'xml'=>true,'mhtml'=>true,'xht'=>true,
'mht'=>true,'asp'=>true,'aspx'=>true,'adp'=>true,'bml'=>true,'cfm'=>true,'cgi'=>true,
'ihtml'=>true,'jsp'=>true,'las'=>true,'lasso'=>true,'lassoapp'=>true,'pl'=>true,'php'=>true,
'php1'=>true,'php2'=>true,'php3'=>true,'php4'=>true,'php5'=>true,'php6'=>true,'phtml'=>true,
'shtml'=>true,'search'=>true,'query'=>true,'forum'=>true,'blog'=>true,'1'=>true,'2'=>true,
'3'=>true,'4'=>true,'5'=>true,'6'=>true,'7'=>true,'8'=>true,'9'=>true,'10'=>true,'11'=>true,
'12'=>true,'13'=>true,'14'=>true,'15'=>true,'16'=>true,'17'=>true,'18'=>true,'19'=>true,
'20'=>true,'01'=>true,'02'=>true,'03'=>true,'04'=>true,'05'=>true,'06'=>true,'07'=>true,
'08'=>true,'09'=>true,'go'=>true,'page'=>true,'file'=>true);
 
function domain ($ddomain) {
return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/','$1',$ddomain);
}
 
function url_exists($durl)
{
// Version 4.x supported
$handle = curl_init($durl);
if (false === $handle)
{
return false;
}
curl_setopt($handle, CURLOPT_HEADER, true);
curl_setopt($handle, CURLOPT_FAILONERROR, true); // this works
curl_setopt($handle, CURLOPT_HTTPHEADER,
Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
curl_setopt($handle, CURLOPT_NOBODY, true);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
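curl_exec($handle);
// Check the numeric status code rather than searching the headers for "200 OK",
// because HTTP/2 responses carry no "OK" reason phrase.
$status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
curl_close($handle);
return $status == 200;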
}
$f_data=''; // page contents shared between getlinks() and generate()
// The function below only collects links within the page's own domain, not links to other sites.
function getlinks($generateurlf) {
global $formats;
global $f_data;
$f_data=file_get_contents($generateurlf);
$datac=$f_data;
preg_match_all('/(href|src)\=(\"|\')([^\"\'\>]+)/i',$datac,$media);
unset($datac);
$datac=$media[3];
unset($media);
$datab=array();
$str_start=array('http'=>true,'www.'=>true);
foreach($datac AS $dfile) {
$generateurle=$generateurlf;
$format=strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/','$2',basename($generateurle.$dfile)));
if (!isset($str_start[substr_replace($dfile,'',4)])) {
if (substr_replace($generateurle,'',0, -1)!=='/') {
$generateurle=preg_replace('/(.*)\/[^\/]+/is', "$1", $generateurle);
} else {
$generateurle=substr_replace($generateurle,'',-1);
}
 
if (substr_replace($dfile,'',1)=='/') {
if (domain($generateurle)==domain($generateurle.$dfile)) {
if (isset($formats[$format])
|| substr($generateurle.$dfile,-1)=='/' || substr_count(basename($generateurle.$dfile),'.')==0) {
$datab[]=$generateurle.$dfile;
}
}
} else if (substr($dfile,0,2)=='./') {
$dfile=substr($dfile,2);
if (isset($formats[$format])) {$datab[]=$generateurle.'/'.$dfile;}
} else if (substr_replace($dfile,'',1)=='.') {
while (preg_match('/\.\.\/(.*)/i', $dfile)) {
$dfile=substr_replace($dfile,'',0,3);
$generateurle=preg_replace('/(.*)\/[^\/]+/i', "$1", $generateurle);
}
if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/'
|| substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
$datab[]=$generateurle.'/'.$dfile;
}
}
} else {
if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/'
|| substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
$datab[]=$generateurle.'/'.$dfile;
}
}
}
} else {
if (domain($generateurle)==domain($dfile)) {
if (isset($formats[$format]) || substr($dfile,-1)=='/' || substr_count(basename($dfile),'.')==0) {
$datab[]=$dfile;
}
}
}
unset($format);
}
unset($datac);
unset($dfile);
return $datab;
}
 
 
 
 
 
//=============================================
/* Modify only code between these two lines and $formats variable above. */
 
function generate($url) {
echo htmlspecialchars($url).'<br>'; // escape before output: crawled URLs are untrusted
global $f_data; //Data of file contents
//do something with webpage $f_data.
unset($f_data);
}
 
 
//=============================================
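// The loop below drives the crawl: it visits each queued URL and queues any new links found on it.
$sites=array();
$sites[]=stripslashes($_POST['site']);
$sites[stripslashes($_POST['site'])]=true; // mark the start URL as seen so it is not queued a second time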
for ($i=0;isset($sites[$i]);$i++) {
foreach (getlinks(stripslashes($sites[$i])) AS $val) {
if (!isset($sites[$val])) {
$sites[]=$val;
$sites[$val]=true;
}
} unset($val);
if (url_exists($sites[$i])) {
generate($sites[$i]);
flush();
}
}
}
?>
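
The driver loop at the end of the script uses a single array, $sites, both as the list of pages to visit and as the record of pages already seen. A minimal sketch of the same idea with the two roles separated might look like this. The crawl() function, the $fetchLinks callback, the $maxPages limit and the example.test URLs are illustrative assumptions, not part of the script; the callback stands in for getlinks() so the sketch runs without network access.

```php
<?php
// Hypothetical helper: breadth-first crawl with a separate queue and seen-set.
// $fetchLinks is any callable that returns the links found on a given URL.
function crawl($start, $fetchLinks, $maxPages = 100) {
    $queue = array($start);        // URLs waiting to be visited, in discovery order
    $seen = array($start => true); // URLs already queued, keyed for fast lookup
    $visited = array();
    for ($i = 0; isset($queue[$i]) && $i < $maxPages; $i++) {
        $visited[] = $queue[$i];
        foreach ($fetchLinks($queue[$i]) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

// A made-up link graph standing in for real pages.
$graph = array(
    'http://example.test/'  => array('http://example.test/a', 'http://example.test/b'),
    'http://example.test/a' => array('http://example.test/', 'http://example.test/b'),
    'http://example.test/b' => array('http://example.test/c'),
);
$fetchLinks = function ($url) use ($graph) {
    return isset($graph[$url]) ? $graph[$url] : array();
};
$order = crawl('http://example.test/', $fetchLinks);
print_r($order);
```

Each page is visited once even though the pages link back to one another, and the page limit keeps a run from growing without bound.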

Piecing it together

Initial functions needed

A search engine bot needs a few core functions before it can work. They convert a URL to its domain, check whether a URL exists, retrieve the links from a page, and process each page the bot comes across. The required functions are shown below.

function domain ($ddomain) {
return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/','$1',$ddomain);
}

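The function above reduces the input URL to its domain, keeping the http:// or https:// prefix if one is present. The function below checks whether the specified URL exists by requesting only its headers and testing for a 200 response.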

function url_exists($durl)
{
// Version 4.x supported
$handle = curl_init($durl);
if (false === $handle)
{
return false;
}
curl_setopt($handle, CURLOPT_HEADER, true);
curl_setopt($handle, CURLOPT_FAILONERROR, true); // this works
curl_setopt($handle, CURLOPT_HTTPHEADER,
Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
curl_setopt($handle, CURLOPT_NOBODY, true);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
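curl_exec($handle);
// Check the numeric status code rather than searching the headers for "200 OK",
// because HTTP/2 responses carry no "OK" reason phrase.
$status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
curl_close($handle);
return $status == 200;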
}
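
As a quick check of what domain() returns, the sketch below runs it on a few sample inputs. The function body is copied from the listing above; the example.com URLs are placeholders chosen for illustration.

```php
<?php
// Copied from the listing above: keeps the optional scheme plus the host, drops the path.
function domain($ddomain) {
    return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/', '$1', $ddomain);
}

// Placeholder URLs, for illustration only.
$samples = array(
    'http://example.com/dir/page.html',
    'https://example.com',
    'example.com/page.php?id=3',
);
foreach ($samples as $sample) {
    echo $sample . ' => ' . domain($sample) . "\n";
}
```

Because the scheme is kept, http://example.com and https://example.com compare as different domains wherever two domain() results are compared.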

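The getlinks function (below) fetches a page, collects every href and src value in it, and returns as an array the links whose extension appears in $formats. Links that end in a slash or have no extension are usually accepted too. The function only returns links within the same domain as the page.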

function getlinks($generateurlf) {
global $formats;
global $f_data;
$f_data=file_get_contents($generateurlf);
$datac=$f_data;
preg_match_all('/(href|src)\=(\"|\')([^\"\'\>]+)/i',$datac,$media);
unset($datac);
$datac=$media[3];
unset($media);
$datab=array();
$str_start=array('http'=>true,'www.'=>true);
foreach($datac AS $dfile) {
$generateurle=$generateurlf;
$format=strtolower(preg_replace('/(.*)[.]([^.\?]+)(\?(.*))?/','$2',basename($generateurle.$dfile)));
if (!isset($str_start[substr_replace($dfile,'',4)])) {
if (substr_replace($generateurle,'',0, -1)!=='/') {
$generateurle=preg_replace('/(.*)\/[^\/]+/is', "$1", $generateurle);
} else {
$generateurle=substr_replace($generateurle,'',-1);
}
 
if (substr_replace($dfile,'',1)=='/') {
if (domain($generateurle)==domain($generateurle.$dfile)) {
if (isset($formats[$format])
|| substr($generateurle.$dfile,-1)=='/' || substr_count(basename($generateurle.$dfile),'.')==0) {
$datab[]=$generateurle.$dfile;
}
}
} else if (substr($dfile,0,2)=='./') {
$dfile=substr($dfile,2);
if (isset($formats[$format])) {$datab[]=$generateurle.'/'.$dfile;}
} else if (substr_replace($dfile,'',1)=='.') {
while (preg_match('/\.\.\/(.*)/i', $dfile)) {
$dfile=substr_replace($dfile,'',0,3);
$generateurle=preg_replace('/(.*)\/[^\/]+/i', "$1", $generateurle);
}
if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/'
|| substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
$datab[]=$generateurle.'/'.$dfile;
}
}
} else {
if (domain($generateurle)==domain($generateurle.'/'.$dfile)) {
if (isset($formats[$format]) || substr($generateurle.'/'.$dfile,-1)=='/'
|| substr_count(basename($generateurle.'/'.$dfile),'.')==0) {
$datab[]=$generateurle.'/'.$dfile;
}
}
}
} else {
if (domain($generateurle)==domain($dfile)) {
if (isset($formats[$format]) || substr($dfile,-1)=='/' || substr_count(basename($dfile),'.')==0) {
$datab[]=$dfile;
}
}
}
unset($format);
}
unset($datac);
unset($dfile);
return $datab;
}
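
The first step of getlinks(), pulling every href and src value out of the page, can be tried in isolation. The sketch below applies the same pattern to a small HTML string; the extract_links() name and the sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical helper wrapping the pattern used in getlinks().
// Returns the raw attribute values, which may still be relative.
function extract_links($html) {
    preg_match_all('/(href|src)\=(\"|\')([^\"\'\>]+)/i', $html, $media);
    return $media[3]; // third capture group: the text between the quotes
}

// Sample markup, for illustration only.
$html = '<a href="/about.html">About</a>'
      . "<img src='img/logo.png'>"
      . '<a HREF="http://example.com/">Home</a>';
print_r(extract_links($html));
```

The pattern expects a quote directly after the equals sign, so unquoted values such as href=page.html are not picked up.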

The function below (generate) is the one that actually indexes or records the page, and it is the only function that needs changing. At the moment it just displays the link, but it could record the link and a short description in a database, or even record the media in the page. Remember to call unset($variable); on any variables you use once you are done with them, so that their memory is freed.

function generate($url) {
echo htmlspecialchars($url).'<br>'; // escape before output: crawled URLs are untrusted
global $f_data; //Data of file contents
//do something with webpage $f_data.
unset($f_data);
}
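
One way to fill in generate() is to pull a title and a short text snippet out of the page before recording it. The sketch below does this on an HTML string; summarise_page(), its 150-character limit and the sample markup are assumptions for illustration, and a real bot would probably store the result in a database rather than print it.

```php
<?php
// Hypothetical helper: returns the <title> text and a short plain-text snippet.
function summarise_page($html, $maxLength = 150) {
    $title = '';
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $match)) {
        $title = trim($match[1]);
    }
    // Drop title, script and style blocks so their contents stay out of the snippet.
    $text = preg_replace('/<(title|script|style)\b[^>]*>.*?<\/\1>/is', ' ', $html);
    // Replace every remaining tag with a space, then collapse runs of whitespace.
    $text = preg_replace('/<[^>]+>/', ' ', $text);
    $text = trim(preg_replace('/\s+/', ' ', $text));
    // substr() counts bytes, so multi-byte text may be cut mid-character.
    return array('title' => $title, 'description' => substr($text, 0, $maxLength));
}

// Sample markup, for illustration only.
$html = '<html><head><title> Demo page </title><style>p{color:red}</style></head>'
      . '<body><h1>Hello</h1><p>This is   a test.</p></body></html>';
$summary = summarise_page($html);
echo $summary['title'] . ': ' . $summary['description'] . "\n";
```

Inside generate(), the same helper could be called on the global $f_data and the result stored alongside $url.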


Changes to make for your customised bot

The only part of the sample source code above that really needs changing is the following section.

function generate($url) {
echo htmlspecialchars($url).'<br>'; // escape before output: crawled URLs are untrusted
$data=file_get_contents($url);
//do something with webpage $data.
unset($data);
}

This section is what actually indexes your bot's scan results. In this variant the entire HTML page is downloaded into the $data variable (the sample script reuses the global $f_data instead, which avoids a second download). By using regex (the preg_ functions) it is possible to make short descriptions and even find media to index.
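
Finding media to index can follow the same pattern-matching approach. The sketch below collects image URLs from an HTML string held in a variable such as $data; find_images() and the sample markup are illustrative assumptions, and the pattern only handles quoted src attributes.

```php
<?php
// Hypothetical helper: returns the src value of every <img> tag that has a quoted src.
function find_images($html) {
    preg_match_all('/<img\b[^>]*\bsrc\s*=\s*(["\'])(.*?)\1/is', $html, $found);
    // Remove repeats and renumber the keys from zero.
    return array_values(array_unique($found[2]));
}

// Sample markup, for illustration only.
$data = '<p><img src="a.png" alt="A"><IMG alt="B" SRC=\'pics/b.jpg\'><img src="a.png"></p>';
print_r(find_images($data));
```

The returned values may be relative paths, so they would still need to be resolved against the page URL before being stored.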

Conclusion

The descriptions above give you the basic skeleton of a bot, which is simple to expand. Only the generate() function needs modifying to suit your needs; the rest of the script is the core of the bot and has been done for you. Have fun making bots to scan your websites, but be aware that it would take a very long time to scan the entire internet.
