Crawling 200,000 Blog Posts from cnblogs (博客园) with a Web Crawler, HtmlAgilityPack, and a Windows Service

Wangchao Academy (王朝学院) · Author unknown · 2016-08-27

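1. Preface

Recently I was working on a project at my company that needed some article-style data. I thought of using a web crawler to collect it from technical websites, and since the site I visit most often is cnblogs (博客园), this article is the result.

Source code: CSDN download link

2. Preparation

I need to save the data crawled from cnblogs, and the best place for it is of course a database. So we first create a database and then a table to hold the data. It is all quite simple; the columns of the table are described below.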

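BlogArticleId: auto-increment ID of the post
BlogTitle: post title
BlogUrl: post URL
BlogAuthor: post author
BlogTime: publication time of the post
BlogMotto: the author's motto
BlogDepth: the depth at which the spider found the post
IsDeleted: whether the row has been deleted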
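The columns above can be mirrored by an in-memory DataTable that serves as a staging buffer before rows are written to the database. The following is a minimal sketch, assuming that BlogArticleId and IsDeleted are filled in by the database itself and that every other column is staged as a string; the names BlogTableSchema and CreateBuffer are illustrative and do not come from the original source.

```csharp
using System;
using System.Data;

public static class BlogTableSchema
{
    // Column names taken from the table description above.
    // BlogArticleId (identity) and IsDeleted are assumed to be set by the database.
    public static readonly string[] ColumnNames =
    {
        "BlogTitle", "BlogUrl", "BlogAuthor", "BlogTime", "BlogMotto", "BlogDepth"
    };

    // Builds an empty staging table with one string column per name.
    public static DataTable CreateBuffer()
    {
        var table = new DataTable("BlogArticle");
        foreach (string name in ColumnNames)
        {
            table.Columns.Add(name, typeof(string));
        }

        return table;
    }
}
```

Keeping the column names in one array means the staging table and any column mappings can be driven from the same list instead of being repeated by hand.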

With the table created, let's start with a database helper class.

using System;
using System.Data;

/// <summary>
/// Database helper class
/// </summary>
public class MssqlHelper
{
    #region Fields and properties

    /// <summary>
    /// Database connection string
    /// </summary>
    private static string conn = "Data Source=.;Initial Catalog=Cnblogs;User ID=sa;Password=123";

    #endregion

    #region Write data into the DataTable

    public static void GetData(string title, string url, string author, string time, string motto, string depth, DataTable dt)
    {
        DataRow dr;
        dr = dt.NewRow();
        dr["BlogTitle"] = title;
        dr["BlogUrl"] = url;
        dr["BlogAuthor"] = author;
        dr["BlogTime"] = time;
        dr["BlogMotto"] = motto;
        dr["BlogDepth"] = depth;

        // 2.0 Append dr to dt
        dt.Rows.Add(dr);
    }

    #endregion

    #region Insert data into the database

    /// <summary>
    /// Insert data into the database
    /// </summary>
    public static void InsertDb(DataTable dt)
    {
        try
        {
            using (System.Data.SqlClient.SqlBulkCopy copy = new System.Data.SqlClient.SqlBulkCopy(conn))
            {
                // 3.0.1 Set the name of the destination table
                copy.DestinationTableName = "BlogArticle";

                // 3.0.2 Tell SqlBulkCopy which destination column each column of the in-memory table maps to
                copy.ColumnMappings.Add("BlogTitle", "BlogTitle");
                copy.ColumnMappings.Add("BlogUrl", "BlogUrl");
                copy.ColumnMappings.Add("BlogAuthor", "BlogAuthor");
                copy.ColumnMappings.Add("BlogTime", "BlogTime");
                copy.ColumnMappings.Add("BlogMotto", "BlogMotto");
                copy.ColumnMappings.Add("BlogDepth", "BlogDepth");

                // 3.0.3 Bulk-insert all rows of the in-memory table dt into BlogArticle in one go
                copy.WriteToServer(dt);
                dt.Rows.Clear();
            }
        }
        catch (Exception)
        {
            // Note: the exception is swallowed here and the buffered rows are discarded.
            dt.Rows.Clear();
        }
    }

    #endregion
}
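InsertDb clears the buffered rows even when the bulk insert fails, so a failed batch is lost without any trace. The following is a minimal sketch of an alternative buffering policy, not the author's code: rows are cleared only after a successful flush, and failures are reported through a callback. The names BatchBuffer, flush, and onError are hypothetical, and the actual database write is left to the caller-supplied delegate.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: collects items and hands them to a flush delegate
// once the threshold is reached.
public sealed class BatchBuffer<T>
{
    private readonly List<T> items = new List<T>();
    private readonly int threshold;
    private readonly Action<IReadOnlyList<T>> flush;
    private readonly Action<Exception> onError;

    public BatchBuffer(int threshold, Action<IReadOnlyList<T>> flush, Action<Exception> onError)
    {
        if (threshold < 1) throw new ArgumentOutOfRangeException("threshold");
        if (flush == null) throw new ArgumentNullException("flush");
        this.threshold = threshold;
        this.flush = flush;
        this.onError = onError;
    }

    public int Count
    {
        get { lock (this.items) { return this.items.Count; } }
    }

    public void Add(T item)
    {
        lock (this.items)
        {
            this.items.Add(item);
            if (this.items.Count >= this.threshold)
            {
                this.FlushLocked();
            }
        }
    }

    private void FlushLocked()
    {
        try
        {
            // Pass a snapshot so the delegate cannot modify the buffer.
            this.flush(this.items.ToArray());
            this.items.Clear(); // cleared only after a successful flush
        }
        catch (Exception ex)
        {
            // Items stay in the buffer, so the next Add retries the flush.
            if (this.onError != null) this.onError(ex);
        }
    }
}
```

One trade-off of this sketch is that a flush which keeps failing lets the buffer grow without bound, so a real service would probably also cap the buffer size or drop old rows after a number of retries.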

3. Logging

Next, a log helper so that we can see what is going on. The code is as follows.

using System.IO;

/// <summary>
/// Log helper class
/// </summary>
public class LogHelper
{
    #region Write log

    // Append one line to the log file
    public static void WriteLog(string text)
    {
        // StreamWriter sw = new StreamWriter(AppDomain.CurrentDomain.BaseDirectory + "\\log.txt", true);
        using (StreamWriter sw = new StreamWriter("F:" + "\\log.txt", true))
        {
            sw.WriteLine(text);
        } // disposing the writer flushes and closes the file
    }

    #endregion
}
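The log helper writes to a fixed path on drive F: and takes no lock, which could become a problem if the crawler runs with more than one thread. The following is a minimal sketch of a variant that serializes writes and defaults to a file next to the executable (the location hinted at in the commented-out line); the class name SafeLog and the settable LogPath field are assumptions made for illustration.

```csharp
using System;
using System.IO;

public static class SafeLog
{
    private static readonly object Sync = new object();

    // Assumed default: log.txt in the application's base directory.
    public static string LogPath =
        Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "log.txt");

    // Appends one line; the lock keeps concurrent callers from interleaving writes.
    public static void WriteLog(string text)
    {
        lock (Sync)
        {
            File.AppendAllText(LogPath, text + Environment.NewLine);
        }
    }
}
```

File.AppendAllText opens, writes, and closes the file on every call, which is simple but slow for very chatty logging; a long-running service might prefer a dedicated logging library.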

4. The crawler

My web spider is built on a third-party class library. The code is as follows.

namespace Feng.SimpleCrawler
{
    using System;

    /// <summary>The add url event handler.</summary>
    /// <param name="args">The args.</param>
    /// <returns>The <see cref="bool"/>.</returns>
    public delegate bool AddUrlEventHandler(AddUrlEventArgs args);

    /// <summary>The add url event args.</summary>
    public class AddUrlEventArgs : EventArgs
    {
        #region Public Properties

        /// <summary>Gets or sets the depth.</summary>
        public int Depth { get; set; }

        /// <summary>Gets or sets the title.</summary>
        public string Title { get; set; }

        /// <summary>Gets or sets the url.</summary>
        public string Url { get; set; }

        #endregion
    }
}

AddUrlEventArgs.cs

namespace Feng.SimpleCrawler
{
    using System;
    using System.Collections;

    /// <summary>The bloom filter.</summary>
    /// <typeparam name="T">The generic type.</typeparam>
    public class BloomFilter<T>
    {
        #region Fields

        /// <summary>The get hash secondary.</summary>
        private readonly HashFunction getHashSecondary;

        /// <summary>The hash bits.</summary>
        private readonly BitArray hashBits;

        /// <summary>The hash function count.</summary>
        private readonly int hashFunctionCount;

        #endregion

        #region Constructors and Destructors

        /// <summary>Initializes a new instance of the <see cref="BloomFilter{T}"/> class.</summary>
        /// <param name="capacity">The capacity.</param>
        public BloomFilter(int capacity)
            : this(capacity, null)
        {
        }

        /// <summary>Initializes a new instance of the <see cref="BloomFilter{T}"/> class.</summary>
        /// <param name="capacity">The capacity.</param>
        /// <param name="errorRate">The error rate.</param>
        /// <remarks>An int error rate can never pass the (0, 1) range check below.</remarks>
        public BloomFilter(int capacity, int errorRate)
            : this(capacity, errorRate, null)
        {
        }

        /// <summary>Initializes a new instance of the <see cref="BloomFilter{T}"/> class.</summary>
        /// <param name="capacity">The capacity.</param>
        /// <param name="hashFunction">The hash function.</param>
        public BloomFilter(int capacity, HashFunction hashFunction)
            : this(capacity, BestErrorRate(capacity), hashFunction)
        {
        }

        /// <summary>Initializes a new instance of the <see cref="BloomFilter{T}"/> class.</summary>
        /// <param name="capacity">The capacity.</param>
        /// <param name="errorRate">The error rate.</param>
        /// <param name="hashFunction">The hash function.</param>
        public BloomFilter(int capacity, float errorRate, HashFunction hashFunction)
            : this(capacity, errorRate, hashFunction, BestM(capacity, errorRate), BestK(capacity, errorRate))
        {
        }

        /// <summary>Initializes a new instance of the <see cref="BloomFilter{T}"/> class.</summary>
        /// <param name="capacity">The capacity.</param>
        /// <param name="errorRate">The error rate.</param>
        /// <param name="hashFunction">The hash function.</param>
        /// <param name="m">The m.</param>
        /// <param name="k">The k.</param>
        public BloomFilter(int capacity, float errorRate, HashFunction hashFunction, int m, int k)
        {
            if (capacity < 1)
            {
                throw new ArgumentOutOfRangeException("capacity", capacity, "capacity must be > 0");
            }

            if (errorRate >= 1 || errorRate <= 0)
            {
                throw new ArgumentOutOfRangeException(
                    "errorRate",
                    errorRate,
                    string.Format("errorRate must be between 0 and 1, exclusive. Was {0}", errorRate));
            }

            if (m < 1)
            {
                throw new ArgumentOutOfRangeException(
                    string.Format(
                        "The provided capacity and errorRate values would result in an array of length > int.MaxValue. Please reduce either of these values. Capacity: {0}, Error rate: {1}",
                        capacity,
                        errorRate));
            }

            if (hashFunction == null)
            {
                if (typeof(T) == typeof(string))
                {
                    this.getHashSecondary = HashString;
                }
                else if (typeof(T) == typeof(int))
                {
                    this.getHashSecondary = HashInt32;
                }
                else
                {
                    throw new ArgumentNullException(
                        "hashFunction",
                        "Please provide a hash function for your type T, when T is not a string or int.");
                }
            }
            else
            {
                this.getHashSecondary = hashFunction;
            }

            this.hashFunctionCount = k;
            this.hashBits = new BitArray(m);
        }

        #endregion

        #region Delegates

        /// <summary>The hash function.</summary>
        /// <param name="input">The input.</param>
        /// <returns>The <see cref="int"/>.</returns>
        public delegate int HashFunction(T input);

        #endregion

        #region Public Properties

        /// <summary>Gets the truthiness.</summary>
        public double Truthiness
        {
            get { return (double)this.TrueBits() / this.hashBits.Count; }
        }

        #endregion

        #region Public Methods and Operators

        /// <summary>The add.</summary>
        /// <param name="item">The item.</param>
        public void Add(T item)
        {
            int primaryHash = item.GetHashCode();
            int secondaryHash = this.getHashSecondary(item);
            for (int i = 0; i < this.hashFunctionCount; i++)
            {
                int hash = this.ComputeHash(primaryHash, secondaryHash, i);
                this.hashBits[hash] = true;
            }
        }

        /// <summary>The contains.</summary>
        /// <param name="item">The item.</param>
        /// <returns>The <see cref="bool"/>.</returns>
        public bool Contains(T item)
        {
            int primaryHash = item.GetHashCode();
            int secondaryHash = this.getHashSecondary(item);
            for (int i = 0; i < this.hashFunctionCount; i++)
            {
                int hash = this.ComputeHash(primaryHash, secondaryHash, i);
                if (this.hashBits[hash] == false)
                {
                    return false;
                }
            }

            return true;
        }

        #endregion

        #region Methods

        /// <summary>The best error rate.</summary>
        /// <param name="capacity">The capacity.</param>
        /// <returns>The <see cref="float"/>.</returns>
        private static float BestErrorRate(int capacity)
        {
            var c = (float)(1.0 / capacity);
            if (Math.Abs(c) > 0)
            {
                return c;
            }

            double y = int.MaxValue / (double)capacity;
            return (float)Math.Pow(0.6185, y);
        }

        /// <summary>The best k.</summary>
        /// <param name="capacity">The capacity.</param>
        /// <param name="errorRate">The error rate.</param>
        /// <returns>The <see cref="int"/>.</returns>
        private static int BestK(int capacity, float errorRate)
        {
            return (int)Math.Round(Math.Log(2.0) * BestM(capacity, errorRate) / capacity);
        }

        /// <summary>The best m.</summary>
        /// <param name="capacity">The capacity.</param>
        /// <param name="errorRate">The error rate.</param>
        /// <returns>The <see cref="int"/>.</returns>
        private static int BestM(int capacity, float errorRate)
        {
            return (int)Math.Ceiling(capacity * Math.Log(errorRate, 1.0 / Math.Pow(2, Math.Log(2.0))));
        }

        /// <summary>The hash int 32.</summary>
        /// <param name="input">The input.</param>
        /// <returns>The <see cref="int"/>.</returns>
        private static int HashInt32(T input)
        {
            var x = input as uint?;
            unchecked
            {
                x = ~x + (x << 15);
                x = x ^ (x >> 12);
                x = x + (x << 2);
                x = x ^ (x >> 4);
                x = x * 2057;
                x = x ^ (x >> 16);
                return (int)x;
            }
        }

        /// <summary>The hash string.</summary>
        /// <param name="input">The input.</param>
        /// <returns>The <see cref="int"/>.</returns>
        private static int HashString(T input)
        {
            var str = input as string;
            int hash = 0;
            if (str != null)
            {
                for (int i = 0; i < str.Length; i++)
                {
                    hash += str[i];
                    hash += hash << 10;
                    hash ^= hash >> 6;
                }

                hash += hash << 3;
                hash ^= hash >> 11;
                hash += hash << 15;
            }

            return hash;
        }

        /// <summary>The compute hash.</summary>
        /// <param name="primaryHash">The primary hash.</param>
        /// <param name="secondaryHash">The secondary hash.</param>
        /// <param name="i">The i.</param>
        /// <returns>The <see cref="int"/>.</returns>
        private int ComputeHash(int primaryHash, int secondaryHash, int i)
        {
            int resultingHash = (primaryHash + (i * secondaryHash)) % this.hashBits.Count;
            return Math.Abs(resultingHash);
        }

        /// <summary>The true bits.</summary>
        /// <returns>The <see cref="int"/>.</returns>
        private int TrueBits()
        {
            int output = 0;
            foreach (bool bit in this.hashBits)
            {
                if (bit)
                {
                    output++;
                }
            }

            return output;
        }

        #endregion
    }
}

BloomFilter.cs
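The BestM and BestK helpers in the listing above implement the standard Bloom filter sizing rules: for n expected items and a target false-positive rate p, the bit count is m = -n ln(p) / (ln 2)^2 and the number of hash rounds is k = (m / n) ln 2. The logarithm with base 1 / 2^(ln 2) used in BestM is the same expression written differently. The following is a minimal standalone sketch of that arithmetic; the class name BloomSizing is illustrative and is not part of the library.

```csharp
using System;

public static class BloomSizing
{
    // m = -n * ln(p) / (ln 2)^2, rounded up to whole bits.
    public static int BestM(int capacity, double errorRate)
    {
        double ln2 = Math.Log(2.0);
        return (int)Math.Ceiling(-capacity * Math.Log(errorRate) / (ln2 * ln2));
    }

    // k = (m / n) * ln 2, rounded to the nearest whole number of hash rounds.
    public static int BestK(int capacity, double errorRate)
    {
        return (int)Math.Round(Math.Log(2.0) * BestM(capacity, errorRate) / capacity);
    }
}
```

For a capacity of 200000 and the default error rate of 1 / capacity used by the single-argument constructor, this works out to roughly 5.1 million bits (a little over 600 KB) and 18 hash rounds per lookup, which is a small price for not keeping every visited URL in memory.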

namespace Feng.SimpleCrawler
{
    using System;

    /// <summary>The crawl error event handler.</summary>
    /// <param name="args">The args.</param>
    public delegate void CrawlErrorEventHandler(CrawlErrorEventArgs args);

    /// <summary>The crawl error event args.</summary>
    public class CrawlErrorEventArgs : EventArgs
    {
        #region Public Properties

        /// <summary>Gets or sets the exception.</summary>
        public Exception Exception { get; set; }

        /// <summary>Gets or sets the url.</summary>
        public string Url { get; set; }

        #endregion
    }
}

CrawlErrorEventArgs.cs


namespace Feng.SimpleCrawler
{
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;
    using System.Linq;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading;

    /// <summary>The crawl master.</summary>
    public class CrawlMaster
    {
        #region Constants

        /// <summary>The web url regular expressions.</summary>
        private const string WebUrlRegularExpressions = @"^(http|https)://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

        #endregion

        #region Fields

        /// <summary>The cookie container.</summary>
        private readonly CookieContainer cookieContainer;

        /// <summary>The random.</summary>
        private readonly Random random;

        /// <summary>The thread status.</summary>
        private readonly bool[] threadStatus;

        /// <summary>The threads.</summary>
        private readonly Thread[] threads;

        #endregion

        #region Constructors and Destructors

        /// <summary>Initializes a new instance of the <see cref="CrawlMaster"/> class.</summary>
        /// <param name="settings">The settings.</param>
        public CrawlMaster(CrawlSettings settings)
        {
            this.cookieContainer = new CookieContainer();
            this.random = new Random();
            this.Settings = settings;
            this.threads = new Thread[settings.ThreadCount];
            this.threadStatus = new bool[settings.ThreadCount];
        }

        #endregion

        #region Public Events

        /// <summary>The add url event.</summary>
        public event AddUrlEventHandler AddUrlEvent;

        /// <summary>The crawl error event.</summary>
        public event CrawlErrorEventHandler CrawlErrorEvent;

        /// <summary>The data received event.</summary>
        public event DataReceivedEventHandler DataReceivedEvent;

        #endregion

        #region Public Properties

        /// <summary>Gets the settings.</summary>
        public CrawlSettings Settings { get; private set; }

        #endregion

        #region Public Methods and Operators

        /// <summary>The crawl.</summary>
        public void Crawl()
        {
            this.Initialize();
            for (int i = 0; i < this.threads.Length; i++)
            {
                this.threads[i].Start(i);
                this.threadStatus[i] = false;
            }
        }

        /// <summary>The stop.</summary>
        public void Stop()
        {
            foreach (Thread thread in this.threads)
            {
                thread.Abort();
            }
        }

        #endregion

        #region Methods

        /// <summary>The config request.</summary>
        /// <param name="request">The request.</param>
        private void ConfigRequest(HttpWebRequest request)
        {
            request.UserAgent = this.Settings.UserAgent;
            request.CookieContainer = this.cookieContainer;
            request.AllowAutoRedirect = true;
            request.MediaType = "text/html";
            request.Headers["Accept-Language"] = "zh-CN,zh;q=0.8";
            if (this.Settings.Timeout > 0)
            {
                request.Timeout = this.Settings.Timeout;
            }
        }

        /// <summary>The crawl process.</summary>
        /// <param name="threadIndex">The thread index.</param>
        private void CrawlProcess(object threadIndex)
        {
            var currentThreadIndex = (int)threadIndex;
            while (true)
            {
                // Decide whether this thread sleeps or exits, based on the number of
                // URLs in the queue and the number of idle threads
                if (UrlQueue.Instance.Count == 0)
                {
                    this.threadStatus[currentThreadIndex] = true;
                    if (!this.threadStatus.Any(t => t == false))
                    {
                        break;
                    }

                    Thread.Sleep(2000);
                    continue;
                }

                this.threadStatus[currentThreadIndex] = false;
                if (UrlQueue.Instance.Count == 0)
                {
                    continue;
                }

                UrlInfo urlInfo = UrlQueue.Instance.DeQueue();
                HttpWebRequest request = null;
                HttpWebResponse response = null;
                try
                {
                    if (urlInfo == null)
                    {
                        continue;
                    }

                    // Automatic rate limiting with a random 1 to 5 second interval
                    if (this.Settings.AutoSpeedLimit)
                    {
                        int span = this.random.Next(1000, 5000);
                        Thread.Sleep(span);
                    }

                    // Create and configure the web request
                    request = WebRequest.Create(urlInfo.UrlString) as HttpWebRequest;
                    this.ConfigRequest(request);
                    if (request != null)
                    {
                        response = request.GetResponse() as HttpWebResponse;
                    }

                    if (response != null)
                    {
                        this.PersistenceCookie(response);
                        Stream stream = null;

                        // If the page is compressed, decompress the data stream
                        if (response.ContentEncoding == "gzip")
                        {
                            Stream responseStream = response.GetResponseStream();
                            if (responseStream != null)
                            {
                                stream = new GZipStream(responseStream, CompressionMode.Decompress);
                            }
                        }
                        else
                        {
                            stream = response.GetResponseStream();
                        }

                        using (stream)
                        {
                            string html = this.ParseContent(stream, response.CharacterSet);
                            this.ParseLinks(urlInfo, html);
                            if (this.DataReceivedEvent != null)
                            {
                                this.DataReceivedEvent(
                                    new DataReceivedEventArgs
                                    {
                                        Url = urlInfo.UrlString,
                                        Depth = urlInfo.Depth,
                                        Html = html
                                    });
                            }

                            if (stream != null)
                            {
                                stream.Close();
                            }
                        }
                    }
                }
                catch (Exception exception)
                {
                    if (this.CrawlErrorEvent != null)
                    {
                        if (urlInfo != null)
                        {
                            this.CrawlErrorEvent(
                                new CrawlErrorEventArgs { Url = urlInfo.UrlString, Exception = exception });
                        }
                    }
                }
                finally
                {
                    if (request != null)
                    {
                        request.Abort();
                    }

                    if (response != null)
                    {
                        response.Close();
                    }
                }
            }
        }

        /// <summary>The initialize.</summary>
        private void Initialize()
        {
            if (this.Settings.SeedsAddress != null && this.Settings.SeedsAddress.Count > 0)
            {
                foreach (string seed in this.Settings.SeedsAddress)
                {
                    if (Regex.IsMatch(seed, WebUrlRegularExpressions, RegexOptions.IgnoreCase))
                    {
                        UrlQueue.Instance.EnQueue(new UrlInfo(seed) { Depth = 1 });
                    }
                }
            }

            for (int i = 0; i < this.Settings.ThreadCount; i++)
            {
                var threadStart = new ParameterizedThreadStart(this.CrawlProcess);
                this.threads[i] = new Thread(threadStart);
            }

            ServicePointManager.DefaultConnectionLimit = 256;
        }

        /// <summary>The is match regular.</summary>
        /// <param name="url">The url.</param>
        /// <returns>The <see cref="bool"/>.</returns>
        private bool IsMatchRegular(string url)
        {
            bool result = false;
            if (this.Settings.RegularFilterExpressions != null && this.Settings.RegularFilterExpressions.Count > 0)
            {
                if (this.Settings.RegularFilterExpressions.Any(
                    pattern => Regex.IsMatch(url, pattern, RegexOptions.IgnoreCase)))
                {
                    result = true;
                }
            }
            else
            {
                result = true;
            }

            return result;
        }

        /// <summary>The parse content.</summary>
        /// <param name="stream">The stream.</param>
        /// <param name="characterSet">The character set.</param>
        /// <returns>The <see cref="string"/>.</returns>
        private string ParseContent(Stream stream, string characterSet)
        {
            var memoryStream = new MemoryStream();
            stream.CopyTo(memoryStream);
            byte[] buffer = memoryStream.ToArray();

            Encoding encode = Encoding.ASCII;
            string html = encode.GetString(buffer);
            string localCharacterSet = characterSet;

            Match match = Regex.Match(html, "<meta([^<]*)charset=([^<]*)\"", RegexOptions.IgnoreCase);
            if (match.Success)
            {
                localCharacterSet = match.Groups[2].Value;
                var stringBuilder = new StringBuilder();
                foreach (char item in localCharacterSet)
                {
                    if (item == ' ')
                    {
                        break;
                    }

                    if (item != '\"')
                    {
                        stringBuilder.Append(item);
                    }
                }

                localCharacterSet = stringBuilder.ToString();
            }

            if (string.IsNullOrEmpty(localCharacterSet))
            {
                localCharacterSet = characterSet;
            }

            if (!string.IsNullOrEmpty(localCharacterSet))
            {
                encode = Encoding.GetEncoding(localCharacterSet);
            }

            memoryStream.Close();
            return encode.GetString(buffer);
        }

        /// <summary>The parse links.</summary>
        /// <param name="urlInfo">The url info.</param>
        /// <param name="html">The html.</param>
        private void ParseLinks(UrlInfo urlInfo, string html)
        {
            if (this.Settings.Depth > 0 && urlInfo.Depth >= this.Settings.Depth)
            {
                return;
            }

            var urlDictionary = new Dictionary<string, string>();
            Match match = Regex.Match(html, "(?i)<a .*?href=\"([^\"]+)\"[^>]*>(.*?)</a>");
            while (match.Success)
            {
                // Use the href as the key
                string urlKey = match.Groups[1].Value;

                // Use the link text as the value
                string urlValue = Regex.Replace(match.Groups[2].Value, "(?i)<.*?>", string.Empty);

                urlDictionary[urlKey] = urlValue;
                match = match.NextMatch();
            }

            foreach (var item in urlDictionary)
            {
                string href = item.Key;
                string text = item.Value;
                if (!string.IsNullOrEmpty(href))
                {
                    bool canBeAdd = true;
                    if (this.Settings.EscapeLinks != null && this.Settings.EscapeLinks.Count > 0)
                    {
                        if (this.Settings.EscapeLinks.Any(suffix => href.EndsWith(suffix, StringComparison.OrdinalIgnoreCase)))
                        {
                            canBeAdd = false;
                        }
                    }

                    if (this.Settings.HrefKeywords != null && this.Settings.HrefKeywords.Count > 0)
                    {
                        if (!this.Settings.HrefKeywords.Any(href.Contains))
                        {
                            canBeAdd = false;
                        }
                    }

                    if (canBeAdd)
                    {
                        string url = href.Replace("%3f", "?")
                            .Replace("%3d", "=")
                            .Replace("%2f", "/")
                            .Replace("&amp;", "&");

                        if (string.IsNullOrEmpty(url) || url.StartsWith("#")
                            || url.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase)
                            || url.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
                        {
                            continue;
                        }

                        var baseUri = new Uri(urlInfo.UrlString);
                        Uri currentUri = url.StartsWith("http", StringComparison.OrdinalIgnoreCase)
                            ? new Uri(url)
                            : new Uri(baseUri, url);

                        url = currentUri.AbsoluteUri;

                        if (this.Settings.LockHost)
                        {
                            // Strip the first (sub)domain label and compare the rest; if they are
                            // equal, treat both URLs as the same site.
                            // Example: mail.pzcast.com and www.pzcast.com
                            if (baseUri.Host.Split('.').Skip(1).Aggregate((a, b) => a + "." + b)
                                != currentUri.Host.Split('.').Skip(1).Aggregate((a, b) => a + "." + b))
                            {
                                continue;
                            }
                        }

                        if (!this.IsMatchRegular(url))
                        {
                            continue;
                        }

                        var addUrlEventArgs = new AddUrlEventArgs { Title = text, Depth = urlInfo.Depth + 1, Url = url };
                        if (this.AddUrlEvent != null && !this.AddUrlEvent(addUrlEventArgs))
                        {
                            continue;
                        }

                        UrlQueue.Instance.EnQueue(new UrlInfo(url) { Depth = urlInfo.Depth + 1 });
                    }
                }
            }
        }

        /// <summary>The persistence cookie.</summary>
        /// <param name="response">The response.</param>
        private void PersistenceCookie(HttpWebResponse response)
        {
            if (!this.Settings.KeepCookie)
            {
                return;
            }

            string cookies = response.Headers["Set-Cookie"];
            if (!string.IsNullOrEmpty(cookies))
            {
                var cookieUri = new Uri(
                    string.Format(
                        "{0}://{1}:{2}/",
                        response.ResponseUri.Scheme,
                        response.ResponseUri.Host,
                        response.ResponseUri.Port));
                this.cookieContainer.SetCookies(cookieUri, cookies);
            }
        }

        #endregion
    }
}

CrawlMaster.cs
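The link handling in ParseLinks comes down to three steps: find anchors with a regular expression, skip fragment, mailto, and javascript links, and resolve what is left against the URL of the current page. The following is a minimal standalone sketch of those steps under the same anchor pattern; the names LinkSketch, ExtractAbsoluteLinks, and SameSite are illustrative. It uses Uri.TryCreate and string.Join rather than the library's new Uri(...) and Aggregate calls, so malformed links and single-label hosts are handled without throwing, which is a deliberate deviation from the original.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LinkSketch
{
    // Same anchor pattern as ParseLinks: group 1 is the href, group 2 the inner HTML.
    private static readonly Regex Anchor =
        new Regex("(?i)<a .*?href=\"([^\"]+)\"[^>]*>(.*?)</a>");

    // Returns the absolute form of every crawlable link found in the page.
    public static List<string> ExtractAbsoluteLinks(string pageUrl, string html)
    {
        var baseUri = new Uri(pageUrl);
        var result = new List<string>();
        foreach (Match m in Anchor.Matches(html))
        {
            string href = m.Groups[1].Value;
            if (href.StartsWith("#")
                || href.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase)
                || href.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            // Resolves relative links against the page; absolute links pass through.
            Uri absolute;
            if (Uri.TryCreate(baseUri, href, out absolute))
            {
                result.Add(absolute.AbsoluteUri);
            }
        }

        return result;
    }

    // LockHost-style comparison: drop the first host label and compare the rest.
    public static bool SameSite(string a, string b)
    {
        return StripFirstLabel(new Uri(a).Host) == StripFirstLabel(new Uri(b).Host);
    }

    private static string StripFirstLabel(string host)
    {
        return string.Join(".", host.Split('.').Skip(1));
    }
}
```

For example, a link of `/dudu/p/1.html` found on `http://www.cnblogs.com/dudu/` resolves to `http://www.cnblogs.com/dudu/p/1.html`, and `mail.pzcast.com` and `www.pzcast.com` compare as the same site.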

namespace Feng.SimpleCrawler
{
    using System;
    using System.Collections.Generic;

    /// <summary>The crawl settings.</summary>
    [Serializable]
    public class CrawlSettings
    {
        #region Fields

        /// <summary>The depth.</summary>
        private byte depth = 3;

        /// <summary>The lock host.</summary>
        private bool lockHost = true;

        /// <summary>The thread count.</summary>
        private byte threadCount = 1;

        /// <summary>The timeout.</summary>
        private int timeout = 15000;

        /// <summary>The user agent.</summary>
        private string userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";

        #endregion

        #region Constructors and Destructors

        /// <summary>Initializes a new instance of the <see cref="CrawlSettings"/> class.</summary>
        public CrawlSettings()
        {
            this.AutoSpeedLimit = false;
            this.EscapeLinks = new List<string>();
            this.KeepCookie = true;
            this.HrefKeywords = new List<string>();
            this.LockHost = true;
            this.RegularFilterExpressions = new List<string>();
            this.SeedsAddress = new List<string>();
        }

        #endregion

        #region Public Properties

        /// <summary>Gets or sets a value indicating whether auto speed limit.</summary>
        public bool AutoSpeedLimit { get; set; }

        /// <summary>Gets or sets the depth.</summary>
        public byte Depth
        {
            get { return this.depth; }
            set { this.depth = value; }
        }

        /// <summary>Gets the escape links.</summary>
        public List<string> EscapeLinks { get; private set; }

        /// <summary>Gets or sets a value indicating whether keep cookie.</summary>
        public bool KeepCookie { get; set; }

        /// <summary>Gets the href keywords.</summary>
        public List<string> HrefKeywords { get; private set; }

        /// <summary>Gets or sets a value indicating whether lock host.</summary>
        public bool LockHost
        {
            get { return this.lockHost; }
            set { this.lockHost = value; }
        }

        /// <summary>Gets the regular filter expressions.</summary>
        public List<string> RegularFilterExpressions { get; private set; }

        /// <summary>Gets the seeds address.</summary>
        public List<string> SeedsAddress { get; private set; }

        /// <summary>Gets or sets the thread count.</summary>
        public byte ThreadCount
        {
            get { return this.threadCount; }
            set { this.threadCount = value; }
        }

        /// <summary>Gets or sets the timeout.</summary>
        public int Timeout
        {
            get { return this.timeout; }
            set { this.timeout = value; }
        }

        /// <summary>Gets or sets the user agent.</summary>
        public string UserAgent
        {
            get { return this.userAgent; }
            set { this.userAgent = value; }
        }

        #endregion
    }
}

CrawlSettings.cs

namespace Feng.SimpleCrawler
{
    /// <summary>The crawl status.</summary>
    public enum CrawlStatus
    {
        /// <summary>The completed.</summary>
        Completed = 1,

        /// <summary>The never been.</summary>
        NeverBeen = 2
    }
}

CrawlStatus.cs

namespace Feng.SimpleCrawler
{
    using System;

    /// <summary>The data received event handler.</summary>
    /// <param name="args">The args.</param>
    public delegate void DataReceivedEventHandler(DataReceivedEventArgs args);

    /// <summary>The data received event args.</summary>
    public class DataReceivedEventArgs : EventArgs
    {
        #region Public Properties

        /// <summary>Gets or sets the depth.</summary>
        public int Depth { get; set; }

        /// <summary>Gets or sets the html.</summary>
        public string Html { get; set; }

        /// <summary>Gets or sets the url.</summary>
        public string Url { get; set; }

        #endregion
    }
}

DataReceivedEventArgs.cs

namespace Feng.SimpleCrawler
{
    using System.Collections.Generic;
    using System.Threading;

    /// <summary>The security queue.</summary>
    /// <typeparam name="T">Any type.</typeparam>
    public abstract class SecurityQueue<T>
        where T : class
    {
        #region Fields

        /// <summary>The inner queue.</summary>
        protected readonly Queue<T> InnerQueue = new Queue<T>();

        /// <summary>The sync object.</summary>
        protected readonly object SyncObject = new object();

        /// <summary>The auto reset event.</summary>
        private readonly AutoResetEvent autoResetEvent;

        #endregion

        #region Constructors and Destructors

        /// <summary>Initializes a new instance of the <see cref="SecurityQueue{T}"/> class.</summary>
        protected SecurityQueue()
        {
            this.autoResetEvent = new AutoResetEvent(false);
        }

        #endregion

        #region Delegates

        /// <summary>The before en queue event handler.</summary>
        /// <param name="target">The target.</param>
        /// <returns>The <see cref="bool"/>.</returns>
        public delegate bool BeforeEnQueueEventHandler(T target);

        #endregion

        #region Public Events

        /// <summary>The before en queue event.</summary>
        public event BeforeEnQueueEventHandler BeforeEnQueueEvent;

        #endregion

        #region Public Properties

        /// <summary>Gets the auto reset event.</summary>
        public AutoResetEvent AutoResetEvent
        {
            get { return this.autoResetEvent; }
        }

        /// <summary>Gets the count.</summary>
        public int Count
        {
            get
            {
                lock (this.SyncObject)
                {
                    return this.InnerQueue.Count;
                }
            }
        }

        /// <summary>Gets a value indicating whether has value.</summary>
        public bool HasValue
        {
            get { return this.Count != 0; }
        }

        #endregion

        #region Public Methods and Operators

        /// <summary>The de queue.</summary>
        /// <returns>The dequeued item, or null when the queue is empty.</returns>
        public T DeQueue()
        {
            lock (this.SyncObject)
            {
                if (this.InnerQueue.Count > 0)
                {
                    return this.InnerQueue.Dequeue();
                }

                return default(T);
            }
        }

        /// <summary>The en queue.</summary>
        /// <param name="target">The target.</param>
        public void EnQueue(T target)
        {
            lock (this.SyncObject)
            {
                if (this.BeforeEnQueueEvent != null)
                {
                    if (this.BeforeEnQueueEvent(target))
                    {
                        this.InnerQueue.Enqueue(target);
                    }
                }
                else
                {
                    this.InnerQueue.Enqueue(target);
                }

                this.AutoResetEvent.Set();
            }
        }

        #endregion
    }
}

SecurityQueue.cs

namespace Feng.SimpleCrawler
{
    /// <summary>The url info.</summary>
    public class UrlInfo
    {
        #region Fields

        /// <summary>The url.</summary>
        private readonly string url;

        #endregion

        #region Constructors and Destructors

        /// <summary>Initializes a new instance of the <see cref="UrlInfo"/> class.</summary>
        /// <param name="urlString">The url string.</param>
        public UrlInfo(string urlString)
        {
            this.url = urlString;
        }

        #endregion

        #region Public Properties

        /// <summary>Gets or sets the depth.</summary>
        public int Depth { get; set; }

        /// <summary>Gets the url string.</summary>
        public string UrlString
        {
            get { return this.url; }
        }

        /// <summary>Gets or sets the status.</summary>
        public CrawlStatus Status { get; set; }

        #endregion
    }
}

UrlInfo.cs

namespace Feng.SimpleCrawler
{
    /// <summary>The url queue.</summary>
    public class UrlQueue : SecurityQueue<UrlInfo>
    {
        #region Constructors and Destructors

        /// <summary>Prevents a default instance of the <see cref="UrlQueue"/> class from being created.</summary>
        private UrlQueue()
        {
        }

        #endregion

        #region Public Properties

        /// <summary>Gets the instance.</summary>
        public static UrlQueue Instance
        {
            get { return Nested.Inner; }
        }

        #endregion

        /// <summary>The nested.</summary>
        private static class Nested
        {
            #region Static Fields

            /// <summary>The inner.</summary>
            internal static readonly UrlQueue Inner = new UrlQueue();

            #endregion
        }
    }
}

UrlQueue.cs

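5. Creating the Windows service

With all of that groundwork done, we finally reach the main point. A console program is not a stable way to run this, and crawling articles from cnblogs is a job that has to keep running reliably for a long time, so I thought of a Windows service. Create the Windows service; the code is as follows.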

using System;
using System.Data;
using System.ServiceProcess;
using Feng.SimpleCrawler;
using Feng.DbHelper;
using Feng.Log;
using HtmlAgilityPack;

namespace Feng.Demo
{
    /// <summary>
    /// Windows service
    /// </summary>
    partial class FengCnblogsService : ServiceBase
    {
        #region Constructor

        /// <summary>
        /// Constructor
        /// </summary>
        public FengCnblogsService()
        {
            InitializeComponent();
        }

        #endregion

        #region Fields and properties

        /// <summary>
        /// Settings for the spider
        /// </summary>
        private static readonly CrawlSettings Settings = new CrawlSettings();

        /// <summary>
        /// Temporary in-memory table that buffers the data
        /// </summary>
        private static DataTable dt = new DataTable();

        /// <summary>
        /// About the filter, see: http://www.cnblogs.com/heaad/archive/2011/01/02/1924195.html
        /// </summary>
        private static BloomFilter<string> filter;

        #endregion

        #region Start the service

        /// <summary>
        /// TODO: add code here to start the service.
        /// </summary>
        /// <param name="args"></param>
        protected override void OnStart(string[] args)
        {
            ProcessStart();
        }

        #endregion

        #region Stop the service

        /// <summary>
        /// TODO: add code here to perform the shutdown work needed to stop the service.
        /// </summary>
        protected override void OnStop()
        {
        }

        #endregion

        #region Main processing

        /// <summary>
        /// Main processing
        /// </summary>
        private void ProcessStart()
        {
            dt.Columns.Add("BlogTitle", typeof(string));
            dt.Columns.Add("BlogUrl", typeof(string));
            dt.Columns.Add("BlogAuthor", typeof(string));
            dt.Columns.Add("BlogTime", typeof(string));
            dt.Columns.Add("BlogMotto", typeof(string));
            dt.Columns.Add("BlogDepth", typeof(string));

            filter = new BloomFilter<string>(200000);
            const string CityName = "";

            #region Seed addresses

            // Set the seed addresses
            Settings.SeedsAddress.Add(string.Format("http://www.cnblogs.com/{0}", CityName));
            Settings.SeedsAddress.Add("http://www.cnblogs.com/artech");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/wuhuacong/");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/dudu/");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/guomingfeng/");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/daxnet/");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/fenglingyi");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/ahthw/");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/wangweimutou/");

            #endregion

            #region URL keywords

            Settings.HrefKeywords.Add("a/");
            Settings.HrefKeywords.Add("b/");
            Settings.HrefKeywords.Add("c/");
            Settings.HrefKeywords.Add("d/");
            Settings.HrefKeywords.Add("e/");
            Settings.HrefKeywords.Add("f/");
            Settings.HrefKeywords.Add("g/");
            Settings.HrefKeywords.Add("h/");
            Settings.HrefKeywords.Add("i/");
            Settings.HrefKeywords.Add("j/");
            Settings.HrefKeywords.Add("k/");
            Settings.HrefKeywords.Add("l/");
            Settings.HrefKeywords.Add("m/");
            Settings.HrefKeywords.Add("n/");
            Settings.HrefKeywords.Add("o/");
            Settings.HrefKeywords.Add("p/");
            Settings.HrefKeywords.Add("q/");
            Settings.HrefKeywords.Add("r/");
            Settings.HrefKeywords.Add("s/");
            Settings.HrefKeywords.Add("t/");
            Settings.HrefKeywords.Add("u/");
            Settings.HrefKeywords.Add("v/");
            Settings.HrefKeywords.Add("w/");
            Settings.HrefKeywords.Add("x/");
            Settings.HrefKeywords.Add("y/");
            Settings.HrefKeywords.Add("z/");

            #endregion

            // Number of crawler threads
            Settings.ThreadCount = 1;

            // Crawl depth
            Settings.Depth = 55;

            // Links to ignore while crawling, matched by suffix; more than one can be added
            Settings.EscapeLinks.Add("http://www.oschina.net/");

            // Automatic rate limiting with a random 1 to 5 second interval
            Settings.AutoSpeedLimit = false;

            // Whether to lock the host: strip the first domain label and compare the rest;
            // if they are equal, the URLs are treated as the same site
            Settings.LockHost = false;

            Settings.RegularFilterExpressions.Add(@"http://([w]{3}.)+[cnblogs]+.com/");

            var master = new CrawlMaster(Settings);
            master.AddUrlEvent += MasterAddUrlEvent;
            master.DataReceivedEvent += MasterDataReceivedEvent;
            master.Crawl();
        }

        #endregion

        #region Print the URL

        /// <summary>The master add url event.</summary>
        /// <param name="args">The args.</param>
        /// <returns>The <see cref="bool"/>.</returns>
        private static bool MasterAddUrlEvent(AddUrlEventArgs args)
        {
            if (!filter.Contains(args.Url))
            {
                filter.Add(args.Url);
                Console.WriteLine(args.Url);
                if (dt.Rows.Count > 200)
                {
                    MssqlHelper.InsertDb(dt);
                    dt.Rows.Clear();
                }

                return true;
            }

            return false; // Returning false means: do not add the URL to the queue
        }

        #endregion

        #region Parse the HTML

        /// <summary>The master data received event.</summary>
        /// <param name="args">The args.</param>
        private static void MasterDataReceivedEvent(SimpleCrawler.DataReceivedEventArgs args)
        {
            // Parse the page here. You can use something like HtmlAgilityPack (a page parsing
            // component), regular expressions, or your own string analysis.
            // SelectSingleNode returns null when nothing matches, so pages without these
            // elements throw here; the crawler's try/catch in CrawlProcess catches that.
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(args.Html);

            HtmlNode node = doc.DocumentNode.SelectSingleNode("//title");
            string title = node.InnerText;

            HtmlNode node2 = doc.DocumentNode.SelectSingleNode("//*[@id='post-date']");
            string time = node2.InnerText;

            HtmlNode node3 = doc.DocumentNode.SelectSingleNode("//*[@id='topics']/div/div[3]/a[1]");
            string author = node3.InnerText;

            HtmlNode node6 = doc.DocumentNode.SelectSingleNode("//*[@id='blogTitle']/h2");
            string motto = node6.InnerText;

            MssqlHelper.GetData(title, args.Url, author, time, motto, args.Depth.ToString(), dt);

            LogHelper.WriteLog(title);
            LogHelper.WriteLog(args.Url);
            LogHelper.WriteLog(author);
            LogHelper.WriteLog(time);
            LogHelper.WriteLog(motto == "" ? "null" : motto);
            LogHelper.WriteLog(args.Depth + "&dt.Rows.Count=" + dt.Rows.Count);

            // Write to the database whenever more than 100 rows have accumulated;
            // adjust the number to suit your own situation
            if (dt.Rows.Count > 100)
            {
                MssqlHelper.InsertDb(dt);
                dt.Rows.Clear();
            }
        }

        #endregion
    }
}

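At this point the crawler has fetched posts from cnblogs, and we use the third-party HtmlAgilityPack library to parse out the fields we need: post title, post author, post URL, and so on. We can also configure some properties of the service itself.

In the web crawler we have to set a few parameters: the seed addresses, the URL keywords, the crawl depth, and so on. Once that is done, all that remains is to install our Windows service, and the job is finished.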
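Two of the settings in the service code are worth a closer look. The 26 keywords "a/" to "z/" can be generated rather than typed out, and the filter expression `http://([w]{3}.)+[cnblogs]+.com/` is looser than it appears: `[cnblogs]+` is a character class rather than the literal word, and the unescaped dots match any character, so a URL such as `http://www.blog.com/x` also passes. The following is a minimal sketch of both points; the names SettingsSketch, LetterKeywords, CnblogsPattern, and IsCnblogsUrl are illustrative, and the stricter pattern assumes that only `http://www.cnblogs.com/` URLs are wanted, as in the seed list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SettingsSketch
{
    // Assumed stricter filter: literal host, escaped dots, anchored at the start.
    public const string CnblogsPattern = @"^http://www\.cnblogs\.com/";

    // Generates "a/", "b/", ..., "z/", the same 26 keywords the service adds one by one.
    public static List<string> LetterKeywords()
    {
        return Enumerable.Range('a', 26).Select(c => (char)c + "/").ToList();
    }

    public static bool IsCnblogsUrl(string url)
    {
        return Regex.IsMatch(url, CnblogsPattern, RegexOptions.IgnoreCase);
    }
}
```

With the library's own types, the two values could be applied as `Settings.HrefKeywords.AddRange(SettingsSketch.LetterKeywords())` and `Settings.RegularFilterExpressions.Add(SettingsSketch.CnblogsPattern)`.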

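6. Installing the Windows service

Here we install the Windows service with the tool that ships with VS.

After a successful installation, open the Windows Services console and the newly installed service appears in the list.

You can also open the log file to see the information parsed from the crawled posts, as shown in the figure below.

Now take a look at the database; my service had already been running for a day...

If you found this article useful, please recommend it. My abilities are limited, so if anything here is wrong, criticism is welcome. If you need the source code, leave your email address...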

 
 
 