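Scraping CSDN articles with Spring Boot and VW-Crawler
I: Project overview
Spring Boot provides the application framework, Redis stores the data, and vw-crawler is the crawler module that scrapes CSDN articles. The pom configuration is as follows: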
<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>1.4.3.RELEASE</version>
    <relativePath/>
</parent>

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <java.version>1.8</java.version>
    <vw-crawler.version>0.0.4</vw-crawler.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
        <exclusions>
            <exclusion>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-starter-tomcat</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>com.github.vector4wang</groupId>
        <artifactId>vw-crawler</artifactId>
        <version>0.0.5</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-redis</artifactId>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.31</version>
    </dependency>
</dependencies>
Redis configuration
# Redis database index (default 0)
spring.redis.database=0
# Redis server address
spring.redis.host=localhost
# Redis server port
spring.redis.port=6379
# Redis server password (empty by default)
spring.redis.password=
# Maximum number of connections in the pool (a negative value means no limit)
spring.redis.pool.max-active=8
# Maximum time to block waiting for a pooled connection (a negative value means no limit)
spring.redis.pool.max-wait=-1
# Maximum number of idle connections in the pool
spring.redis.pool.max-idle=8
# Minimum number of idle connections in the pool
spring.redis.pool.min-idle=0
# Connection timeout (milliseconds)
spring.redis.timeout=0
Redis access wrapper
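import com.alibaba.fastjson.JSON;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;
// Blog, Md5Util and StringUtils also need imports; the post does not show their packages.

@Component
public class DataCache {

    @Autowired
    private StringRedisTemplate redisTemplate;

    /**
     * Save the blog to Redis as JSON, keyed by the MD5 of its URL.
     */
    public void save(Blog blog) {
        redisTemplate.opsForValue().set(blog.getUrlMd5(), JSON.toJSONString(blog));
    }

    /**
     * Look up a blog by URL. Returns an empty Blog when nothing is cached.
     */
    public Blog get(String url) {
        String md5Url = Md5Util.getMD5(url.getBytes());
        String blogStr = redisTemplate.opsForValue().get(md5Url);
        if (StringUtils.isEmpty(blogStr)) {
            return new Blog();
        }
        return JSON.parseObject(blogStr, Blog.class);
    }
}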
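DataCache keys each entry by the MD5 of the article URL. The sketch below shows one way such a key could be derived with the JDK alone. `UrlKeySketch` and `md5Hex` are hypothetical names, and the post's own `Md5Util.getMD5` may format its result differently (for example as uppercase hex), so this is an illustration under that assumption rather than the actual helper.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not the Md5Util used in the post.
final class UrlKeySketch {

    // Returns the MD5 digest of the URL as 32 lowercase hex characters.
    static String md5Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }
}
```

A fixed-length key like this keeps Redis keys short and free of URL punctuation, at the cost that the URL cannot be read back from the key.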
II: Crawler
1: Page model, using cssSelector to locate the data
import java.io.Serializable;
import java.util.Date;
import java.util.List;
// CssSelector and SelectType come from vw-crawler; the post does not show their packages.

public class Blog implements Serializable {

    @CssSelector(selector = "#mainBox > main > div.blog-content-box > div.article-header-box > div > div.article-info-box > div > span.time", dateFormat = "yyyy年MM月dd日 HH:mm:ss")
    private Date publishDate;

    @CssSelector(selector = "main > div.blog-content-box > div.article-header-box > div.article-header>div.article-title-box > h1", resultType = SelectType.TEXT)
    private String title;

    @CssSelector(selector = "main > div.blog-content-box > div.article-header-box > div.article-header>div.article-info-box > div > div > span.read-count", resultType = SelectType.TEXT)
    private String readCountStr;

    // Filled in by the page parser from readCountStr.
    private int readCount;

    @CssSelector(selector = "#article_content", resultType = SelectType.TEXT)
    private String content;

    @CssSelector(selector = "body > div.tool-box > ul > li:nth-child(1) > button > p", resultType = SelectType.TEXT)
    private int likeCount;

    /**
     * Lists cannot be parsed automatically yet, so the raw HTML is kept in an
     * intermediate field and has to be parsed a second time.
     */
    @CssSelector(selector = "#mainBox > main > div.comment-box > div.comment-list-container > div.comment-list-box", resultType = SelectType.HTML)
    private String comentTmp;

    private String url;

    private String urlMd5;

    private List<String> comment;

    // getters and setters omitted
}
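The `dateFormat` attribute on `publishDate` uses the pattern `yyyy年MM月dd日 HH:mm:ss`. The post does not show how vw-crawler applies it, so the sketch below only assumes it behaves like a `java.text.SimpleDateFormat` pattern, where characters outside A-Z and a-z are matched literally. `PublishDateSketch` is a hypothetical name, and the pattern is written with Unicode escapes to keep the source ASCII-only.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical illustration; vw-crawler's own date handling is not shown in the post.
final class PublishDateSketch {

    // Same pattern as the dateFormat attribute, with the CJK year, month and
    // day characters written as Unicode escapes.
    static final String PATTERN = "yyyy\u5e74MM\u6708dd\u65e5 HH:mm:ss";

    // Parses text such as the publish time scraped from span.time.
    static Date parse(String text) {
        SimpleDateFormat format = new SimpleDateFormat(PATTERN);
        format.setLenient(false); // reject impossible values such as month 13
        try {
            return format.parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unexpected date text: " + text, e);
        }
    }
}
```

If the page ever changes its date layout, a strict parse like this fails loudly instead of silently storing a wrong date.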
2: Configure the request headers, crawler thread count, timeout and so on, and implement the CrawlerService crawler interface
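import org.jsoup.nodes.Document;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.core.annotation.Order;
import org.springframework.stereotype.Component;
// VWCrawler, CrawlerService, Md5Util and Blog also need imports; the post does not show their packages.

@Component
@Order
public class Crawler implements CommandLineRunner {

    @Autowired
    private DataCache dataCache;

    @Override
    public void run(String... strs) {
        new VWCrawler.Builder()
                .setUrl("https://blog.csdn.net/qqHJQS")
                .setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36")
                .setTargetUrlRex("https://blog.csdn.net/zhang__l/article/details/[0-9]+")
                .setThreadCount(5)
                .setTimeOut(5000)
                .setPageParser(new CrawlerService<Blog>() {
                    @Override
                    public void parsePage(Document doc, Blog pageObj) {
                        // Strip the "阅读数:" label and fall back to 0 when nothing is left.
                        String readCountStr = pageObj.getReadCountStr();
                        String count = readCountStr == null ? "" : readCountStr.replace("阅读数:", "").trim();
                        pageObj.setReadCount(count.isEmpty() ? 0 : Integer.parseInt(count));
                        pageObj.setUrl(doc.baseUri());
                        pageObj.setUrlMd5(Md5Util.getMD5(pageObj.getUrl().getBytes()));
                        // TODO: the comment list is not processed yet
                    }

                    @Override
                    public void save(Blog pageObj) {
                        dataCache.save(pageObj);
                    }
                })
                .build()
                .start();
    }
}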
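Two small steps in this crawler are easy to get wrong: deciding which links count as article pages, and turning the scraped read-count label into a number. The sketch below isolates both with the JDK only. `CrawlerParsingSketch`, `isTargetUrl` and `parseReadCount` are hypothetical names, and it is an assumption that vw-crawler requires the whole URL to match the target pattern. Note that in Java, `String.replace("", "0")` inserts a `0` at every position instead of supplying a default, so the empty case is handled explicitly here.

```java
import java.util.regex.Pattern;

// Hypothetical helpers for illustration; they are not part of vw-crawler.
final class CrawlerParsingSketch {

    // Same target pattern as in the post, with the dots escaped so they match literally.
    static final Pattern TARGET_URL =
            Pattern.compile("https://blog\\.csdn\\.net/zhang__l/article/details/[0-9]+");

    // Assumption: a URL is a target only when the whole string matches the pattern.
    static boolean isTargetUrl(String url) {
        return url != null && TARGET_URL.matcher(url).matches();
    }

    // Keeps only the digits of the scraped label text; returns 0 when there are none.
    static int parseReadCount(String text) {
        if (text == null) {
            return 0;
        }
        String digits = text.replaceAll("[^0-9]", "");
        return digits.isEmpty() ? 0 : Integer.parseInt(digits);
    }
}
```

Dropping every non-digit character means the parser does not depend on the exact label text in front of the number, which helps if the page wording changes.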
3: Run
Right-click CrawlerApplication and run it.


Posted on 2019-05-14 15:42 by 張亮13128600812