
      HttpHelper Class in Java and C# (Solving Garbled Text in Web Scraping)

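      Projects that scrape web pages often run into garbled text (mojibake). The simplest fix is to check which encoding the target site uses and decode with that encoding. Checking every page by hand does not scale, though, especially when you fetch pages from many different sites, as a web crawler does. In that case we need a reasonably general routine that identifies the page encoding correctly.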

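      Garbled text almost always comes from an encoding mismatch: if a page is encoded as UTF-8 and you read it as GB2312, the result is garbled. Once that is clear, what remains is working out the page's encoding. Chinese pages are mostly encoded as GBK, GB2312, UTF-8, or BIG-5, and in practice this reduces to GBK and UTF-8. It is no exaggeration to say that today's sites use either GBK or UTF-8, so the question becomes which of the two a given site uses.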
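The mismatch described above is easy to reproduce. The following is a minimal sketch and not part of the original code: the class name `MojibakeDemo`, the helper `transcode`, and the sample string are illustrative assumptions, and it assumes the JDK in use ships the GBK charset (typical for full JDK builds, but not guaranteed by the Java specification).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: available in this JDK
        String original = "\u4e2d\u6587";     // two Chinese characters, written as escapes

        // Same charset on both sides: the text survives the round trip.
        System.out.println(
                transcode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(original)); // true

        // UTF-8 bytes read as GBK: the text comes back garbled.
        System.out.println(
                transcode(original, StandardCharsets.UTF_8, gbk).equals(original)); // false
    }
}
```

The bytes are identical in both calls; only the charset used to interpret them changes, which is why the fix is always to find the right charset rather than to re-fetch the page.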

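      So how do we determine a page's encoding? First check the Content-Type response header. If it carries no charset, look for the meta tag in the page itself. If that fails too, fall back to a default encoding; I recommend UTF-8. For example, a request to the cnblogs (博客園) home page http://www.rzrgm.cn/ returns Content-Type: text/html; charset=utf-8 in the response headers, which tells us cnblogs uses utf-8. Not every site puts the encoding in Content-Type, however. Baidu, for example, returns just Content-Type: text/html with no charset, so we have to look inside the page for <meta http-equiv=Content-Type content="text/html;charset=utf-8"> to confirm the encoding. In summary, the steps are: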

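      1. Look for charset in the Content-Type response header. If it is found, skip steps 2 and 3 and go straight to step 4.
      2. If step 1 yields no charset, read the page content and parse the charset from the meta tag.
      3. If step 2 yields no encoding either, fall back to UTF-8 as the default.
      4. Decode the response bytes again using the charset obtained.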
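The four steps can be sketched on an in-memory byte array, independent of any HTTP client. This is an illustrative sketch under stated assumptions, not the article's implementation: the class `CharsetSniffer` and its methods `find` and `decode` are hypothetical names. The first pass decodes with ISO-8859-1, an assumption chosen because it maps every byte to exactly one character, so an ASCII meta tag stays readable whatever the real encoding is.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetSniffer {
    // Matches charset=utf-8, charset="utf-8" and charset='utf-8'
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first charset name found in the text, or null if there is none. */
    static String find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Header first, then the meta tag, then the UTF-8 default. */
    static String decode(String contentType, byte[] body) {
        String name = find(contentType);                                // step 1
        if (name == null) {
            name = find(new String(body, StandardCharsets.ISO_8859_1)); // step 2
        }
        Charset charset = StandardCharsets.UTF_8;                       // step 3
        if (name != null) {
            try {
                charset = Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: keep the UTF-8 default.
            }
        }
        return new String(body, charset);                               // step 4
    }

    public static void main(String[] args) {
        byte[] page = "<meta charset=\"utf-8\"><p>\u4e2d\u6587</p>"
                .getBytes(StandardCharsets.UTF_8);
        // The header has no charset, so the meta tag decides.
        System.out.println(decode("text/html", page));
    }
}
```

Falling back to UTF-8 when the declared name is not recognized is a design choice of this sketch; a crawler might prefer to log such pages for manual checking instead.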

      This approach decodes the vast majority of pages correctly. For the few it cannot identify, you have to check the encoding by hand.

      Notes:

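      1. Almost all sites now support gzip compression, so add Accept-Encoding: gzip,deflate to the request headers. The site then returns a compressed stream, which noticeably improves request efficiency.
      2. A network stream does not support seeking, so it can be read only once. Read the HTTP response stream into memory first so the bytes can be decoded a second time; there is no need to send the request again just to get the response stream back.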
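Both notes can be illustrated without a network connection. In the sketch below, the class `BufferedBody` and its methods `readAll` and `gzip` are hypothetical names; `gzip` only stands in for a server that compresses its response, and a `ByteArrayInputStream` plays the part of the network stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class BufferedBody {
    /** Drains a read-once stream into a byte array that can be decoded repeatedly. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int count;
        while ((count = in.read(buffer)) != -1) {
            out.write(buffer, 0, count);
        }
        return out.toByteArray();
    }

    /** Helper standing in for a server that gzips its response body. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = gzip("<p>\u4e2d\u6587</p>".getBytes(StandardCharsets.UTF_8));

        // Unwrap the compressed stream once and keep the plain bytes in memory.
        byte[] body = readAll(new GZIPInputStream(new ByteArrayInputStream(wire)));

        // The same bytes can now be decoded as many times as needed.
        String firstPass = new String(body, StandardCharsets.ISO_8859_1);
        String secondPass = new String(body, StandardCharsets.UTF_8);
        System.out.println(firstPass.length() + " " + secondPass.length()); // prints "13 9"
    }
}
```

The first pass sees 13 single-byte characters and the second sees 9 characters, because each of the two Chinese characters occupies three bytes in UTF-8.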

      Java and C# implementations follow. Links to the source on git are at the bottom of the page; download them if you need them.

      Java implementation

      package com.cnblogs.lzrabbit.util;
      
      import java.io.*;
      import java.net.*;
      import java.util.*;
      import java.util.Map.Entry;
      import java.util.regex.*;
      import java.util.zip.*;
      
      public class HttpUtil {
      
          public static String sendGet(String url) throws Exception {
              return send(url, "GET", null, null);
          }
      
          public static String sendPost(String url, String param) throws Exception {
              return send(url, "POST", param, null);
          }
      
          public static String send(String url, String method, String param, Map<String, String> headers) throws Exception {
              String result = null;
              HttpURLConnection conn = getConnection(url, method, param, headers);
              String charset = conn.getHeaderField("Content-Type");
              charset = detectCharset(charset);
              InputStream input = getInputStream(conn);
              ByteArrayOutputStream output = new ByteArrayOutputStream();
              int count;
              byte[] buffer = new byte[4096];
              while ((count = input.read(buffer, 0, buffer.length)) > 0) {
                  output.write(buffer, 0, count);
              }
              input.close();
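              // If the response header already gave us a charset, there is no need to search the HTML
              if (charset == null || charset.equals("")) {
                  // ISO-8859-1 keeps the ASCII meta tag readable regardless of the platform default charset
                  charset = detectCharset(output.toString("ISO-8859-1"));
                  // If the HTML has no charset either, default to utf-8
                  if (charset == null || charset.equals("")) {
                      charset = "utf-8";
                  }
              }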
              
              result = output.toString(charset);
              output.close();
      
              // result = output.toString(charset);
              // BufferedReader bufferReader = new BufferedReader(new
              // InputStreamReader(input, charset));
              // String line;
              // while ((line = bufferReader.readLine()) != null) {
              // if (result == null)
              // bufferReader.mark(1);
              // result += line;
              // }
              // bufferReader.close();
      
              return result;
          }
      
          private static String detectCharset(String input) {
              Pattern pattern = Pattern.compile("charset=\"?([\\w\\d-]+)\"?;?", Pattern.CASE_INSENSITIVE);
              if (input != null && !input.equals("")) {
                  Matcher matcher = pattern.matcher(input);
                  if (matcher.find()) {
                      return matcher.group(1);
                  }
              }
              return null;
          }
      
          private static InputStream getInputStream(HttpURLConnection conn) throws Exception {
              String contentEncoding = conn.getHeaderField("Content-Encoding");
              if (contentEncoding != null) {
                  contentEncoding = contentEncoding.toLowerCase();
                  // indexOf returns -1 (not 1) when the token is absent
                  if (contentEncoding.indexOf("gzip") != -1)
                      return new GZIPInputStream(conn.getInputStream());
                  // InflaterInputStream decompresses; DeflaterInputStream would compress the data again
                  else if (contentEncoding.indexOf("deflate") != -1)
                      return new InflaterInputStream(conn.getInputStream());
              }
      
              return conn.getInputStream();
          }
      
          static HttpURLConnection getConnection(String url, String method, String param, Map<String, String> header) throws Exception {
              HttpURLConnection conn = (HttpURLConnection) (new URL(url)).openConnection();
              conn.setRequestMethod(method);
      
              // Set common request headers
              conn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
              conn.setRequestProperty("Connection", "keep-alive");
              conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36");
              conn.setRequestProperty("Accept-Encoding", "gzip,deflate");
      
              String ContentEncoding = null;
              if (header != null) {
                  for (Entry<String, String> entry : header.entrySet()) {
                      if (entry.getKey().equalsIgnoreCase("Content-Encoding"))
                          ContentEncoding = entry.getValue();
                      conn.setRequestProperty(entry.getKey(), entry.getValue());
                  }
              }
      
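              // Compare string contents; == only compares references
              if ("POST".equalsIgnoreCase(method)) {
                  conn.setDoOutput(true);
                  conn.setDoInput(true);
                  if (param != null && !param.equals("")) {
                      OutputStream output = conn.getOutputStream();
                      if (ContentEncoding != null) {
                          // indexOf returns 0 for a match at the start, so test >= 0
                          if (ContentEncoding.indexOf("gzip") >= 0) {
                              output = new GZIPOutputStream(output);
                          }
                          else if (ContentEncoding.indexOf("deflate") >= 0) {
                              output = new DeflaterOutputStream(output);
                          }
                      }
                      // Explicit charset (UTF-8, as in the C# version) instead of the platform default
                      output.write(param.getBytes("UTF-8"));
                      // Closing flushes the body and finishes any compressed stream
                      output.close();
                  }
              }
              // Open the actual connection
              conn.connect();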
              return conn;
          }
      }

      C# implementation

      using System;
      using System.Collections;
      using System.IO;
      using System.Linq;
      using System.Net;
      using System.Net.Security;
      using System.Security.Cryptography.X509Certificates;
      using System.Text;
      using System.Text.RegularExpressions;
      using System.Web;
      using System.IO.Compression;
      using System.Collections.Generic;
      using System.Collections.Specialized;
      
      namespace CSharp.Util.Net
      {
          public class HttpHelper
          {
              private static bool RemoteCertificateValidate(object sender, X509Certificate certificate, X509Chain chain, SslPolicyErrors errors)
              {
                  // Used for https requests
                  return true; // Always accept; note that this disables certificate validation
              }
      
              public static string SendPost(string url, string data)
              {
                  return Send(url, "POST", data, null);
              }
      
              public static string SendGet(string url)
              {
                  return Send(url, "GET", null, null);
              }
      
              public static string Send(string url, string method, string data, HttpConfig config)
              {
                  if (config == null) config = new HttpConfig();
                  string result;
                  using (HttpWebResponse response = GetResponse(url, method, data, config))
                  {
                      Stream stream = response.GetResponseStream();
                     
                      if (!String.IsNullOrEmpty(response.ContentEncoding))
                      {
                          if (response.ContentEncoding.Contains("gzip"))
                          {
                              stream = new GZipStream(stream, CompressionMode.Decompress);
                          }
                          else if (response.ContentEncoding.Contains("deflate"))
                          {
                              stream = new DeflateStream(stream, CompressionMode.Decompress);
                          }
                      }
                    
                      byte[] bytes = null;
                      using (MemoryStream ms = new MemoryStream())
                      {
                          int count;
                          byte[] buffer = new byte[4096];
                          while ((count = stream.Read(buffer, 0, buffer.Length)) > 0)
                          {
                              ms.Write(buffer, 0, count);
                          }
                          bytes = ms.ToArray();
                      }
      
                      #region Detect stream encoding
                      Encoding encoding;
      
                      // Check whether the response header specifies a charset; if it does, use it
                      // Note: when the response header carries no charset, CharacterSet is often reported as ISO-8859-1
                      if (!string.IsNullOrEmpty(response.CharacterSet) && response.CharacterSet.ToUpper() != "ISO-8859-1")
                      {
                          encoding = Encoding.GetEncoding(response.CharacterSet == "utf8" ? "utf-8" : response.CharacterSet);
                      }
                      else
                      {
                          // No charset in the response header, so look for the charset in the HTML meta tag
                          result = Encoding.Default.GetString(bytes);
                          // Use a regular expression to find the page encoding in the returned HTML
                          Match match = Regex.Match(result, @"<meta.*charset=""?([\w-]+)""?.*>", RegexOptions.IgnoreCase);
                          if (match.Success)
                          {
                              encoding = Encoding.GetEncoding(match.Groups[1].Value);
                          }
                          else
                          {
                              // No charset in the HTML either, so use the default (utf-8)
                              encoding = Encoding.GetEncoding(config.CharacterSet);
                          }
                      }
                      #endregion
      
                      result = encoding.GetString(bytes);
                  }
                  return result;
              }
      
              private static HttpWebResponse GetResponse(string url, string method, string data, HttpConfig config)
              {
                  HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
                  request.Method = method;
                  request.Referer = config.Referer;
                  // Some pages return no content unless a User-Agent is set
                  request.UserAgent = config.UserAgent;
                  request.Timeout = config.Timeout;
                  request.Accept = config.Accept;
                  request.Headers.Set("Accept-Encoding", config.AcceptEncoding);
                  request.ContentType = config.ContentType;
                  request.KeepAlive = config.KeepAlive;
      
                  if (url.ToLower().StartsWith("https"))
                  {
                      // Added to work around an https error seen in production: Could not establish trust relationship for the SSL/TLS secure channel
                      ServicePointManager.ServerCertificateValidationCallback = new RemoteCertificateValidationCallback(RemoteCertificateValidate);
                  }
      
      
                  if (method.ToUpper() == "POST")
                  {
                      if (!string.IsNullOrEmpty(data))
                      {
                          byte[] bytes = Encoding.UTF8.GetBytes(data);
      
                          if (config.GZipCompress)
                          {
                              using (MemoryStream stream = new MemoryStream())
                              {
                                  using (GZipStream gZipStream = new GZipStream(stream, CompressionMode.Compress))
                                  {
                                      gZipStream.Write(bytes, 0, bytes.Length);
                                  }
                                  bytes = stream.ToArray();
                              }
                          }
      
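                          request.ContentLength = bytes.Length;
                          // Dispose the request stream so the body is flushed and the stream is released
                          using (Stream requestStream = request.GetRequestStream())
                          {
                              requestStream.Write(bytes, 0, bytes.Length);
                          }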
                      }
                      else
                      {
                          request.ContentLength = 0;
                      }
                  }
      
                  return (HttpWebResponse)request.GetResponse();
              }      
          }
      
          public class HttpConfig
          {
              public string Referer { get; set; }
      
              /// <summary>
              /// Default: text/html
              /// </summary>
              public string ContentType { get; set; }
      
              public string Accept { get; set; }
      
              public string AcceptEncoding { get; set; }
      
              /// <summary>
              /// Timeout in milliseconds; default 100000
              /// </summary>
              public int Timeout { get; set; }
      
              public string UserAgent { get; set; }
      
              /// <summary>
              /// Whether to gzip-compress the request body for POST requests
              /// </summary>
              public bool GZipCompress { get; set; }
      
              public bool KeepAlive { get; set; }
      
              public string CharacterSet { get; set; }
      
              public HttpConfig()
              {
                  this.Timeout = 100000;
                  this.ContentType = "text/html; charset=" + Encoding.UTF8.WebName;
                  this.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36";
                  this.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
                  this.AcceptEncoding = "gzip,deflate";
                  this.GZipCompress = false;
                  this.KeepAlive = true;
                  this.CharacterSet = "UTF-8";
              }
          }
      }

      HttpUtil.java

      HttpHelper.cs

      posted @ 2014-03-02 18:00  懶惰的肥兔