Web Crawling with HttpClient

1. Simulating a GET request to fetch HTML

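// Apache HttpClient 4.x; the classes used here come from the org.apache.http packages
CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet get = new HttpGet("http://192.168.100.2:8080");
// try-with-resources closes the response even if reading the body fails
try (CloseableHttpResponse response = httpClient.execute(get)) {
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        // Read the whole response body as a string and print it
        System.out.println(EntityUtils.toString(entity));
    }
}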

2. Simulating a POST request to log in

2.1. How login works

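To start, it helps to understand how a web application recognizes that a user is already logged in. Typically, after a successful login the application stores the user's login information in a server-side session and checks that session on later requests. How, then, does the application match requests from different users and different browsers to the right session on the server?
The answer is a cookie.
When a browser first visits a Java web server, the server by default writes a cookie named JSESSIONID to the browser. On each later request the browser sends that cookie back, and the server reads its value to look up the session that belongs to the user.
So the most important part of simulating a login in a crawler is keeping the cookie.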
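
The lookup described above can be sketched in plain Java with no web framework. This is only an illustration of the idea, not how any particular server implements it: the class name `SessionLookupSketch`, the in-memory map, and the cookie attributes are all assumptions made for the example.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class SessionLookupSketch {
    // Server-side session table: session id -> logged-in user name.
    private final Map<String, String> sessions = new ConcurrentHashMap<>();

    /** Records a successful login and returns the Set-Cookie header value for the response. */
    String login(String username) {
        String sessionId = UUID.randomUUID().toString();
        sessions.put(sessionId, username);
        return "JSESSIONID=" + sessionId + "; Path=/; HttpOnly";
    }

    /** Resolves the user for a later request from its Cookie header, or null if there is no session. */
    String currentUser(String cookieHeader) {
        if (cookieHeader == null) {
            return null;
        }
        for (String pair : cookieHeader.split(";")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length == 2 && kv[0].equals("JSESSIONID")) {
                return sessions.get(kv[1]);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SessionLookupSketch server = new SessionLookupSketch();
        String setCookie = server.login("alice");
        // What a crawler has to do: keep the name=value part and send it back.
        String cookieHeader = setCookie.split(";", 2)[0];
        System.out.println(server.currentUser(cookieHeader)); // the session is found
        System.out.println(server.currentUser(null));         // no cookie, no session
    }
}
```

A request that does not carry the cookie cannot be matched to a session, which is why a crawler that drops cookies appears logged out on every request.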

2.2. HttpClient manages cookies by default

Fortunately, HttpClient 4.x stores and sends cookies automatically by default, so developers rarely need to manage cookies themselves.

// Create a default HttpClient; cookies are not shared between different HttpClient instances
CloseableHttpClient httpClient = HttpClients.createDefault();
// Simulate the login
HttpPost post = new HttpPost("http://192.168.100.2:8080");
List<NameValuePair> params = new ArrayList<NameValuePair>();
params.add(new BasicNameValuePair("username", ""));
params.add(new BasicNameValuePair("password", ""));
params.add(new BasicNameValuePair("roleId", "3"));
post.setEntity(new UrlEncodedFormEntity(params, "UTF-8"));
// Close the login response to release its connection; the cookie stays in the client
httpClient.execute(post).close();
// Fetch the home page after logging in
HttpGet get = new HttpGet("http://192.168.100.2:8080/index");
CloseableHttpResponse response = httpClient.execute(get);
HttpEntity entity = response.getEntity();
if (entity != null) {
    System.out.println(EntityUtils.toString(entity));
}
response.close();

2.3. Managing cookies with HttpClient's CookieStore

// A CookieStore makes cookies easy to manage: even different HttpClient objects
// keep the same user's cookies and session as long as they use the same CookieStore.
CookieStore cookieStore = new BasicCookieStore();
CloseableHttpClient httpclient = HttpClientBuilder.create().setDefaultCookieStore(cookieStore).build();
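
The sharing behaviour can be tried without a server by using the JDK's own `java.net.CookieManager` and `java.net.CookieStore` as stand-ins. These are not the Apache HttpClient classes used above, and the `/login` path and the cookie value `abc123` are made up for the example, so treat this as a sketch of the idea only.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.CookieStore;
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SharedCookieStoreSketch {

    /** Returns the Cookie header values that the manager would attach to a request for uri. */
    static List<String> cookiesFor(CookieManager manager, URI uri) throws IOException {
        Map<String, List<String>> headers = manager.get(uri, new HashMap<>());
        return headers.getOrDefault("Cookie", List.of());
    }

    public static void main(String[] args) throws IOException {
        // One store, two independent managers (stand-ins for two client objects).
        CookieStore shared = new CookieManager().getCookieStore();
        CookieManager first = new CookieManager(shared, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        CookieManager second = new CookieManager(shared, CookiePolicy.ACCEPT_ORIGINAL_SERVER);

        // The first manager records the session cookie from a hypothetical login response.
        URI login = URI.create("http://192.168.100.2:8080/login");
        first.put(login, Map.of("Set-Cookie", List.of("JSESSIONID=abc123; Path=/")));

        // The second manager never saw that response, yet it offers the same cookie.
        URI index = URI.create("http://192.168.100.2:8080/index");
        System.out.println(cookiesFor(second, index));
    }
}
```

Because both managers read and write the same store, a login performed through one is visible to the other, which is the property the `BasicCookieStore` comment describes.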

3. Simulating a file download

Downloading works the same way as fetching HTML with a GET request, except that the response body is read as a stream.

// Simulate the POST login first, then the download
HttpGet httpGet = new HttpGet("http://192.168.100.2:8080/file/download?file_id=24");
CloseableHttpResponse response = httpclient.execute(httpGet);
File file = new File("D:/test.doc");
InputStream in = response.getEntity().getContent();
OutputStream out = new FileOutputStream(file);
int len;
byte[] tmp = new byte[1024];
// Copy the response body to the file in 1 KB chunks until the stream ends
while ((len = in.read(tmp)) != -1) {
    out.write(tmp, 0, len);
}
out.close();
in.close();
response.close();
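
The manual copy loop above leaves the streams open if an exception is thrown partway through. A minimal alternative, using only the JDK, is `Files.copy` inside try-with-resources. The helper name `saveTo` is hypothetical, and the `ByteArrayInputStream` merely stands in for `response.getEntity().getContent()`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class StreamToFileSketch {

    /** Copies the whole stream to target, closing the stream even on failure; returns the bytes written. */
    static long saveTo(InputStream in, Path target) throws IOException {
        try (InputStream body = in) {
            return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the response body stream of a real download.
        InputStream fakeBody = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Path target = Files.createTempFile("download-", ".bin");
        long written = saveTo(fakeBody, target);
        System.out.println(written + " bytes written to " + target);
        Files.delete(target);
    }
}
```

With the real client, the response itself would still need to be closed after the copy finishes.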

4. Simulating a file upload

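There is no ready-made system to upload files to here, so if you want to test this, create your own web project that implements an upload endpoint.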

// Replace the two placeholder strings with the real upload URL and file path
HttpPost httpPost = new HttpPost("upload URL");
FileBody bin = new FileBody(new File("file to upload"));
// "uploadFile" is the form field name the server reads the file from
HttpEntity reqEntity = MultipartEntityBuilder.create()
        .setMode(HttpMultipartMode.BROWSER_COMPATIBLE)
        .addPart("uploadFile", bin)
        .setCharset(CharsetUtils.get("UTF-8")).build();
httpPost.setEntity(reqEntity);
CloseableHttpResponse response = httpclient.execute(httpPost);
String html = EntityUtils.toString(response.getEntity());
response.close();
System.out.println(html);
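
When writing the receiving side for a test, it can help to know roughly what a multipart/form-data body looks like on the wire. The sketch below builds a single-file body by hand using only the JDK. The class name, the fixed boundary, and the `application/octet-stream` part type are assumptions for illustration; the body that `MultipartEntityBuilder` actually produces uses its own generated boundary and may differ in its headers.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class MultipartBodySketch {

    /** Builds a multipart/form-data body that carries one file part. */
    static byte[] build(String boundary, String fieldName, String fileName, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Part header: opening boundary, field name and file name, then an empty line.
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n"
                + "\r\n";
        // The closing boundary carries two extra dashes at the end.
        String tail = "\r\n--" + boundary + "--\r\n";
        append(out, head.getBytes(StandardCharsets.UTF_8));
        append(out, content);
        append(out, tail.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    private static void append(ByteArrayOutputStream out, byte[] bytes) {
        out.write(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        byte[] file = "hi".getBytes(StandardCharsets.UTF_8);
        byte[] body = build("XyZ", "uploadFile", "a.txt", file);
        // The request would also need the header:
        // Content-Type: multipart/form-data; boundary=XyZ
        System.out.print(new String(body, StandardCharsets.UTF_8));
    }
}
```

The field name in the part header (`uploadFile` here) is what the server-side upload handler uses to find the file.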

5. Submitting a form

// url and content are supplied by the caller
HttpClient httpClient = HttpClients.createDefault();
HttpPost post = new HttpPost(url);
// MediaType here is Spring's org.springframework.http.MediaType
post.setHeader(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_FORM_URLENCODED_VALUE);
List<NameValuePair> form = new LinkedList<>();
form.add(new BasicNameValuePair("message", content));
// Pass the charset explicitly; without it HttpClient 4.x encodes the form as ISO-8859-1
UrlEncodedFormEntity urlEncodedFormEntity = new UrlEncodedFormEntity(form, "UTF-8");
post.setEntity(urlEncodedFormEntity);
HttpResponse response = httpClient.execute(post);
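
For reference, the body of an application/x-www-form-urlencoded request is just the percent-encoded `name=value` pairs joined with `&`. The sketch below reproduces that encoding with the JDK's `URLEncoder` (the `Charset` overload needs Java 10 or later). The class name `FormEncodingSketch` and the sample fields are assumptions for the example; this is not HttpClient's own implementation.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class FormEncodingSketch {

    /** Encodes the fields as a UTF-8 form body, keeping the map's iteration order. */
    static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> form = new LinkedHashMap<>();
        form.put("message", "hello world");
        form.put("lang", "\u4e2d\u6587"); // two Chinese characters, written as escapes
        System.out.println(encode(form));
        // prints: message=hello+world&lang=%E4%B8%AD%E6%96%87
    }
}
```

Non-ASCII values such as the second field are where the choice of charset matters, since the server must decode the percent-escapes with the same charset the client used.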