Linux Network Programming: Fetching a Web Page with Sockets

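HTTP is a request/response protocol between a client (the user) and a server (the website), normally carried over TCP. Using a web browser, a web crawler, or some other tool, the client sends an HTTP request to a given port on the server (port 80 by default). This client is called the user agent. The responding server stores resources such as HTML files and images. Let's capture the traffic with Fiddler and see what the browser sends when we visit a site: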

GET http://www.xieyincai.com/hello.html HTTP/1.1
Host: www.xieyincai.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: zh-CN,zh;q=0.8
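
Fiddler works as a proxy between the browser and the server, which is presumably why the captured request line carries the full URL. When a client talks to the server directly, the request line normally carries only the path, and the host name goes in the Host header. The following is a minimal sketch of assembling such a request with snprintf. The helper name build_get_request is hypothetical, and the Connection: close header is an addition that does not appear in the capture above; it asks the server to close the connection once the response has been sent.

```c
#include <stdio.h>

/*
 * Format a minimal HTTP/1.1 GET request into buf.
 * Returns the request length, or -1 if buf is too small.
 * build_get_request is a hypothetical helper used for illustration.
 */
int build_get_request(char *buf, size_t size, const char *host, const char *path)
{
	int n = snprintf(buf, size,
	                 "GET %s HTTP/1.1\r\n"
	                 "Host: %s\r\n"
	                 "Connection: close\r\n" /* assumption: not in the capture */
	                 "\r\n",                 /* blank line ends the headers */
	                 path, host);

	/* snprintf returns the length it wanted to write; detect truncation. */
	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}
```

Calling build_get_request(buf, sizeof(buf), "www.xieyincai.com", "/hello.html") fills buf with a request for the page used in this post. The other headers in the capture (User-Agent, Accept, and so on) are optional for a simple fetch like this one.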

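Once we know what the browser sends, the rest is straightforward: we use socket programming to imitate the browser and send the same kind of request.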

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>

#define BUF_SIZE 4096

int main(void)
{
	int sockfd, nbytes;
	struct sockaddr_in serv_addr;
	struct hostent *host;
	char domain[] = "www.xieyincai.com";
	int port = 80;

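	/* "Connection: close" asks the server to close the connection after
	   the response, so the read() loop below ends at EOF instead of
	   waiting for the server's keep-alive timeout. */
	char request[] = "GET /hello.html HTTP/1.1\r\n"
	                 "Host: www.xieyincai.com\r\n"
	                 "Connection: close\r\n\r\n";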
	char response[BUF_SIZE+1];

	if((host = gethostbyname(domain)) == NULL)
	{
		herror("gethostbyname error"); /* gethostbyname sets h_errno, not errno */
		return 1;
	}

	memset(&serv_addr, 0, sizeof(serv_addr));
	serv_addr.sin_family = AF_INET;
	serv_addr.sin_addr = *(struct in_addr *)host->h_addr;
	serv_addr.sin_port = htons(port);

	if((sockfd = socket(AF_INET, SOCK_STREAM,0)) < 0)
	{
		perror("socket error");
		return 1;
	}

	if(connect(sockfd, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < 0)
	{
		perror("connect error");
		close(sockfd);
		return 1;
	}

	if(write(sockfd, request, strlen(request)) != strlen(request))
	{
		perror("write error");
		close(sockfd);
		return 1;
	}

	/* Read until the server closes the connection (read() returns 0). */
	while((nbytes = read(sockfd, response, BUF_SIZE)) > 0)
	{
		response[nbytes] = '\0';
		printf("%s", response);
	}

	if(nbytes < 0)
	{
		perror("read error");
		close(sockfd);
		return 1;
	}

	close(sockfd);
	return 0;
}
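
The program above resolves the host name with gethostbyname, which POSIX has marked obsolescent in favor of getaddrinfo. As a hedged alternative, here is a minimal sketch of the same lookup done with getaddrinfo. The helper name resolve_ipv4 is hypothetical, and the sketch stays IPv4-only to match the struct sockaddr_in used in the original code.

```c
#include <string.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Resolve host and port to an IPv4 socket address with getaddrinfo().
 * Returns 0 on success, or a getaddrinfo() error code that can be
 * turned into text with gai_strerror().
 * resolve_ipv4 is a hypothetical helper used for illustration.
 */
int resolve_ipv4(const char *host, const char *port, struct sockaddr_in *out)
{
	struct addrinfo hints, *res;
	int rc;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET;       /* IPv4 only, like the program above */
	hints.ai_socktype = SOCK_STREAM; /* TCP */

	rc = getaddrinfo(host, port, &hints, &res);
	if (rc != 0)
		return rc;

	/* Take the first result; a fuller client would try each one in turn. */
	memcpy(out, res->ai_addr, sizeof(*out));
	freeaddrinfo(res);
	return 0;
}
```

With this helper, resolve_ipv4("www.xieyincai.com", "80", &serv_addr) would fill in the family, address, and port in a single call, and serv_addr could then be passed to connect() exactly as before.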

The output is as follows:

[root@fedora Workspace]# gcc -o webcontent webcontent.c
[root@fedora Workspace]# ./webcontent
HTTP/1.1 200 OK
Date: Sun, 08 Jul 2018 06:27:35 GMT
Server: Apache
Last-Modified: Fri, 06 Jul 2018 13:34:09 GMT
ETag: "a00007f-7e-57054b7fcbb3c"
Accept-Ranges: bytes
Vary: Accept-Encoding,User-Agent
Transfer-Encoding: chunked
Content-Type: text/html

<!DOCTYPE html>
<html>
<head>
<title>Hello World</title>
</head>
<body>
<center>Hello World</center>
</body>
</html>
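
The response headers above include Transfer-Encoding: chunked. With that encoding, the raw bytes read from the socket carry the body as a series of chunks, each preceded by its length in hexadecimal and terminated by a zero-length chunk. A program that wants only the page content therefore has to split off the headers and then decode the chunks. The following is a minimal sketch of both steps, assuming the whole response has already been collected into one NUL-terminated buffer. The helper names http_body and decode_chunked are hypothetical, and chunk extensions and trailers are not handled.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a pointer to the body of a NUL-terminated HTTP response,
 * i.e. the text after the first blank line, or NULL if none is found.
 */
const char *http_body(const char *response)
{
	const char *sep = strstr(response, "\r\n\r\n");
	return sep ? sep + 4 : NULL;
}

/*
 * Decode a NUL-terminated chunked body into out (capacity out_size).
 * Returns the decoded length, or -1 on malformed input or if out is
 * too small. Chunk extensions and trailers are ignored in this sketch.
 */
long decode_chunked(const char *body, char *out, size_t out_size)
{
	size_t total = 0;

	if (out_size == 0)
		return -1;

	for (;;) {
		char *end;
		unsigned long len = strtoul(body, &end, 16); /* hex chunk size */
		const char *eol;
		size_t avail;

		if (end == body)
			return -1;                 /* no hex digits found */
		eol = strstr(end, "\r\n");
		if (eol == NULL)
			return -1;
		body = eol + 2;                /* start of the chunk data */

		if (len == 0)
			break;                     /* zero-length chunk ends the body */

		avail = strlen(body);
		if (len > avail || avail - len < 2)
			return -1;                 /* data or trailing CRLF missing */
		if (len >= out_size - total)
			return -1;                 /* no room for data plus NUL */
		if (body[len] != '\r' || body[len + 1] != '\n')
			return -1;

		memcpy(out + total, body, len);
		total += len;
		body += len + 2;               /* skip the data and its CRLF */
	}
	out[total] = '\0';
	return (long)total;
}
```

In the program above, this would mean appending each read() result to a larger buffer instead of printing it right away, then calling http_body followed by decode_chunked once read() returns 0.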

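1 thought on “Linux Network Programming: Fetching a Web Page with Sockets”

  1. 东风破

    I tested the code and it runs; it solved my small problem. I have bookmarked this post. Thanks to the author for sharing.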
