一、前言

在写这篇博客之前，笔者是爬取了深圳市政务公开、政府公报、政府工作、新闻报道、政策解读等所有文件，由于这些网页的大体结构都差不多，本文则主要介绍爬取政务公开文件（包括市政府令、市政府文件、市政府函、市政府办公厅文件、市政府办公厅函、部门规范性文件），网站地址可点击此处查看，也可参考下图所示。

二、获取文件URL列表

1.获取各类文件的URL

如上图所示，可以看到政务公开文件有很多个不同的类别，点击各个类别右侧的更多就能到相应类型的网址，下面是政务公开文件各个类型的网址。

target_urls = [
        'http://www.sz.gov.cn/zfwj/zfwjnew/szfl_139222/index.htm',  # 市政府令
        'http://www.sz.gov.cn/zfwj/zfwjnew/szfwj_139223/index.htm', # 市政府文件
        'http://www.sz.gov.cn/zfwj/zfwjnew/szfh_139224/index.htm',  # 市政府函
        'http://www.sz.gov.cn/zfwj/zfwjnew/szfbgtwj_139225/index.htm',  # 市政府办公厅文件
        'http://www.sz.gov.cn/zfwj/zfwjnew/szfbgth_139226/index.htm',   # 市政府办公厅函
        'http://www.sz.gov.cn/zfwj/zfwjnew/bmgfxwjnew/index.htm'    # 部门规范性文件
]

2.获取每类文件的总页数

如下图所示，可以看到左侧菜单就是政务公开文件的各个类型，正文部分就是相应文件的列表，下方圈注的则是每类文件的总页数。通过点击下一页，可以看到每页的URL其实就是在最初的URL后面加上了页数，比如:

http://www.sz.gov.cn/zfwj/zfwjnew/szfl_139222/index_1.htm
http://www.sz.gov.cn/zfwj/zfwjnew/szfl_139222/index_3.htm
http://www.sz.gov.cn/zfwj/zfwjnew/szfl_139222/index_6.htm

所以首先我们需要获取每类文件的总页数，然后再到每页上获取文件的URL列表。

审查网页元素定位到总页数的所在位置，然后通过解析html的文本获取总的页数，详细的代码可参考下面。首先需要说明get_html_text函数，它的作用是获取网页的html文本内容，返回str类型，该函数是爬取所有网页数据的基础。而get_total_page_num函数的作用则是获取每类文件的总页数。

def get_html_text(url, params=None, proxies=None, total=3):
    """
    get the text of target url
    :param url: target url
    :param params: params dict, like {'param1': 'value1', 'param2': 'value2'}
    :param proxies: proxies dict, like {'http': 'proxy1', 'https': proxy2}, its keys are unchangeable
    :param total: max repeat times if time out
    :return: html text or None
    """
    try:
        headers = {'User-Agent': random.choice(agent)}
        r = requests.get(url, params=params, proxies=proxies, headers=headers, timeout=5)
    except requests.exceptions.ConnectionError:  # connection error
        print('Connection Error:', url)
        return None
    except Exception:
        if total > 0:
            time.sleep(5)  # after 5 seconds continue to craw
            return get_html_text(url, params, proxies, total - 1)
        return None
    # get encodings of the website
    encodings = requests.utils.get_encodings_from_content(r.text)
    if len(encodings) == 0:
        r.encoding = 'gbk'
    else:
        r.encoding = encodings[0]
    return r.text

def get_total_page_num(url, prefix, suffix, delimiter):
    """
    获取文件通知的总页数
    :param url: target url
    :param prefix: 分割的前缀
    :param suffix: 分割的后缀
    :param delimiter: 分隔符
    :return: int page number
        eg. html structure like '<script>createPageHTML(50, 0, "index","htm",849);</script>'
        then prefix can be 'createPageHTML(', suffix can be ');</script>' and delimiter is ','
    """
    html_page = get_html_text(url)
    prev = html_page.split(prefix)[1]  # 截掉前半部分
    suff = prev.split(suffix)[0]  # 截掉后半部分
    page_num = suff.split(delimiter)[0]  # 分割拿到页数
    # print(page_num)
    return int(page_num)

获取到了总的页数之后，需要拼接出每一个网页的URL，就是单纯的字符串操作，直接放代码了。

def get_all_pages_urls(self, url, page_num):
    """
    根据 page number，拼接出所有页数的 url
    :param url: target url
    :param page_num: page number
    :return: page_urls list
    """
    urls = [url]  # add first page url
    url_pre = url.replace('.htm', '')
    for i in range(1, page_num):
        full_url = '%s_%s.htm' % (url_pre, i)
        urls.append(full_url)
    return urls

3.获取每个网页上的文件URL

首先同样需要获取网页的html文本，然后需要通过xpath解析得到每页文件的超链接，这里主要使用的是lxml包中的etree解析HTML，代码如下。由于每个文件的超链接都不一定是完整的URL，所以部分是需要拼接的。

def get_info_urls_of_public(self, url):
    """
    爬取政府文件网页上通知文件的链接
    :param url: target url
    :return: info_urls list
    """
    page_content = get_html_text(url)
    html = etree.HTML(page_content)
    urls = html.xpath('//div[@class="zx_ml_list"]/ul/li/div/a/@href')
    urls = list(map(lambda x: 'http://www.sz.gov.cn' + x.split('..')[-1] if 'http' not in x else x, urls))
    return urls

三、爬取文件内容

1.爬取文件的基本信息和内容

如下图所示，每个文件包含有索引号、分类、发布机构、名称、发布日期、文号和主题词等基本信息，这些都是需要爬取的。

同样通过审查元素定位到它们的位置，然后通过xpath解析得到相应的值，具体的实现代码如下所示。值得注意的是，获取到了这些基本信息后，需要对其进行encode编码，不然后面保存到excel中会出错。

def get_notification_infos(self, url):
    """
    在通知文件页面爬取通知的内容
    :param url: info url
    :return: base_infos， content
    """
    page_content = get_html_text(url)
    html = etree.HTML(page_content)
    # ['索引号', '省份', '城市', '文件类型', '文号', '发布机构', '发布日期', '标题', '主题词']
    index = html.xpath('//div[@class="xx_con"]/p[1]/text()')
    aspect = html.xpath('//div[@class="xx_con"]/p[2]/text()')
    announced_by = html.xpath('//div[@class="xx_con"]/p[3]/text()')
    announced_date = html.xpath('//div[@class="xx_con"]/p[4]/text()')
    title = html.xpath('//div[@class="xx_con"]/p[5]/text()')
    document_num = html.xpath('//div[@class="xx_con"]/p[6]/text()')
    key_word = html.xpath('//div[@class="xx_con"]/p[7]/text()')
    base_infos = [index, ['广东省'], ['深圳市'], aspect, document_num, announced_by, announced_date, title, key_word]
    # encode是为了保存Excel，否则出错
    base_infos = list(map(lambda x: x[0].encode('utf-8') if len(x) > 0 else ' ', base_infos))
    # print('This is basic info: ', base_infos)
    paragraphs = html.xpath('//div[@class="news_cont_d_wrap"]//p')  # 段落信息
    contents = []
    for paragraph in paragraphs:
        contents.append(paragraph.xpath('string(.)').strip())
    contents = '\n'.join(contents)
    # deal with attachments
    attachments = []
    script_str = html.xpath('//div[@class="fjdown"]/script/text()')[0]
    # if there are attachments, then get attachments from script
    if script_str.find('var linkdesc="";') == -1:
        attach_names = script_str.split('var linkdesc="')[-1].split('";')[0].split(';')
        attach_urls = script_str.split('var linkurl="')[-1].split('";')[0].split(';')
        suffix = url.split('/')[-1]
        for k in range(len(attach_urls)):
            attach_url = url.replace(suffix, attach_urls[k].split('./')[-1])
            attach_name = attach_names[k].replace('/', '-').replace('<', '(').replace('>', ')')
            # print(attach_name, attach_url)
            attachments.append([attach_name, attach_url])
    # print(contents)
    return base_infos, contents, attachments

2.下载相应的附件

上面代码中的最后部分是在处理附件信息，因为附件是通过javascript生成的，因此是无法直接通过xpath获取到附件的URL信息的。只能分割html文本字符串，找出附件的URL和附件名信息。下面的代码就是传入附件URL和名字，进行附件下载，而且任何类型的附件都是可以下载的，包括word、excel、pdf、mp4等等。

def download_file(file_url, filename, total=3):
    """
    下载文件
    :param file_url: 文件URL
    :param filename: 文件名
    :param total: 最大下载次数（防止失败）
    :return: bool value
    """
    try:
        res = requests.get(file_url)
        if res.status_code == 200:
            fp = open(filename, mode='wb')
            fp.write(res.content)
            fp.close()
        else:
            print('File Not Found:', file_url)
    except Exception:
        if total > 0:
            return download_file(file_url, filename, total - 1)
        return False
    return True

四、保存结果

1.保存单个文件内容到word

在前面的步骤中，我们已经爬取到了文件的内容，这里就将内容存入到word文档中，并以通知的名称作为保存的word文件名。但是，由于windows中一些特殊符号是无法保存为文件名的，则需要将它们都替换掉（比如”/“、”<”、”>”等），以及文件名不能过长。之前对文件的基本信息进行了encode编码，所以此处需要decode解码。

def write_notification_to_docx(self, save_dir, base_infos, contents, attachments):
    """
    保存通知的详细信息到word文件
    :param save_dir: 保存的路径
    :param base_infos: 通知的基本信息（'索引号', '省份', '城市', '文件类型', '文号', '发布机构', '发布日期', '标题', '主题词'）
    :param contents: 通知内容
    :param attachments: 附件信息
    :return: None
    """
    title = bytes.decode(base_infos[-2]).replace('/', '-').replace(' ', '') \
        .replace('<', '(').replace('>', ')').replace('"', '-')
    # 下载附件
    if len(attachments) > 0:
        # 有附件则要新建目录
        save_dir = save_dir + '/' + title
        os.mkdir(save_dir)
        for attachment in attachments:
            suffix = attachment[1].split('.')[-1]
            attach_name = attachment[0] + '.' + suffix if suffix not in attachment[0] else attachment[0]
            # 不下载mp4视频，速度太慢
            if suffix != 'mp4':
                download_file(file_url=attachment[1], filename=save_dir + '/' + attach_name)
    # 处理文件名过长
    filename = save_dir + '/' + title + '.docx'
    filename = save_dir + '/...' + title[-50:] + '.docx' if len(filename) > 180 else filename
    # 写入word文档
    write_word_file(filename=filename,
                    title=title, data_list=[contents])

def write_word_file(filename, title, data_list):
    """
    write text data to word file
    :param filename: word filename, like '/path/filename.doc' or '/path/filename.docx'
    :param title: title for word file, or maybe None
    :param data_list: paragraphs of word content
    :return: None
    """
    doc = docx.Document()
    if title:
        doc.add_heading(title, 0)
    for data in data_list:
        doc.add_paragraph(data)
    doc.save(filename)

2.保存所有文件基本信息到excel

为了方便查看，笔者这里将所有文件的基本信息都已写入到了excel文件中，主要依托下面的write_excel_file函数，作用是向已经存在的excel文件中增加新的类别sheet，并写入每类文件的基本信息。注意一定先新建要保存的excel文件，不然保存的时候就会提示文件不存在错误。

def run(self, excel_name, sheet_name, save_dir, target_url):
    """
    主程序运行的入口
    :param excel_name: 保存的Excel表名
    :param sheet_name: 保存的Excel表sheet名
    :param save_dir: 保存文本的目录
    :param target_url: 要爬取的home url
    :return: None
    """
    base_info_list = []
    # 以下三个需要根据网页的 html 结构来设置
    prefix = 'createPageHTML('
    suffix = ');'
    delimiter = ','
    num = get_total_page_num(target_url, prefix, suffix, delimiter)
    page_urls = self.get_all_pages_urls(target_url, num)
    for page_url in page_urls:
        # info_urls = self.get_info_urls_of_public(page_url)    # 政府文件用
        info_urls = self.get_info_urls_of_policy(page_url)  # 政策解读用
        for info_url in info_urls:
            print('Get Information From ', info_url)
            if '' != info_url:
                base_infos, contents, attachments = self.get_notification_infos(info_url)
                # 剔除掉没有爬到内容的通知
                if contents.strip() != '':
                    self.write_notification_to_docx(save_dir, base_infos, contents, attachments)
                    base_info_list.append(base_infos)
    # 写入信息到Excel
    if len(base_info_list) > 0:
        columns = ['索引号', '省份', '城市', '文件类型', '文号', '发布机构', '发布日期', '标题', '主题词']
        write_excel_file(filename=excel_name, data_list=base_info_list,
                         sheet_name=sheet_name, columns=columns)

def write_excel_file(filename, data_list, sheet_name='Sheet1', columns=None):
    """
    write list data to excel file, supporting add new sheet
    :param filename: excel filename, like '/path/filename.xlsx' or '/path/filename.xls'
    :param data_list: list data for saving
    :param sheet_name: excel sheet name, default 'Sheet1'
    :param columns: excel column names
    :return: None
    """
    writer = pd.ExcelWriter(filename)
    frame = pd.DataFrame(data_list, columns=columns)
    book = load_workbook(writer.path)
    writer.book = book
    frame.to_excel(excel_writer=writer, sheet_name=sheet_name, index=None)
    writer.close()

最终保存的excel结果可以参看下图，可以下方看到sheet名表示着不同类别的文件，而且文件的基本信息都是很完整的。

五、致谢（含源代码）

最后，非常感谢大家的观看，有问题的小伙伴也可以在下方评论留言。这里笔者已将所有源代码都公开共享到了github上，包括爬取政务公开、政府公报、新闻报道、政策解读等的所有文件，有需要的可点击此处前往下载。

...

00:00

爬取深圳市政府政务公开所有文件