GoldData Tutorial Example: Scraping News Data from an Official Website

Source: cnblogs　　Author: dataman100　　Date: 2019/3/18 8:55:54

Overview

In this section we scrape local news from a government website and load the scraped data into the following two database tables, news_site and news.

source1

news_site (news sources)

Field	Type	Description
id	bigint	Primary key, auto-increment
name	varchar(128)	Source name

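news (news articles)

Field	Type	Description
id	bigint	Primary key, auto-increment
title	varchar(128)	Title
site_id	bigint	Foreign key referencing the id field of news_site
content	text	Content
pub_date	datetime	Publication time
date_created	datetime	Time the record was added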

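It is easy to see that these two tables are related through site_id. How, then, is data written so that the relationship is preserved? We will walk through this step by step.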
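To make the relationship concrete, here is a minimal sketch of the two tables and a "look up the source, then insert the article" write path. It uses Python's built-in sqlite3 purely for illustration and is not GoldData's own mechanism. The helper names get_or_create_site and insert_news, the sample rows, and the NOT NULL UNIQUE constraint on name are assumptions; bigint is mapped to SQLite's INTEGER.

```python
import sqlite3
from datetime import datetime

# Schema mirroring the two tables above (SQLite types; constraints on
# news_site.name are an assumption added for the sketch).
SCHEMA = """
CREATE TABLE news_site (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR(128) NOT NULL UNIQUE
);
CREATE TABLE news (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    title        VARCHAR(128),
    site_id      INTEGER REFERENCES news_site(id),
    content      TEXT,
    pub_date     DATETIME,
    date_created DATETIME
);
"""

def get_or_create_site(conn, name):
    """Return the id of the news_site row named `name`, inserting it if absent."""
    row = conn.execute(
        "SELECT id FROM news_site WHERE name = ?", (name,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute("INSERT INTO news_site (name) VALUES (?)", (name,))
    return cur.lastrowid

def insert_news(conn, title, source, content, pub_date):
    """Insert one article, linking it to its source through site_id."""
    site_id = get_or_create_site(conn, source)
    now = datetime.now().isoformat(sep=" ", timespec="seconds")
    cur = conn.execute(
        "INSERT INTO news (title, site_id, content, pub_date, date_created) "
        "VALUES (?, ?, ?, ?, ?)",
        (title, site_id, content, pub_date, now),
    )
    return cur.lastrowid

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Hypothetical sample rows: two articles from the same source.
first = insert_news(conn, "Sample title A", "Sample Agency", "body A", "2019-02-26 10:00:00")
second = insert_news(conn, "Sample title B", "Sample Agency", "body B", "2019-02-26 11:00:00")
site_count = conn.execute("SELECT COUNT(*) FROM news_site").fetchone()[0]
news_count = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
```

Both articles end up pointing at the same news_site row, so the source name is stored only once.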

Defining the Site and Dataset

define_site

define_dataset

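Defining the Crawl and Extraction Rules

Here we fill in the entry address. If there is more than one entry address, separate them with ASCII (half-width) commas, as shown in the figure below: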

entry
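As a rough illustration of the comma-separated format, the sketch below splits such a string into individual URLs. The function name parse_entry_urls is hypothetical, trimming whitespace around each entry is an assumption, and the second URL is only an illustrative example; GoldData's actual parsing is not documented here.

```python
def parse_entry_urls(raw):
    """Split a comma-separated entry-address string into a list of URLs."""
    # Whitespace trimming and skipping empty parts are assumptions of this sketch.
    return [part.strip() for part in raw.split(",") if part.strip()]

entries = parse_entry_urls(
    "http://sousuo.gov.cn/column/40520/0.htm, http://sousuo.gov.cn/column/40520/1.htm"
)
```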

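Next, when writing a rule, the first step is URL matching, which takes a regular expression. Clicking the "?" icon beside the field opens the corresponding help documentation, as shown in the figure below: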

url_match
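Before saving a rule, it can help to check a URL-matching expression against a sample URL. The sketch below does this with Python's re module, using patterns modeled on the two rules in this article with every literal dot escaped. Whether GoldData requires the pattern to match the whole URL is an assumption here (fullmatch is used), and the function name classify is hypothetical.

```python
import re

# List pages that only yield links, and article pages that yield data.
LIST_PAGE = re.compile(r"http\:\/\/sousuo\.gov\.cn\/column\/40520/\d+\.htm")
ARTICLE_PAGE = re.compile(
    r"http\:\/\/www\.gov\.cn/xinwen/2019-\d{2}/\d{2}/content_\d+\.htm"
)

def classify(url):
    """Return which rule a URL falls under, or None if no rule applies."""
    if LIST_PAGE.fullmatch(url):
        return "list"
    if ARTICLE_PAGE.fullmatch(url):
        return "article"
    return None

list_kind = classify("http://sousuo.gov.cn/column/40520/0.htm")
article_kind = classify("http://www.gov.cn/xinwen/2019-02/26/content_5368539.htm")
```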

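The dataset setting needs some care. If all we need from the matched page is links, set "Is dataset" to No, and the dataset fields must include one named href, as shown in the figure below: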

dataset_href
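The href field in this article's rules reads the attribute abs:href, which appears to follow the jsoup convention of resolving a link against the page address to get an absolute URL. That reading is an assumption. The sketch below shows the equivalent resolution with Python's urllib.parse.urljoin; the relative paths are illustrative only.

```python
from urllib.parse import urljoin

# Address of the page on which the links were found.
page_url = "http://sousuo.gov.cn/column/40520/0.htm"

# A sibling-relative link resolves against the page's directory.
sibling = urljoin(page_url, "1.htm")

# A root-relative link resolves against the page's host.
rooted = urljoin(page_url, "/column/40520/2.htm")

# An already absolute link is returned unchanged.
absolute = urljoin(page_url, "http://www.gov.cn/xinwen/2019-02/26/content_5368539.htm")
```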

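Otherwise, set "Is dataset" to Yes, and the dataset fields must include one named sn. The sn field normally holds a unique value, comparable to the id column of a database table, as shown in the figure below: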

dataset_sn
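The rules in this article derive sn by calling md5 on the page address (baseUri), so the same page always gets the same identifier. A minimal Python equivalent is sketched below; the function name make_sn is hypothetical, and UTF-8 encoding of the URL is an assumption.

```python
import hashlib

def make_sn(url):
    """Derive a stable 32-character hex identifier from a page URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

sn_a = make_sn("http://www.gov.cn/xinwen/2019-02/26/content_5368539.htm")
sn_b = make_sn("http://sousuo.gov.cn/column/40520/0.htm")
```

Because the value depends only on the URL, crawling the same article twice yields the same sn, which is what lets it act as a deduplication key.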

The complete rules are as follows:

[
  {
    __sample: http://sousuo.gov.cn/column/40520/0.htm
    match0: http\:\/\/sousuo\.gov\.cn\/column\/40520/\d+\.htm
    fields0:
    {
      __model: false
      __node: .news_box a
      href:
      {
        expr: a
        attr: abs:href
        js: ""
        __label: 链接
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
    }
  }
  {
    __sample: http://www.gov.cn/xinwen/2019-02/26/content_5368539.htm
    match0: http\:\/\/www\.gov\.cn/xinwen/2019-\d{2}/\d{2}/content_\d+\.htm
    fields0:
    {
      __model: true
      __dataset: news
      __node: ".article "
      sn:
      {
        expr: ""
        attr: ""
        js:
          '''
          var xx=md5(baseUri)
          xx
          '''
        __label: 编号
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      title:
      {
        expr: .article >h1
        attr: ""
        js: ""
        __label: 标题
        __showOnList: true
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      pubdate:
      {
        expr: .pages-date:matchText
        attr: ""
        js: ""
        __label: 发布时间
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      source:
      {
        expr: .pages-date > span.font:contains(来源)
        attr: ""
        js:
          '''
          var xx=source.replace("来源：",'');
          xx
          '''
        __label: 来源
        __showOnList: true
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
      content:
      {
        expr: .pages_content
        attr: ""
        js: ""
        __label: 新闻内容
        __showOnList: false
        __type: ""
        down: "0"
        accessPathJs: ""
        uploadConf: ""
      }
    }
  }
]
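The js snippet on the source field strips the "来源：" ("Source:") prefix from the extracted text. The sketch below mirrors that step in Python for anyone post-processing the data outside GoldData. The function name clean_source and the sample input are assumptions.

```python
def clean_source(raw):
    """Remove the first "来源：" prefix, mirroring the rule's js replace call."""
    # count=1 matches JavaScript's String.replace with a string pattern,
    # which replaces only the first occurrence.
    return raw.replace("来源：", "", 1)

cleaned = clean_source("来源：新华社")
```

The cleaned value is the source name, which corresponds to the name column of news_site.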