
law-site-link-discovery

Hybrid knowledge.webimport

Discovers downloadable file links (PDF/DOCX/…/ZIP) on whitelisted government and law sites, same-origin only. Whitelisted sites: Shanghai, PBOC, NPC, NFRA, gov, CSRC, SAFE, and CAC (cac.gov.cn). For PBOC, CSRC, and SAFE, list pages are followed into their detail pages (followDetailPages, default true). Optional: pagination, nested link depth, collectPdf, and NFRA/gov self-page. maxTotalFetches caps the number of HTTP GETs. No filesystem access.
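The same-origin file-link filtering described above can be pictured with a short sketch. This is not the gene's actual implementation, which is not published on this page. The function name discover_file_links, the helper class, and the extension tuple are hypothetical; the tuple contains only the extensions the description names explicitly, because the full list is elided in the source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Assumption: only the extensions the description spells out; the real list is elided.
FILE_EXTENSIONS = (".pdf", ".docx", ".zip")


class _HrefCollector(HTMLParser):
    """Collects (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def discover_file_links(page_url, html, collect_pdf=True):
    """Return same-origin, file-like links as {"url", "title"} items."""
    parser = _HrefCollector()
    parser.feed(html)
    base = urlsplit(page_url)
    items, seen = [], set()
    for href, title in parser.links:
        absolute = urljoin(page_url, href)
        parts = urlsplit(absolute)
        # Same origin here means same scheme and same host[:port].
        if (parts.scheme, parts.netloc) != (base.scheme, base.netloc):
            continue
        path = parts.path.lower()
        if not path.endswith(FILE_EXTENSIONS):
            continue
        if path.endswith(".pdf") and not collect_pdf:
            continue
        if absolute not in seen:
            seen.add(absolute)
            items.append({"url": absolute, "title": title})
    return items
```

The returned dictionaries use the url and title keys required by the output schema; the host in any call is up to the caller and would still have to pass the whitelist check, which this sketch does not model.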

Author: @sharesummer
v0.3.3, May 7, 2026
Newer version available: v0.4.1

README

No documentation yet.

Gene authors can add a README when publishing.

Phenotype

Input

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| seedUrl | string | yes | HTTP(S) page URL; the host must be in network.allowedDomains (whitelisted government sites). |
| maxPages | integer | no | Maximum number of pages to fetch, including the seed, when followPagination is true. Default 1, max 50. |
| linkScope | string (enum: default, single_page_downloadable) | no | single_page_downloadable: one GET (PBOC still follows detail pages); disables pagination and nested links for other sites. default: the full rules apply. |
| collectPdf | boolean | no | Include .pdf links in items. Default true. |
| maxTotalFetches | integer | no | Maximum number of HTTP GETs per invocation. Default 200, max 200. |
| nestedLinkDepth | integer | no | 0, 1, or 2. For same-origin navigable hrefs, fetch 1 or 2 levels of child pages to collect more file links. Default 0. |
| followPagination | boolean | no | If true, try to follow rel=next, a "next" link, or 下一页 (next page), up to maxPages. May not work on single-page applications (SPAs). |
| followDetailPages | boolean | no | When true (default), PBOC, CSRC, and SAFE article links on the list page are followed into detail pages to collect file links. When false, only file-like hrefs in the list HTML are collected (often few on CSRC/SAFE list pages). |
| maxNestedUrlsPerLevel | integer | no | Cap on the number of distinct URLs followed per nested level. Default 12, max 40. |
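How nestedLinkDepth, maxNestedUrlsPerLevel, and maxTotalFetches might interact can be sketched as a breadth-first crawl with a fetch budget. This is an illustration under stated assumptions, not the gene's code: the crawl function and its fetch and extract_child_urls callbacks are hypothetical, and the per-level cap is assumed to apply to the whole level rather than to each parent page.

```python
def crawl(seed_url, fetch, extract_child_urls,
          nested_link_depth=0, max_nested_urls_per_level=12,
          max_total_fetches=200):
    """Fetch the seed plus up to `nested_link_depth` levels of child pages.

    `fetch(url)` returns a page body; `extract_child_urls(url, body)` returns
    candidate child URLs. Both are supplied by the caller (hypothetical hooks).
    """
    fetched = {}            # url -> body, in fetch order
    seen = {seed_url}
    level = [seed_url]
    for depth in range(nested_link_depth + 1):
        next_level = []
        for url in level:
            # Stop as soon as the GET budget is spent.
            if len(fetched) >= max_total_fetches:
                return fetched
            body = fetch(url)
            fetched[url] = body
            if depth < nested_link_depth:
                for child in extract_child_urls(url, body):
                    # Assumed semantics: the cap covers the whole next level.
                    if child not in seen and len(next_level) < max_nested_urls_per_level:
                        seen.add(child)
                        next_level.append(child)
        level = next_level
    return fetched
```

Passing the fetcher in as a callback keeps the sketch free of network access, so the budget logic can be exercised against an in-memory map of pages.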

Output

| Property | Type | Required |
| --- | --- | --- |
| site | string | yes |
| error | string | no |
| items | array | yes |
Raw JSON Schema

inputSchema

{
  "type": "object",
  "required": [
    "seedUrl"
  ],
  "properties": {
    "seedUrl": {
      "type": "string",
      "description": "HTTP(S) page URL; host must be in network.allowedDomains (whitelisted govt sites)."
    },
    "maxPages": {
      "type": "integer",
      "description": "Max pages to fetch including the seed, when followPagination. Default 1, max 50."
    },
    "linkScope": {
      "enum": [
        "default",
        "single_page_downloadable"
      ],
      "type": "string",
      "description": "single_page: one GET (PBOC still follows details); disables pagination and nested for other sites. default: full rules below."
    },
    "collectPdf": {
      "type": "boolean",
      "description": "Include .pdf in items. Default true."
    },
    "maxTotalFetches": {
      "type": "integer",
      "description": "Max HTTP GETs per invocation. Default 200, max 200."
    },
    "nestedLinkDepth": {
      "type": "integer",
      "description": "0, 1, or 2. Same-origin navigable hrefs: fetch 1 or 2 levels of child pages to collect more file links. Default 0."
    },
    "followPagination": {
      "type": "boolean",
      "description": "If true, try to follow rel=next, next link, 下一页; maxPages; SPA may not work."
    },
    "followDetailPages": {
      "type": "boolean",
      "description": "When true (default): PBOC, CSRC, SAFE follow article links from the list into detail pages to collect file links. When false: only file-like hrefs on the list HTML (often few on CSRC/SAFE list pages)."
    },
    "maxNestedUrlsPerLevel": {
      "type": "integer",
      "description": "Cap of distinct URLs to follow per nested level. Default 12, max 40."
    }
  }
}
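The schema above declares types and the linkScope enum, but the numeric limits appear only in the description strings, not as maximum keywords. A caller who wants to catch out-of-range values before invoking the gene could use a hand-written check along these lines. The validate_input function is a hypothetical helper, and the example URL is a placeholder rather than a confirmed whitelisted address.

```python
from urllib.parse import urlsplit

# Upper bounds copied from the description strings (not enforced by the schema itself).
INTEGER_LIMITS = {
    "maxPages": 50,
    "maxTotalFetches": 200,
    "nestedLinkDepth": 2,
    "maxNestedUrlsPerLevel": 40,
}
BOOLEAN_FIELDS = ("collectPdf", "followPagination", "followDetailPages")
LINK_SCOPES = ("default", "single_page_downloadable")


def validate_input(payload):
    """Return a list of problems found in an input payload (empty if none)."""
    errors = []
    seed = payload.get("seedUrl")
    if not isinstance(seed, str):
        errors.append("seedUrl is required and must be a string")
    elif urlsplit(seed).scheme not in ("http", "https"):
        errors.append("seedUrl must be an HTTP(S) URL")
    for name, upper in INTEGER_LIMITS.items():
        if name not in payload:
            continue
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append(f"{name} must be an integer")
        elif value > upper or (name == "nestedLinkDepth" and value < 0):
            errors.append(f"{name} is outside the documented range (max {upper})")
    for name in BOOLEAN_FIELDS:
        if name in payload and not isinstance(payload[name], bool):
            errors.append(f"{name} must be a boolean")
    if "linkScope" in payload and payload["linkScope"] not in LINK_SCOPES:
        errors.append("linkScope must be one of: " + ", ".join(LINK_SCOPES))
    return errors


# Placeholder URL for illustration only.
example = {"seedUrl": "https://www.example.gov.cn/list.html", "nestedLinkDepth": 1}
problems = validate_input(example)
```

The check does not test the host against network.allowedDomains, since that list is not part of the published schema.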

outputSchema

{
  "type": "object",
  "required": [
    "site",
    "items"
  ],
  "properties": {
    "site": {
      "type": "string"
    },
    "error": {
      "type": "string"
    },
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "required": [
          "url",
          "title"
        ],
        "properties": {
          "url": {
            "type": "string"
          },
          "title": {
            "type": "string"
          }
        }
      }
    }
  }
}
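A consumer of this output might read it as sketched below. The parse_result helper is hypothetical. Because the page does not say whether a non-empty error can accompany partial items, the sketch returns both and leaves that decision to the caller.

```python
import json


def parse_result(raw_json):
    """Parse a result document into (unique (url, title) pairs, error or None)."""
    result = json.loads(raw_json)
    for key in ("site", "items"):
        if key not in result:
            raise ValueError(f"missing required field: {key}")
    seen, unique = set(), []
    for item in result["items"]:
        # Drop repeated URLs while keeping first-seen order.
        if item["url"] not in seen:
            seen.add(item["url"])
            unique.append((item["url"], item["title"]))
    return unique, result.get("error")
```

Deduplicating by URL is a choice made for the example; nothing on this page states whether the gene already returns distinct items.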