url-extractor

Native text.extract

Extracts and categorizes all URLs from text content, with optional validation of link accessibility.

README

# url-extractor

A Native Gene that extracts and categorizes all URLs from text content.

## Usage

```bash
rotifer test url-extractor --input '{
"text": "Visit https://rotifer.dev for docs. Contact [email protected] for help.",
"includeEmails": true
}'
```
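
Hand-quoting the JSON payload breaks as soon as the text contains a single quote. The sketch below builds the same command shown above with safe quoting. It relies only on the `rotifer test url-extractor --input` form from this README; the helper name `build_test_command` is hypothetical.

```python
import json
import shlex


def build_test_command(text, include_emails=False):
    """Return a shell-safe command line for the usage example in this README."""
    payload = {"text": text, "includeEmails": include_emails}
    # json.dumps escapes the payload; shlex.quote protects it from the shell.
    return "rotifer test url-extractor --input " + shlex.quote(json.dumps(payload))


print(build_test_command("It's at https://rotifer.dev", include_emails=True))
```

The printed string can be pasted into a POSIX shell as is.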

## Features

- Extract HTTP/HTTPS/FTP URLs from any text
- Optional email address extraction
- Automatic deduplication
- Domain categorization
- Character position tracking for each URL
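
The features above can be approximated in a few lines of Python. This is a minimal sketch, not the gene's actual implementation: the regular expressions, the trailing-punctuation rule, and the function name `extract_urls` are all assumptions. The source does not say whether `totalFound` is counted before or after deduplication; the sketch counts after.

```python
import re
from urllib.parse import urlsplit

# Assumed patterns; the gene's real matching rules are not documented.
URL_RE = re.compile(r"\b(?:https?|ftp)://[^\s<>\"')\[\]]+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
TRAILING_PUNCT = ".,;:!?"


def extract_urls(text, include_emails=False, deduplicate=True):
    urls, seen = [], set()
    for match in URL_RE.finditer(text):
        # Drop sentence punctuation that the pattern swallowed at the end.
        url = match.group(0).rstrip(TRAILING_PUNCT)
        if deduplicate and url in seen:
            continue
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # skip strings that urlsplit rejects
        seen.add(url)
        urls.append({
            "url": url,
            "domain": parts.hostname or "",
            "position": match.start(),  # character offset in the source text
            "protocol": parts.scheme.lower(),
        })
    result = {
        "urls": urls,
        "totalFound": len(urls),
        "uniqueDomains": sorted({u["domain"] for u in urls}),
    }
    if include_emails:
        emails = [m.group(0) for m in EMAIL_RE.finditer(text)]
        # dict.fromkeys removes duplicates while keeping first-seen order.
        result["emails"] = list(dict.fromkeys(emails)) if deduplicate else emails
    return result
```

With deduplication on, a repeated URL is reported once, at the position of its first occurrence.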

## Input

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `text` | string | Yes | Text to extract URLs from |
| `includeEmails` | boolean | No | Also extract emails (default: false) |
| `deduplicate` | boolean | No | Remove duplicates (default: true) |
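
A caller can apply the documented defaults before invoking the gene. The helper below is a hypothetical sketch (`normalize_input` is not part of any Rotifer API); it only encodes the field names, types, and defaults from the table above.

```python
def normalize_input(payload):
    """Check the documented fields and fill in their defaults."""
    if not isinstance(payload.get("text"), str):
        raise ValueError("'text' is required and must be a string")
    normalized = {
        "text": payload["text"],
        "includeEmails": payload.get("includeEmails", False),  # default: false
        "deduplicate": payload.get("deduplicate", True),       # default: true
    }
    for flag in ("includeEmails", "deduplicate"):
        if not isinstance(normalized[flag], bool):
            raise ValueError(f"'{flag}' must be a boolean")
    return normalized
```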

## Output

| Field | Type | Description |
|-------|------|-------------|
| `urls` | array | Extracted URLs with protocol, domain, position |
| `emails` | array | Extracted emails (if enabled) |
| `totalFound` | number | Total URL count |
| `uniqueDomains` | string[] | List of unique domains found |
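
To show how a consumer might use these fields, here is a small sketch that groups the `urls` entries by `domain`. The `sample` value is hand-written to match the table above, not captured from a real run, and `group_by_domain` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_domain(result):
    """Map each domain to the URLs found under it, in order of appearance."""
    grouped = defaultdict(list)
    for entry in result.get("urls", []):
        grouped[entry["domain"]].append(entry["url"])
    return dict(grouped)


# Illustrative sample only; positions are made up.
sample = {
    "urls": [
        {"url": "https://rotifer.dev", "domain": "rotifer.dev",
         "position": 6, "protocol": "https"},
        {"url": "https://rotifer.dev/docs", "domain": "rotifer.dev",
         "position": 30, "protocol": "https"},
        {"url": "ftp://files.example.org/pub", "domain": "files.example.org",
         "position": 60, "protocol": "ftp"},
    ],
    "totalFound": 3,
    "uniqueDomains": ["rotifer.dev", "files.example.org"],
}

by_domain = group_by_domain(sample)
```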

Phenotype

Input

| Property | Type | Req | Description |
|----------|------|-----|-------------|
| `text` | string | Yes | Text content to extract URLs from |
| `deduplicate` | boolean = true | No | Remove duplicate URLs |
| `includeEmails` | boolean = false | No | Also extract email addresses |

Output

| Property | Type | Description |
|----------|------|-------------|
| `urls` | array | Extracted URLs with metadata |
| `emails` | array | Extracted email addresses (if `includeEmails` is true) |
| `totalFound` | number | Total URLs found |
| `uniqueDomains` | array | List of unique domains |

Raw JSON Schema

inputSchema

{
  "type": "object",
  "required": [
    "text"
  ],
  "properties": {
    "text": {
      "type": "string",
      "description": "Text content to extract URLs from"
    },
    "deduplicate": {
      "type": "boolean",
      "default": true,
      "description": "Remove duplicate URLs"
    },
    "includeEmails": {
      "type": "boolean",
      "default": false,
      "description": "Also extract email addresses"
    }
  }
}

outputSchema

{
  "type": "object",
  "properties": {
    "urls": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "url": {
            "type": "string"
          },
          "domain": {
            "type": "string"
          },
          "position": {
            "type": "number",
            "description": "Character offset in source text"
          },
          "protocol": {
            "type": "string",
            "description": "http, https, ftp, etc."
          }
        }
      },
      "description": "Extracted URLs with metadata"
    },
    "emails": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "Extracted email addresses (if includeEmails is true)"
    },
    "totalFound": {
      "type": "number",
      "description": "Total URLs found"
    },
    "uniqueDomains": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "List of unique domains"
    }
  }
}
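
The output schema declares no `required` fields, so a consumer may want a light shape check before trusting a result. The sketch below hand-codes the types from the schema above using only the standard library; it is an assumption-laden stand-in for a full JSON Schema validator, and `check_output` is a hypothetical name.

```python
def _is_number(value):
    # JSON Schema "number" excludes booleans, but Python's bool subclasses int.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_str(value):
    return isinstance(value, str)


URL_FIELDS = {"url": _is_str, "domain": _is_str,
              "position": _is_number, "protocol": _is_str}


def check_output(result):
    """Return a list of problems; an empty list means the shape matches."""
    if not isinstance(result, dict):
        return ["result must be an object"]
    problems = []
    urls = result.get("urls", [])
    if not isinstance(urls, list):
        problems.append("urls must be an array")
        urls = []
    for i, entry in enumerate(urls):
        if not isinstance(entry, dict):
            problems.append(f"urls[{i}] must be an object")
            continue
        # Every property is optional, so only check the ones that are present.
        for name, ok in URL_FIELDS.items():
            if name in entry and not ok(entry[name]):
                problems.append(f"urls[{i}].{name} has the wrong type")
    for name in ("emails", "uniqueDomains"):
        value = result.get(name, [])
        if not isinstance(value, list) or not all(_is_str(v) for v in value):
            problems.append(f"{name} must be an array of strings")
    if "totalFound" in result and not _is_number(result["totalFound"]):
        problems.append("totalFound must be a number")
    return problems
```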

Arena History

| Date | Fitness | Safety | Calls |
|------|---------|--------|-------|
| Mar 17 | 0.7730 | 0.92 | 1 |