url-extractor

Native text.extract

Extracts and categorizes all URLs from text content, with optional validation of link accessibility.

README

# url-extractor

A Native Gene that extracts and categorizes all URLs from text content.

## Usage

```bash
rotifer test url-extractor --input '{
"text": "Visit https://rotifer.dev for docs. Contact [email protected] for help.",
"includeEmails": true
}'
```

## Features

- Extract HTTP/HTTPS/FTP URLs from any text
- Optional email address extraction
- Automatic deduplication
- Domain categorization
- Character position tracking for each URL
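
The features above can be illustrated with a minimal Python sketch. This is not the gene's actual implementation, which the README does not show; the regular expression, the `extract_urls` name, and the trailing-punctuation rule are all assumptions chosen for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumed pattern: an http, https, or ftp scheme followed by characters
# that are not whitespace or angle brackets.
URL_PATTERN = re.compile(r"\b(?:https?|ftp)://[^\s<>]+", re.IGNORECASE)

# Assumed rule: punctuation at the end of a match belongs to the sentence.
TRAILING_PUNCTUATION = ".,;:!?)]}\"'"


def extract_urls(text, deduplicate=True):
    """Return one record per URL with url, domain, position, and protocol."""
    records = []
    seen = set()
    for match in URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if deduplicate and url in seen:
            continue  # keep only the first occurrence
        try:
            parts = urlsplit(url)
            domain = parts.hostname or ""
        except ValueError:
            continue  # skip strings urlsplit rejects, e.g. a malformed IPv6 host
        seen.add(url)
        records.append({
            "url": url,
            "domain": domain,
            "position": match.start(),  # character offset in the source text
            "protocol": parts.scheme.lower(),
        })
    return records
```

Calling `extract_urls("Visit https://rotifer.dev for docs.")` returns a single record whose `position` is the offset of the `h` in `https`.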

## Input

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `text` | string | Yes | Text to extract URLs from |
| `includeEmails` | boolean | No | Also extract emails (default: false) |
| `deduplicate` | boolean | No | Remove duplicates (default: true) |

## Output

| Field | Type | Description |
|-------|------|-------------|
| `urls` | array | Extracted URLs with protocol, domain, position |
| `emails` | array | Extracted emails (if enabled) |
| `totalFound` | number | Total URL count |
| `uniqueDomains` | string[] | List of unique domains found |
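
As a rough illustration of how the `emails` array might be filled when `includeEmails` is true, the sketch below uses a deliberately simple pattern. The pattern and the `extract_emails` name are assumptions; the gene's real matching rules are not documented here, and this pattern does not cover every valid address.

```python
import re

# Assumed, simplified pattern: local part, "@", domain labels, and a
# top-level domain of at least two letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text, deduplicate=True):
    """Return email addresses in order of first appearance."""
    matches = EMAIL_PATTERN.findall(text)
    if deduplicate:
        # dict.fromkeys drops repeats while preserving insertion order
        return list(dict.fromkeys(matches))
    return matches
```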

Phenotype

Input

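| Property | Type | Required | Description |
|----------|------|----------|-------------|
| `text` | string | Yes | Text content to extract URLs from |
| `deduplicate` | boolean (default: true) | No | Remove duplicate URLs |
| `includeEmails` | boolean (default: false) | No | Also extract email addresses |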

Output

| Property | Type | Description |
|----------|------|-------------|
| `urls` | array | Extracted URLs with metadata |
| `emails` | array | Extracted email addresses (if `includeEmails` is true) |
| `totalFound` | number | Total URLs found |
| `uniqueDomains` | array | List of unique domains |

Raw JSON Schema

inputSchema

{
  "type": "object",
  "required": [
    "text"
  ],
  "properties": {
    "text": {
      "type": "string",
      "description": "Text content to extract URLs from"
    },
    "deduplicate": {
      "type": "boolean",
      "default": true,
      "description": "Remove duplicate URLs"
    },
    "includeEmails": {
      "type": "boolean",
      "default": false,
      "description": "Also extract email addresses"
    }
  }
}
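
In JSON Schema, `default` is an annotation, so validators generally do not fill in missing values on their own. A caller or wrapper that wants the documented defaults has to apply them explicitly. The sketch below shows one way to do that; `normalize_input` is a hypothetical helper, not part of the gene or of any Rotifer API.

```python
def normalize_input(payload):
    """Check the payload against the input schema and apply its defaults."""
    if not isinstance(payload, dict):
        raise TypeError("input must be a JSON object")
    if "text" not in payload:
        raise ValueError("missing required field: text")
    if not isinstance(payload["text"], str):
        raise TypeError("text must be a string")

    normalized = {"text": payload["text"]}
    # Defaults copied from the schema: deduplicate=true, includeEmails=false.
    for name, default in (("deduplicate", True), ("includeEmails", False)):
        value = payload.get(name, default)
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be a boolean")
        normalized[name] = value
    return normalized
```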

outputSchema

{
  "type": "object",
  "properties": {
    "urls": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "url": {
            "type": "string"
          },
          "domain": {
            "type": "string"
          },
          "position": {
            "type": "number",
            "description": "Character offset in source text"
          },
          "protocol": {
            "type": "string",
            "description": "http, https, ftp, etc."
          }
        }
      },
      "description": "Extracted URLs with metadata"
    },
    "emails": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "Extracted email addresses (if includeEmails is true)"
    },
    "totalFound": {
      "type": "number",
      "description": "Total URLs found"
    },
    "uniqueDomains": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "List of unique domains"
    }
  }
}
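
To show how the summary fields relate to the `urls` array, here is a small sketch that assembles an object matching the output schema. It assumes that `totalFound` counts the URL records actually returned (after any deduplication) and that `emails` is omitted when email extraction is off; the schema does not state either point. `build_output` is a hypothetical name.

```python
def build_output(url_records, emails=None):
    """Assemble an output object from already-extracted URL records."""
    # Unique domains in order of first appearance; empty domains are skipped.
    unique_domains = list(
        dict.fromkeys(r["domain"] for r in url_records if r["domain"])
    )
    output = {
        "urls": url_records,
        "totalFound": len(url_records),  # assumption: count of returned records
        "uniqueDomains": unique_domains,
    }
    if emails is not None:
        output["emails"] = emails  # assumption: present only when requested
    return output
```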

Arena History

| Date | Fitness | Safety | Calls |
|------|---------|--------|-------|
| Mar 17 | 0.7730 | 0.92 | 1 |