
Add Firecrawl integration for website content crawling #6


Open · wants to merge 1 commit into main

Conversation

@malinosqui (Member) commented Jun 9, 2025

Description



This pull request introduces a new integration with the Firecrawl API to enhance the web crawling capabilities of the kodustech/onlook project. The key changes include:

  1. Chat API Enhancements: A new processUrls function is added to the chat API endpoint in apps/web/client/src/app/api/chat/route.ts. It processes URLs embedded in user messages using an AI model, modifying the last user message before further processing. New imports for URL extraction and AI tooling are included, along with logic in the POST handler to invoke the function (see the first sketch after this list).
  2. AI Tool Handling: In apps/web/client/src/app/project/[id]/_hooks/use-chat.tsx, handling for a new AI tool, CRAWL_URL_TOOL_NAME, is introduced. This includes importing the necessary constants and adding a conditional block to the tool dispatcher, currently as a placeholder.
  3. Crawler Service Implementation: A new CrawlerService module is added in packages/ai/src/tools/crawler.ts for web crawling via the Firecrawl API. It includes TypeScript interfaces for crawl options, API responses, and crawled content, plus a validation function that checks the integrity of crawler responses. The CrawlerService class is implemented as a singleton that handles API key initialization and exposes an async crawlUrl method with error handling (sketched after the summary below).
  4. Utility Function for URL Extraction: A utility function extractUrls is introduced in packages/ai/src/tools/helpers.ts to parse text, identify and validate various forms of URLs, and return a unique set of valid URLs (see the first sketch after this list).
  5. AI Toolset Expansion: A new crawlUrlTool is added to the AI toolset in packages/ai/src/tools/index.ts, defining its parameters with Zod and specifying options such as crawl limits and content formats (sketched after the summary below).
  6. Chat Message Context Extension: The chat message context system in packages/models/src/chat/message/context.ts gains a new type, LinkMessageContext, to represent hyperlinks. This involves updating the MessageContextType enum and including the new type in the ChatMessageContext union type (covered in the same sketch as item 5).
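
As a rough illustration of items 1 and 4, here is a minimal sketch of an extractUrls helper. The regex, the normalization step, and the exact return shape are assumptions for illustration; the actual implementation in packages/ai/src/tools/helpers.ts may differ.

    // Find URL-like substrings in free text, validate them with the URL
    // constructor, and return a de-duplicated list of valid URLs.
    export function extractUrls(text: string): string[] {
        // Match http(s) URLs as well as bare "www." forms (this regex is an assumption).
        const candidates = text.match(/\bhttps?:\/\/[^\s<>"')]+|\bwww\.[^\s<>"')]+/gi) ?? [];
        const valid = new Set<string>();
        for (const candidate of candidates) {
            // Prefix bare "www." matches so the URL constructor can parse them.
            const normalized = /^www\./i.test(candidate) ? `https://${candidate}` : candidate;
            try {
                valid.add(new URL(normalized).toString()); // throws on invalid input
            } catch {
                // Not a parseable URL; skip it.
            }
        }
        return [...valid];
    }

A processUrls function in route.ts would then call extractUrls on the last user message and pass any hits to the crawl tool; the exact wiring is not reproduced here.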

These changes collectively enhance the project's ability to process and analyze web content through improved URL handling and web crawling functionalities.
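
To make item 3 concrete, here is a hedged sketch of the singleton shape it describes. The Firecrawl endpoint path, request body, response fields, and environment variable name are assumptions for illustration; the real packages/ai/src/tools/crawler.ts may differ.

    interface CrawlOptions {
        limit?: number;                       // maximum number of pages to crawl
        formats?: Array<'html' | 'markdown'>; // desired content formats
    }

    interface CrawledContent {
        html?: string;
        markdown?: string;
    }

    export class CrawlerService {
        private static instance: CrawlerService | null = null;

        private constructor(private readonly apiKey: string) {}

        // Lazily create the single shared instance, reading the API key once.
        static getInstance(): CrawlerService {
            if (!CrawlerService.instance) {
                const apiKey = process.env.FIRECRAWL_API_KEY; // env var name is an assumption
                if (!apiKey) {
                    throw new Error('Firecrawl API key is not configured');
                }
                CrawlerService.instance = new CrawlerService(apiKey);
            }
            return CrawlerService.instance;
        }

        async crawlUrl(url: string, options: CrawlOptions = {}): Promise<CrawledContent[]> {
            const res = await fetch('https://api.firecrawl.dev/v1/crawl', {
                method: 'POST',
                headers: {
                    Authorization: `Bearer ${this.apiKey}`,
                    'Content-Type': 'application/json',
                },
                body: JSON.stringify({ url, ...options }),
            });
            if (!res.ok) {
                throw new Error(`Firecrawl request failed with status ${res.status}`);
            }
            const payload = (await res.json()) as { data?: CrawledContent[] };
            return payload.data ?? [];
        }
    }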
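
Items 5 and 6 are small enough to sketch together. CRAWL_URL_TOOL_NAME, CRAWL_URL_TOOL_PARAMETERS, and the urls field appear in the diff further down; the limit and formats options, the LINK enum member, and the exact LinkMessageContext fields are assumptions.

    import { tool } from 'ai';
    import { z } from 'zod';

    export const CRAWL_URL_TOOL_NAME = 'crawl_url';
    export const CRAWL_URL_TOOL_PARAMETERS = z.object({
        urls: z.array(z.string()).describe('Array of URLs to crawl'),
        // Option names below are assumptions based on the description above.
        limit: z.number().optional().describe('Maximum number of pages to crawl'),
        formats: z.array(z.enum(['html', 'markdown'])).optional().describe('Content formats to return'),
    });

    export const crawlUrlTool = tool({
        description: 'Crawl one or more URLs and return their content',
        parameters: CRAWL_URL_TOOL_PARAMETERS,
    });

    // Item 6: extending the chat message context with a hyperlink type.
    export enum MessageContextType {
        // ...existing members elided...
        LINK = 'link', // member name and value are assumptions
    }

    export interface LinkMessageContext {
        type: MessageContextType.LINK;
        url: string;
    }

    // The ChatMessageContext union is then widened, e.g.:
    // export type ChatMessageContext = /* existing members */ | LinkMessageContext;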

Co-Authored-By: kiet@onlook.dev <kiet@onlook.dev>

kodus-qa bot commented Jun 9, 2025

Code Review Completed! 🔥

The code review was successfully completed based on your current configurations.

Kody Guide: Usage and Configuration
Interacting with Kody
  • Request a Review: Ask Kody to review your PR manually by adding a comment with the @kody start-review command at the root of your PR.

  • Provide Feedback: Help Kody learn and improve by reacting to its comments with a 👍 for helpful suggestions or a 👎 if improvements are needed.

Current Kody Configuration
Review Options

The following review options are enabled:

  • Security
  • Code Style
  • Kody Rules
  • Refactoring
  • Error Handling
  • Maintainability
  • Potential Issues
  • Documentation And Comments
  • Performance And Optimization
  • Breaking Changes

Access your configuration settings here.

@@ -10,6 +10,8 @@ import {
     ONLOOK_INSTRUCTIONS_TOOL_NAME,
     READ_FILES_TOOL_NAME,
     READ_FILES_TOOL_PARAMETERS,
+    CRAWL_URL_TOOL_NAME,
+    CRAWL_URL_TOOL_PARAMETERS,

kody code-review Maintainability critical

// If CRAWL_URL_TOOL_PARAMETERS are not used by the CRAWL_URL_TOOL_NAME handler,
// this import might be unnecessary. Ensure it's used if intended, otherwise consider removing:
// CRAWL_URL_TOOL_PARAMETERS,

Unused or improperly utilized CRAWL_URL_TOOL_PARAMETERS in apps/web/client/src/app/project/[id]/_hooks/use-chat.tsx. The parameter is imported but not used in the handler logic (lines 14, 66-67).

This issue appears in multiple locations:

  • apps/web/client/src/app/project/[id]/_hooks/use-chat.tsx: Lines 14-14
  • apps/web/client/src/app/project/[id]/_hooks/use-chat.tsx: Lines 66-67
    Please either use CRAWL_URL_TOOL_PARAMETERS in the handler logic or remove the import to avoid clutter.


content: content,
},
],
tools: { [CRAWL_URL_TOOL_NAME]: chatToolSet[CRAWL_URL_TOOL_NAME] as any },

kody code-review Maintainability medium

tools: { [CRAWL_URL_TOOL_NAME]: chatToolSet[CRAWL_URL_TOOL_NAME] as any /* TODO: Replace 'any' with the specific type for this tool if known. The original suggestion's 'SpecificToolType' was a placeholder. */ },

Multiple instances of using any type in TypeScript code, which bypasses type checking and reduces maintainability. In apps/web/client/src/app/api/chat/route.ts, lines 26 and 46 use as any and (m: any) respectively.

This issue appears in multiple locations:

  • apps/web/client/src/app/api/chat/route.ts: Lines 26-26
  • apps/web/client/src/app/api/chat/route.ts: Lines 46-46
    Please replace any with specific types or interfaces to maintain type safety. If the type is complex or dynamic, add a comment explaining the rationale for using any.

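
One hedged way to remove the cast, assuming chatToolSet is an ordinary typed object (the import path below is illustrative, not the repository's actual layout):

    // Assumed import path for illustration.
    import { chatToolSet, CRAWL_URL_TOOL_NAME } from '@onlook/ai';

    // Deriving the entry's type from the toolset itself removes the need for `as any`;
    // `satisfies` keeps the literal key while still type-checking against the toolset.
    const tools = {
        [CRAWL_URL_TOOL_NAME]: chatToolSet[CRAWL_URL_TOOL_NAME],
    } satisfies Partial<typeof chatToolSet>;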

Comment on lines +66 to +67
} else if (toolName === CRAWL_URL_TOOL_NAME) {
return 'Web content crawled successfully.';

kody code-review Refactoring medium

        } else if (toolName === CRAWL_URL_TOOL_NAME) {
            const params = toolCall.args as z.infer<typeof CRAWL_URL_TOOL_PARAMETERS>; // Assuming parameters are parsed and used
            return await handleCrawlUrlTool(params, editorEngine); // Call to a new dedicated handler function
        }

The logic for CRAWL_URL_TOOL_NAME is implemented inline within handleToolCall. Other tools in this function (e.g., LIST_FILES_TOOL_NAME, READ_FILES_TOOL_NAME) delegate to separate handler functions like handleListFilesTool. For consistency, improved readability, and easier maintenance (especially if the crawl logic becomes complex), consider extracting the CRAWL_URL_TOOL_NAME handling into its own dedicated async function, similar to the pattern used for other tools.

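
A sketch of what that dedicated handler could look like, mirroring the handleListFilesTool pattern. The handler name, the EditorEngine parameter, and the CrawlerService calls are assumptions based on the PR description, and the import paths are illustrative:

    import { z } from 'zod';
    // Assumed import paths for illustration.
    import { CRAWL_URL_TOOL_PARAMETERS } from '@onlook/ai';
    import { CrawlerService } from '@onlook/ai/tools/crawler';
    import type { EditorEngine } from '@/components/store/editor/engine';

    async function handleCrawlUrlTool(
        args: z.infer<typeof CRAWL_URL_TOOL_PARAMETERS>,
        editorEngine: EditorEngine, // kept for signature parity with the other handlers
    ): Promise<string> {
        const crawler = CrawlerService.getInstance();
        // Crawl each requested URL; failures propagate to the dispatcher's error handling.
        const results = await Promise.all(args.urls.map((url) => crawler.crawlUrl(url)));
        return `Crawled ${results.length} URL(s) successfully.`;
    }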

Comment on lines +52 to +59
const firstItem = response.data[0];
return (
typeof firstItem === 'object' &&
firstItem !== null &&
('html' in firstItem || 'markdown' in firstItem) &&
(firstItem.html === undefined || typeof firstItem.html === 'string') &&
(firstItem.markdown === undefined || typeof firstItem.markdown === 'string')
);

kody code-review Potential Issues high

    // This loop executes only if response.data is not empty,
    // due to the check on lines 48-50.
    for (const item of response.data) {
        if (
            !(
                typeof item === 'object' &&
                item !== null &&
                ('html' in item || 'markdown' in item) &&
                (item.html === undefined || typeof item.html === 'string') &&
                (item.markdown === undefined || typeof item.markdown === 'string')
            )
        ) {
            return false; // Found an invalid item
        }
    }
    return true; // All items are valid

The validateCrawlerResponse function currently only validates the structure of the first item in the response.data array. If response.data contains multiple items, subsequent items are not validated. This could lead to runtime errors if other items in the array do not conform to the expected { html?: string; markdown?: string; } structure, as the function would incorrectly return true. Consider iterating over all items in response.data to ensure each one meets the validation criteria.

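
Folding that loop into a complete function for clarity. This is a sketch: the CrawlerResponse shape is inferred from the snippet above, and only the data field is modeled.

    interface CrawlerResponse {
        data: Array<{ html?: unknown; markdown?: unknown }>;
    }

    function validateCrawlerResponse(response: CrawlerResponse): boolean {
        // Mirrors the existing non-empty check on lines 48-50.
        if (!Array.isArray(response.data) || response.data.length === 0) {
            return false;
        }
        // Validate every item, not just the first one.
        return response.data.every(
            (item) =>
                typeof item === 'object' &&
                item !== null &&
                ('html' in item || 'markdown' in item) &&
                (item.html === undefined || typeof item.html === 'string') &&
                (item.markdown === undefined || typeof item.markdown === 'string'),
        );
    }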

}
return response;
} catch (error) {
console.error('Error during crawling:', error);

kody code-review Maintainability medium

            // Let the caller handle logging and any error wrapping
            throw error;

The crawlUrl method currently logs errors to console.error before re-throwing them. While helpful for debugging, this direct logging can make the service harder to integrate into applications with specific logging strategies. Consider removing the console.error call and letting the caller decide how to handle/log the thrown error. This improves modularity and makes the service more versatile.


@@ -30,8 +31,44 @@ export const onlookInstructionsTool = tool({
     parameters: z.object({}),
 });
 
+export const CRAWL_URL_TOOL_NAME = 'crawl_url';
+export const CRAWL_URL_TOOL_PARAMETERS = z.object({
+    urls: z.array(z.string()).describe('Array of URLs to crawl'),

kody code-review Potential Issues high

urls: z.array(z.string().url()).describe('Array of URLs to crawl'),

The urls parameter is described as 'Array of URLs to crawl', but the schema z.array(z.string()) allows any string, not just valid URLs. To ensure data integrity and prevent potential downstream errors in CrawlerService if it expects valid URL formats, consider using z.string().url().

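
For illustration, the stricter schema rejects malformed strings at parse time (shown with a local schema of the same shape):

    import { z } from 'zod';

    const params = z.object({
        urls: z.array(z.string().url()).describe('Array of URLs to crawl'),
    });

    console.log(params.safeParse({ urls: ['https://example.com'] }).success); // true
    console.log(params.safeParse({ urls: ['not a url'] }).success);           // false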
