Graph implementation #18

Merged: 5 commits, Feb 14, 2024


@PeriniM (Collaborator) commented Feb 14, 2024


Title: Implement Graph-Based Scraping Logic with SmartScraper

Description:

This pull request introduces a graph-based approach to web scraping, centered on a new class, SmartScraper. SmartScraper serves as a base for building scraping workflows as a directed graph of nodes. Each node represents one step in the scraping process, such as fetching HTML, extracting probable tags, or generating an answer to the user's query.

Key components added in this PR include:

  • BaseNode and its subclasses (FetchHTMLNode, GetProbableTagsNode, ParseHTMLNode, GenerateAnswerNode, ConditionalNode) for creating versatile and reusable scraping operations.
  • BaseGraph for managing the execution flow among nodes.
  • The SmartScraper class, which encapsulates the graph logic and simplifies the creation of scraping tasks.
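The node-and-graph design listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class names `SketchNode`, `SketchConditionalNode`, and `SketchGraph`, the shared `state` dictionary, and the `execute` signatures are all assumptions made for the example, and the node bodies are stubs that make no network or LLM calls.

```python
from typing import Callable, Dict, Optional

State = Dict[str, object]


class SketchNode:
    """Hypothetical node: one step that reads and updates a shared state dict."""

    def __init__(self, name: str, action: Callable[[State], State]):
        self.name = name
        self.action = action

    def execute(self, state: State) -> State:
        return self.action(state)


class SketchConditionalNode(SketchNode):
    """Hypothetical branching node: picks the next node name from the state."""

    def __init__(self, name: str, key: str, if_true: str, if_false: str):
        super().__init__(name, lambda state: state)  # leaves the state unchanged
        self.key, self.if_true, self.if_false = key, if_true, if_false

    def next_node(self, state: State) -> str:
        return self.if_true if state.get(self.key) else self.if_false


class SketchGraph:
    """Hypothetical directed graph: follows one outgoing edge per node."""

    def __init__(self, nodes, edges: Dict[str, str], entry_point: str):
        self.nodes = {node.name: node for node in nodes}
        self.edges = edges
        self.entry_point = entry_point

    def execute(self, state: State) -> State:
        current: Optional[str] = self.entry_point
        while current is not None:
            node = self.nodes[current]
            state = node.execute(state)
            if isinstance(node, SketchConditionalNode):
                current = node.next_node(state)
            else:
                current = self.edges.get(current)  # None ends the run
        return state


# Stub steps standing in for fetching, tag selection, parsing, and answering.
graph = SketchGraph(
    nodes=[
        SketchNode("fetch_html", lambda s: {**s, "html": "<h1>Demo</h1><p>Text</p>"}),
        SketchNode("get_probable_tags", lambda s: {**s, "tags": ["h1"]}),
        SketchConditionalNode("has_tags", "tags", "parse_html", "generate_answer"),
        SketchNode("parse_html", lambda s: {**s, "parsed": "Demo"}),
        SketchNode("generate_answer", lambda s: {**s, "answer": s.get("parsed", s["html"])}),
    ],
    edges={
        "fetch_html": "get_probable_tags",
        "get_probable_tags": "has_tags",
        "parse_html": "generate_answer",
    },
    entry_point="fetch_html",
)

result = graph.execute({"prompt": "What is the page title?"})
print(result["answer"])
```

Passing a single state dictionary from node to node keeps each step independent of the others, which is one plausible way to get the reusability the description mentions.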

Example Usage:

Below is a brief example demonstrating how to use the SmartScraper to extract information from a webpage:

from yosoai.graphs import SmartScraper

OPENAI_API_KEY = ''  # placeholder: set your OpenAI API key here

llm_config = {
    "api_key": OPENAI_API_KEY,
    "model_name": "gpt-3.5-turbo",
}

url = "https://perinim.github.io/projects/"
prompt = "List me all the titles and project descriptions"

smart_scraper = SmartScraper(prompt, url, llm_config)

answer = smart_scraper.run()
print(answer)
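To illustrate how a class like SmartScraper can hide its workflow behind a single `run()` call, here is a rough sketch. It is not the implementation from this PR: `SketchScraper`, its `_steps` list, and the stubbed step functions are hypothetical, the steps run as a simple linear sequence rather than a full graph, and the stubs return canned data instead of fetching the URL or calling the model.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


def fetch_html_stub(state: State) -> State:
    # Stand-in for an HTTP fetch of state["url"]; returns canned HTML.
    return {**state, "html": "<h2>Project A</h2><p>First project</p>"}


def generate_answer_stub(state: State) -> State:
    # Stand-in for an LLM call that would use state["llm_config"] and state["prompt"].
    return {**state, "answer": {"titles": ["Project A"]}}


class SketchScraper:
    """Hypothetical facade: builds the workflow once and exposes only run()."""

    def __init__(self, prompt: str, url: str, llm_config: Dict[str, str]):
        self.initial_state: State = {
            "prompt": prompt,
            "url": url,
            "llm_config": llm_config,
        }
        # Assumed fixed order of steps; a real graph could branch between them.
        self._steps: List[Callable[[State], State]] = [
            fetch_html_stub,
            generate_answer_stub,
        ]

    def run(self) -> object:
        state = dict(self.initial_state)
        for step in self._steps:
            state = step(state)
        return state.get("answer")


scraper = SketchScraper(
    "List me all the titles and project descriptions",
    "https://perinim.github.io/projects/",
    {"api_key": "", "model_name": "gpt-3.5-turbo"},
)
answer = scraper.run()
print(answer)
```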

Future Work

  • Add tests;
  • Expand the node library to cover more scraping and data processing scenarios;
  • Implement error handling and retry logic for robustness against web scraping challenges.
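As one possible starting point for the retry item above, the sketch below wraps a step in exponential backoff. Nothing here is part of the PR: `run_with_retry`, its parameters, the choice of retryable exceptions, and the `flaky_fetch` demo are all assumptions for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_retry(
    step: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `step`, retrying with exponential backoff on the listed exceptions."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")


# Demo: a hypothetical flaky step that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


# Record the delays instead of sleeping so the demo runs instantly.
delays = []
html = run_with_retry(flaky_fetch, sleep=delays.append)
print(html, calls["count"], delays)
```

Injecting the `sleep` function keeps the helper easy to test, which would also fit the "Add tests" item.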

Dependency Review

✅ No vulnerabilities or license issues found.

Scanned Manifest Files

requirements-dev.txt
  • setuptools@65.5.1
requirements.txt
  • beautifulsoup4@4.12.3
  • pytest@8.0.0
  • setuptools@65.5.1

@PeriniM added the enhancement (New feature or request) and dependencies (Pull requests that update a dependency file) labels Feb 14, 2024
@PeriniM merged commit 7f9a004 into main Feb 14, 2024
3 checks passed
@PeriniM deleted the graph_implementation branch February 15, 2024 07:26