
Soldy/WebScrapper-Helper


The WebScrapper Helper

is an easy-to-use collection of web-scraping helpers. The goal is to scrape a site with minimal lines of code.

(new MinerHelperClass(
  DOMElement,
  {
    title : QuerySelectorString,
    url   : {QuerySelectorString, a: 'href'},
  },
  'url', // data-existence check element
  'http://anyserver/urls',
  {
    'Before' : {
        'type'    : 'click',
        'element' : DOMElementOfBeforeButtonLinkOrEvent,
        'text'    : 'ButtonText'
    },
    'next' : {
        'type'    : 'click',
        'element' : DOMElementOfNextButton,
        'text'    : 'ButtonText'
    }
  }
)).loop();

Or simply detect the data with the signal. (This needs more documentation later.)

WebScrapper works well with modern websites, even those without server rendering. It is compatible with Greasemonkey and simple_scrapper, and you can also use them together.

Why: AI

The main issue that prompted this project relates to AI. Using AI to implement a scraper is not the problem; understanding what the AI is doing can be. The AI can generate unique JavaScript code that scrapes a website, and it usually gets the job done. The real difficulty for us as humans is the time spent trying to comprehend how that code works. That is why this tool implements a meta-language simple enough to understand straight away: always the same functions.

Let's start

The WebScrapper Helper focuses on mining the data before storing it, which improves both the visuals and the learning experience. Saving a web page and processing it later is more efficient, but it doesn't provide as rich a visual experience.

CoolHelperClass

The core is the CoolHelperClass. It is not designed as a S.O.L.I.D. object, just a couple of tool functions. Each function can return text, HTML code, an attribute value, or a combination list (a map<string, string>, or Object<string, string>, if that says something to you).

The primary goal is to build a list of query selectors and return their corresponding values if they exist. If a particular value does not exist, the function should return an empty string. Simple, right?

I really like the query selector feature, but it has some limitations and can be quite complex. One of the main issues is that it returns null if an element does not exist. That is acceptable in itself, but it leads to errors when you attempt to interact with that element, so you end up writing extra lines of code to handle it. If you need five additional lines just to retrieve the content, that doesn't feel efficient. It's not a major problem, but it complicates things unnecessarily.
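To make the pain concrete, here is the kind of guard code you end up writing with plain querySelector. This is a minimal sketch; the stub `doc` object stands in for the browser's `document` so it runs anywhere:

```javascript
// Stub standing in for the browser's `document`, so the sketch runs
// outside a browser; in real use you would call document.querySelector.
const doc = {
  querySelector: (sel) => (sel === '.title' ? { innerText: 'Hello' } : null),
};

// The vanilla pattern: every lookup needs its own null check,
// otherwise a missing element throws a TypeError on .innerText.
function safeText(root, selector) {
  const el = root.querySelector(selector);
  return el ? el.innerText : ''; // empty string instead of a crash
}

console.log(safeText(doc, '.title'));   // 'Hello'
console.log(safeText(doc, '.missing')); // '' (no crash)
```

This null guard is exactly the boilerplate the helper functions below are meant to absorb.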

iDom

iDom is mostly shorthand. It is the exception to the empty-string rule: if the element does not exist, it returns null.

const cH = new CoolHelperClass();

cH.iDom('.class'); // returns a DOMElement, equivalent to document.querySelector('.class')

If the root element is not the document, it needs to be passed in as the first argument.

const cH = new CoolHelperClass();
const iDom = cH.iDom;
iDom(iDom('.class'), 'ul'); // returns a DOMElement or null, equivalent to document.querySelector('.class').querySelector('ul')

This is just an example, and not the best one, because the single query selector '.class ul' gives the same result.

Let's combine querySelector with querySelectorAll. Ultimately we want to work with a single DOM element, so we index into the querySelectorAll result with the last argument.

const cH = new CoolHelperClass();
const iDom = cH.iDom;
iDom(iDom('.class'), 'li', 2); // returns a DOMElement or null, equivalent to document.querySelector('.class').querySelectorAll('li')[2]

This function only serves the DRY principle.
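The calling conventions above can be sketched as follows. This is assumed semantics reconstructed from the examples, not the library's source, and the stub objects stand in for the DOM so the sketch is runnable on its own:

```javascript
// Stub DOM: `.class` resolves to a list element containing one <ul> and three <li>.
const stubDoc = {
  querySelector: (sel) => (sel === '.class' ? stubList : null),
};
const stubList = {
  querySelector: (sel) => (sel === 'ul' ? { tag: 'ul' } : null),
  querySelectorAll: (sel) =>
    sel === 'li' ? [{ id: 0 }, { id: 1 }, { id: 2 }] : [],
};

// Sketch of iDom's argument dispatch (assumed, not the real implementation):
// iDom(selector) | iDom(root, selector) | iDom(root, selector, index)
function iDomSketch(a, b, c) {
  if (typeof a === 'string') return stubDoc.querySelector(a); // iDom('.class')
  if (a === null) return null;                // a missing root short-circuits to null
  if (c === undefined) return a.querySelector(b);             // iDom(root, 'ul')
  return a.querySelectorAll(b)[c] ?? null;                    // iDom(root, 'li', 2)
}

console.log(iDomSketch('.class'));                      // the stub list element
console.log(iDomSketch(iDomSketch('.class'), 'li', 2)); // third <li>
console.log(iDomSketch(null, 'li', 2));                 // null, no crash
```

The point of the dispatch is that a null root never throws; it just propagates null, which is what lets the calls nest.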

iText

This is where the little magic begins. This function relies on the iDom function but returns the innerText. If the element does not exist, it will return an empty string. You can call it in the same way as the iDom.

const cH = new CoolHelperClass();
cH.iText('.class'); // returns a string, almost equivalent to document.querySelector('.class').innerText

SwarmManager

Swarm Manager is a multi-scraper communication tool. When you want to scrape fast, the key is how many slave workers you have. A slave can be a tab, a new browser process, a VM, or a separate machine.
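SwarmManager's protocol is not documented here, but the underlying idea, fanning a URL list out across N workers, can be sketched as follows. The `partition` helper is hypothetical, for illustration only, and not part of SwarmManager's actual API:

```javascript
// Hypothetical helper (not SwarmManager's real API): split a list of URLs
// round-robin across `workerCount` workers, so each tab / process / machine
// gets its own share of the crawl.
function partition(urls, workerCount) {
  const buckets = Array.from({ length: workerCount }, () => []);
  urls.forEach((url, i) => buckets[i % workerCount].push(url));
  return buckets;
}

const pages = ['/page1', '/page2', '/page3', '/page4', '/page5'];
console.log(partition(pages, 2));
// → [ [ '/page1', '/page3', '/page5' ], [ '/page2', '/page4' ] ]
```

Each bucket would then be handed to one worker, whatever form that worker takes.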

Personal Note

I built the WebScrapper Helper, and together with Anyserver it is the fourth most powerful web scraper I've developed on my own. (I've created three other scrapers that are more powerful, and I also collaborated with a team on even better projects. It's no surprise that I'm not the smartest person in the room, but I still believe this is a solid tool.) I haven't come across any open-source tools of a similar caliber, which is why I decided to publish it. Since I have limited time, any assistance would be greatly appreciated.
