
Add cdp support for xpath scrapers #625

Merged 18 commits into stashapp:develop on Aug 4, 2020

Conversation

bnkai (Collaborator) commented Jun 20, 2020

Adds an extra driver option to the scraper config. Usage:

driver:
  useCDP: true
  remote: true
  sleep: 3

makes loadURL use Chrome CDP to fetch the URL and thus partially render it.
If useCDP is set to false, or the driver option is absent, the plain HTTP client is used instead.
If remote is set to true, CDP looks for a remote Chrome instance at the address 127.0.0.1:9222, compatible with the headless Chrome docker image https://hub.docker.com/r/chromedp/headless-shell/
If remote is unset or set to false, it defaults to looking for google-chrome in the $PATH.
sleep is the time in seconds to wait before actually getting the page from the DOM. This is needed because some sites (bang.com, for example) need more time for their loading scripts to finish. If unset, it defaults to 2 seconds.
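
The option semantics above can be sketched roughly as follows; the struct and function names here are illustrative only, not the actual stash code:

```go
package main

import "fmt"

// driverOptions mirrors the driver: block of a scraper config.
// The type and field names are illustrative, not stash's own types.
type driverOptions struct {
	UseCDP bool // use Chrome CDP instead of the plain http client
	Remote bool // look for a remote Chrome at 127.0.0.1:9222
	Sleep  int  // seconds to wait before reading the DOM
}

// applyDefaults fills in the documented default: when sleep is
// unset (zero), fall back to 2 seconds.
func applyDefaults(o *driverOptions) {
	if o.Sleep <= 0 {
		o.Sleep = 2
	}
}

func main() {
	// A config that sets useCDP/remote but omits sleep.
	opts := driverOptions{UseCDP: true, Remote: true}
	applyDefaults(&opts)
	fmt.Printf("useCDP=%v remote=%v sleep=%d\n", opts.UseCDP, opts.Remote, opts.Sleep)
}
```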

Having this option allows us to create scrapers for sites that pull data from js.

The following puretaboo scraper becomes possible (performers and tags can't be retrieved without CDP):

name: puretaboo
sceneByURL:
  - action: scrapeXPath
    url: 
      - https://www.puretaboo.com/en/video
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    scene:
      Title: //div[@class="styles_Rms8xL5kg6"]/h2
      Image: //video[@class="vjs-tech"]/@poster
      Date:
        selector: //span[@class="Text ScenePlayer-ReleaseDate-Text styles_3tU3Z2sLeO"]/text()
        parseDate: 2006-01-02
      Details:
        selector: //meta[@name="twitter:description"]/@content
        replace:
          - regex: </br>|<br>
            with: "\n"
      Tags:
        Name:
          selector: //div[@class="BackgroundBox ScenePlayer-SceneCategories-BackgroundBox styles_1khKtnnA8W"]/a
      Performers:
        Name:
          selector: //div[@class="component-ActorThumb-List"]/a/@title
      Studio:
        Name:
          selector: //div[@class="styles_1oxPFmiuVp"]/a/@title
debug:
  printHTML: true
driver:
  useCDP: true
  remote: true
#Last Updated June 23, 2020

Use of the debug option is advised so that you can look at the actual page text that is returned.

bnkai (Collaborator, Author) commented Jun 20, 2020

Another example of a CDP-only scraper:

name: Bang
sceneByURL:
  - action: scrapeXPath
    url:
      - https://www.bang.com/video
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    scene:
      Title: //meta[@name="og:title"]/@content
      Details: //meta[@name="description"]/@content
      Image: //meta[@name="og:image"]/@content        
      Date:
        selector: //span[@class="hidden-xs fa-text-right fa-text-left"]/text()
        replace:
          - regex: (\w+\s)(\d+)\w+,(\s\d{4})
            with: $1$2$3
        parseDate: Jan 2 2006
      Tags:
        Name:
          selector: //div[@class="genres bottom-buffer10"]/a
      Performers:
        Name:
          selector: //div/span[@class="fa-text-right" and contains(text(),"With:")]/../span[@class="comma-list-container"]/span/a/text()
      Studio:
        Name: //div/span[@class="fa-text-right" and contains(text(),"Studio:")]/a/text()
debug:
  printHTML: true
driver:
  useCDP: true
  remote: true
  sleep: 5

bnkai added the "improvement" label Jun 20, 2020
bnkai (Collaborator, Author) commented Jun 21, 2020

Added a clearCookies option, so we now have the options below, listed with their default values:

driver:
  useCDP: false
  remote: false
  clearCookies: false

clearCookies clears **ALL** browser cookies and then fetches the URL. It is mainly needed for managing the remote CDP instance, as by default cookies are enabled and stored in the docker container. For users running their local Chrome browser it is not really needed or recommended. I added some helper functions for setting individual cookies, but from what I tested they are not needed, since CDP with the remote instance stores all cookies. If finer control over cookies is needed later, I can revisit the issue.

WithoutPants (Collaborator):
I don't think allowing scraper configs to wipe Chrome cookies is safe or reasonable. Further, because it is in the config, there is no way of tailoring it to docker and non-docker instances.

I think another way is needed to achieve this. I'm not very clear on the protocols, but maybe this might help?

bnkai (Collaborator, Author) commented Jun 22, 2020

The clearCookies option was supposed to be for users who write scrapers, not end users. I needed to delete some cookies while testing a CDP scraper and didn't know how to. I agree, as stated above, that it isn't recommended though.
I think it might be better to add support for deleting a specific cookie (name, value, domain) later on if needed, and just remove the clearCookies option.

I'll have a look at what you referenced; if it enables something like an incognito mode, I'll implement it in a later PR.

EDIT: merged with upstream code. Removed clearCookies code. Basic functionality should be complete.

WithoutPants (Collaborator):
I'm not sure that the remote option should be a scraper-specific option. Is there a legitimate reason that two different scrapers will use different values for remote? Seems like it should be a stash configuration option.

bnkai (Collaborator, Author) commented Jun 25, 2020

Yes, I was also reluctant about this. I was either going to put it there or make it a scraper option beneath the user-agent setting.
My thinking was that the remote option is only available for the CDP driver, and in some scrapers users may want to use their local Chrome instance to get past authentication (session) issues, as I haven't yet implemented any set-cookie functionality. It doesn't matter to me though, as I can add it in the UI.
Was the remote: false option working for you in Windows, though? I couldn't try that.

WithoutPants (Collaborator):
Tried the following:

  • copied the puretaboo scraper as-is and tested without changing any paths or docker stuff. Scraped, but did not set performers/tags
  • set remote to false and retested. Same results as above.
  • added my chrome directory to the path environment variable and retried. Same results as above.

FWIW chrome isn't named google-chrome on Windows. I'm not sure how else to get this to work.

bnkai (Collaborator, Author) commented Jul 1, 2020

That's weird.
If you use CDP and set remote correctly, it should fetch the site. If the performers/tags are not shown, you might need to increase the sleep option from its default of 2 (probably the scripts loading the performers/tags weren't finished in time).

If there is an error locating a Chrome instance, you should get a red GraphQL message like this in the UI:
Error: GraphQL error: exec: "google-chrome": executable file not found in $PATH
That was my error when I set remote to false without having a local instance.

I don't know how CDP locates the Chrome binary in Windows, but since you got some results I am fairly certain it got detected and used. Having a look at the process manager in Windows while running a stash CDP query might verify that. (You can increase the sleep option to 5 to force it to wait a little longer.)

Running against the following urls:

https://www.puretaboo.com/en/video/puretaboo/mommy-monster/132502
https://www.puretaboo.com/en/video/puretaboo/family-vacation/144431
https://www.puretaboo.com/en/video/puretaboo/nerd's-breaking-point/148252
https://www.puretaboo.com/en/video/puretaboo/resisting-arrest/149296
https://www.puretaboo.com/en/video/puretaboo/half-his-age---part-2/126933
https://www.puretaboo.com/en/video/puretaboo/half-his-age%3A-bts-featurette/127740
https://www.puretaboo.com/en/video/puretaboo/swapping-daughters%3A-the-other-family/144420
https://www.puretaboo.com/en/video/puretaboo/dibs-on-mom/147694

I get everything scraped as expected from puretaboo:
https://pastebin.com/9eHZK4re

WithoutPants (Collaborator):
My test system already had a puretaboo scraper (under a different filename) which was interfering with this test.

I can confirm that the scraper appears to give the correct results when I add the chrome directory to the PATH. I haven't tried the remote configuration.

I think the chrome configuration needs to be put into the general configuration. A free text field that accepts either a remote chrome instance URL, or a path to the chrome executable. This way the behaviour is clear and deterministic. remote can then be removed from the scraper configuration.

bnkai (Collaborator, Author) commented Aug 2, 2020

@WithoutPants your PR changes work fine using the headless-shell docker as a remote URL (/json/version needs to be present in the URL; I added that as an example), thanks.
Adding it as a path also worked (I used /usr/bin/google-chrome, to be exact).

WithoutPants added this to the Version 0.3.0 milestone Aug 4, 2020
WithoutPants merged commit 4373f9b into stashapp:develop Aug 4, 2020
Tweeticoats pushed a commit to Tweeticoats/stash that referenced this pull request Feb 1, 2021
Co-authored-by: WithoutPants <53250216+WithoutPants@users.noreply.github.com>