This has been abandoned due to reCAPTCHA issues on the website; a rewrite in Python now seems the better approach, favouring one of the stealth browser libraries.
Okay, I have an idea.
It's a website I want to scrape, but it has this nested many-buttons-under-many-buttons kind of structure, with documents attached to them as blobs (pretty sure it's a React job, but who cares).
So I'd like to implement a git-like sync system, which checks the files and folders and updates them accordingly by comparing hashes.
If a new folder or section appears on the site, a matching folder is created on my laptop and all the files underneath it are downloaded.
(Every section: new folder; every button: new folder; every document: new file.)
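The hash-comparison idea could be sketched roughly like this. `FileHasher` is one of the planned classes, but the method names and the "hash bytes, compare, re-download on mismatch" shape here are my assumptions, not settled design:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

// Hypothetical sketch of FileHasher: hash a blob so a freshly
// scraped document can be compared against the local copy.
public class FileHasher {
    public static String sha256(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return HexFormat.of().formatHex(md.digest(data));
    }

    // A file only needs (re)downloading when it is missing locally
    // or its hash no longer matches the remote blob's hash.
    public static boolean needsUpdate(Path local, byte[] remoteBytes) throws Exception {
        if (!Files.exists(local)) return true;
        return !sha256(Files.readAllBytes(local)).equals(sha256(remoteBytes));
    }
}
```

Note `HexFormat` needs Java 17+; on older JDKs the hex conversion would have to be hand-rolled.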
com.scrapektu.app
|
| - App.java (End integrator)
|
| - model/
| | - Node.java (tree node class with attributes)
| | - NodeType.java (enum of node kinds: section / button / document)
|
| - session/
| | -BrowserSession.java
|
| - scraper/
| | - TreeBuilder.java (Accesses the JSON/YAML and then makes a tree from the layout)
|
| - sync/ (Stage 2)
| | - SyncEngine.java
| | - RepoManager.java
| | - FileHasher.java
|
| - net/
| | - Downloader.java
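A minimal sketch of what Node.java and NodeType.java might look like. The field names (`name`, `hash`, `children`) are my guesses for illustration, not the project's actual fields:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical enum: one variant per level of the site hierarchy.
enum NodeType { SECTION, BUTTON, DOCUMENT }

// Hypothetical tree node: sections and buttons become folders,
// documents become files, mirroring the site layout.
class Node {
    final String name;
    final NodeType type;
    final String hash; // content hash for DOCUMENT nodes, null otherwise
    final List<Node> children = new ArrayList<>();

    Node(String name, NodeType type, String hash) {
        this.name = name;
        this.type = type;
        this.hash = hash;
    }

    void addChild(Node child) { children.add(child); }
}
```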
Session.java → Starts browser (no scraping)
Scrape.java → Navigates website (all scraping logic)
Node.java → Holds scraped data (tree structure)
JsonUtil.java → Save/load tree
SyncEngine.java → Compare trees + download
RepoManager.java → Write folders/files
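The compare-and-download step could work roughly like this: flatten each tree into a map of relative path to content hash, then diff the snapshots. This is only a sketch under my own assumptions; a real SyncEngine would also have to handle deletions and renames:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical diff over two snapshots: the tree saved from the
// last run vs. the tree just scraped from the site.
public class SyncEngine {
    // Returns the paths whose blobs must be (re)downloaded:
    // anything new, or anything whose hash changed.
    public static List<String> toDownload(Map<String, String> oldTree,
                                          Map<String, String> newTree) {
        List<String> out = new ArrayList<>();
        for (var e : newTree.entrySet()) {
            String oldHash = oldTree.get(e.getKey());
            if (!e.getValue().equals(oldHash)) out.add(e.getKey());
        }
        return out;
    }
}
```

Keeping the diff on plain path→hash maps means the tree layout (sections, buttons) only matters when RepoManager writes the folders, not when deciding what to fetch.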
Pull from: