This has been abandoned due to reCAPTCHA issues on the website; a rewrite in Python now seems the better approach, favouring one of the stealth browser libraries.
Okay, I have an idea.
It's a website I want to scrape, but it has this nested many-buttons-under-many-buttons kind of structure, with documents attached to them as blobs (pretty sure it's a React job, but who cares).
So I'd like to implement a git-like sync system, which checks the files and folders and updates them accordingly by comparing hashes.
If a new folder or section appears on the site, a matching folder is created on my laptop and all the files underneath it are downloaded.
(Every section: new folder; every button: new folder; every document: new file.)
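The hash-comparison idea could be sketched roughly like this. `FileHasher` is one of the planned classes, but the method names and the "hash bytes, compare, re-download on mismatch" shape here are my assumptions, not settled design:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

// Hypothetical sketch of FileHasher: hash a blob so a freshly
// scraped document can be compared against the local copy.
public class FileHasher {
    public static String sha256(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return HexFormat.of().formatHex(md.digest(data));
    }

    // A file only needs (re)downloading when it is missing locally
    // or its hash no longer matches the remote blob's hash.
    public static boolean needsUpdate(Path local, byte[] remoteBytes) throws Exception {
        if (!Files.exists(local)) return true;
        return !sha256(Files.readAllBytes(local)).equals(sha256(remoteBytes));
    }
}
```

Note `HexFormat` needs Java 17+; on older JDKs the hex conversion would have to be hand-rolled.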
com.scrapektu.app
|
| - App.java (End integrator)
|
| - model/
| | - Node.java (tree node class with attributes)
| | - NodeType.java (enum of node kinds: section / button / document)
|
| - session/
| | -BrowserSession.java
|
| - scraper/
| | - TreeBuilder.java (Accesses the JSON/YAML and then makes a tree from the layout)
|
| - sync/ (Stage 2)
| | - SyncEngine.java
| | - RepoManager.java
| | - FileHasher.java
|
| - net/
| | - Downloader.java
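A minimal sketch of what Node.java and NodeType.java might look like. The field names (`name`, `hash`, `children`) are my guesses for illustration, not the project's actual fields:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical enum: one variant per level of the site hierarchy.
enum NodeType { SECTION, BUTTON, DOCUMENT }

// Hypothetical tree node: sections and buttons become folders,
// documents become files, mirroring the site layout.
class Node {
    final String name;
    final NodeType type;
    final String hash; // content hash for DOCUMENT nodes, null otherwise
    final List<Node> children = new ArrayList<>();

    Node(String name, NodeType type, String hash) {
        this.name = name;
        this.type = type;
        this.hash = hash;
    }

    void addChild(Node child) { children.add(child); }
}
```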
Session.java → Starts browser (no scraping)
Scrape.java → Navigates website (all scraping logic)
Node.java → Holds scraped data (tree structure)
JsonUtil.java → Save/load tree
SyncEngine.java → Compare trees + download
RepoManager.java → Write folders/files
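The compare-and-download step could work roughly like this: flatten each tree into a map of relative path to content hash, then diff the snapshots. This is only a sketch under my own assumptions; a real SyncEngine would also have to handle deletions and renames:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical diff over two snapshots: the tree saved from the
// last run vs. the tree just scraped from the site.
public class SyncEngine {
    // Returns the paths whose blobs must be (re)downloaded:
    // anything new, or anything whose hash changed.
    public static List<String> toDownload(Map<String, String> oldTree,
                                          Map<String, String> newTree) {
        List<String> out = new ArrayList<>();
        for (var e : newTree.entrySet()) {
            String oldHash = oldTree.get(e.getKey());
            if (!e.getValue().equals(oldHash)) out.add(e.getKey());
        }
        return out;
    }
}
```

Keeping the diff on plain path→hash maps means the tree layout (sections, buttons) only matters when RepoManager writes the folders, not when deciding what to fetch.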
Pull from: