🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸 🐸
Meme tracker is a web scraper and image grouper for internet memes.
It scrapes a given URL for images and downloads them. Then, it groups similar images together and displays those image groups as clusters of images in a browser.
-
Install Pipenv for your user.
user@pc:~$ pip install --user pipenv
-
Create a virtual environment for this project and activate it.
user@pc:~$ pipenv shell
-
Clone this repo.
user@pc:~/projects$ git clone [email protected]:Obleskar/meme_tracker.git
-
Install dependencies.
user@pc:~/projects/meme_tracker$ pipenv install
-
Create a YAML file called spider_config.yaml, place it in the project's root directory, and add a list of URLs to scrape. The scraper's currently limited to 4chan boards and 4chan threads.
user@pc:~/projects/meme_tracker$ vim spider_config.yaml
urls: [http://boards.4channel.org/v/]
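As a rough illustration of the expected config shape, the sketch below checks a mapping that a YAML parser would produce for the file above. The function name validate_config is hypothetical and not part of this project; it only assumes that the file holds a top-level urls key containing a list of absolute http(s) URLs.

```python
from urllib.parse import urlparse


def validate_config(config):
    """Return the URL list from a parsed spider_config.yaml mapping.

    Hypothetical helper: assumes a top-level 'urls' key holding a list.
    """
    urls = config.get("urls") if isinstance(config, dict) else None
    if not isinstance(urls, list) or not urls:
        raise ValueError("spider_config.yaml must contain a non-empty 'urls' list")
    for url in urls:
        parsed = urlparse(str(url))
        # Reject anything that is not an absolute http(s) URL.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    return [str(url) for url in urls]


# This mapping mirrors the one-line YAML file shown above.
config = {"urls": ["http://boards.4channel.org/v/"]}
print(validate_config(config))
```

A dict-shaped value under urls (the older layout mentioned in the changelog) would be rejected by this check.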
-
Run the image scraper.
If you provided a board, then every image from every thread on the board's first page will be downloaded.
If you provided a thread, then every image in that thread will be downloaded.
Press Ctrl+C once to stop scraping after the current downloads have finished, and again to stop scraping immediately.
user@pc:~/projects/meme_tracker$ scrapy crawl 4chan_images
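The board versus thread distinction described in this step can be sketched as a small URL classifier. This is not the spider's actual code; classify_4chan_url is a hypothetical name, the thread number in the example is made up, and the sketch assumes thread URLs follow the /<board>/thread/<id> path layout.

```python
from urllib.parse import urlparse


def classify_4chan_url(url):
    """Classify a URL as a board or a thread (hypothetical helper).

    Returns ("board", name) or ("thread", name, thread_id).
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) == 1:
        # e.g. /v/ -> the board's first page
        return ("board", parts[0])
    if len(parts) >= 3 and parts[1] == "thread" and parts[2].isdigit():
        # e.g. /v/thread/123456 -> a single thread
        return ("thread", parts[0], int(parts[2]))
    raise ValueError(f"unsupported URL: {url!r}")


print(classify_4chan_url("http://boards.4channel.org/v/"))
print(classify_4chan_url("http://boards.4channel.org/v/thread/123456"))
```

A scraper could use the first element of the result to decide whether to walk every thread on the board's first page or only the given thread.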
-
Launch a local webserver to host the downloaded images.
user@pc:~/projects/meme_tracker$ python3 show_images.py
-
Navigate to http://localhost:5000 in a web browser to view the images in a grid.
The goal is to provide researchers with an easy way to view daily summaries of images on the internet.
- Feat: Scrape image URLs from /v
- Feat: Download images
- Feat: Show images in browser
- Feat: Generate thumbnails
- Feat: Justify image grid
- Internal: Change yaml config from dict to list
- Internal: Write names and locations to database
- Feat: Add Jupyter notebooks
- Feat: Add a dhashing notebook
- Feat: Run dhashing notebook with papermill
- Feat: Cluster images
- Feat: Display image cluster "compass" in web browser
- Feat: Write origin post URLs to the database
- Feat: Click image to open origin post in new tab
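The dhashing and clustering items above refer to difference hashing, which can be sketched in plain Python. This is a minimal illustration of the general technique, not the project's notebook: it assumes each image has already been converted to grayscale and shrunk to a small grid of brightness values (commonly 9 columns by 8 rows), and the greedy grouping with a max_distance threshold is a simplification chosen for brevity.

```python
def dhash(pixels):
    """Difference hash of a grayscale grid (list of equal-length rows).

    Each bit records whether a pixel is brighter than its right-hand neighbour.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def group_by_hash(hashes, max_distance=2):
    """Greedily group names whose hashes are within max_distance bits
    of a group's first member."""
    groups = []  # list of (representative_hash, [names])
    for name, value in hashes.items():
        for representative, names in groups:
            if hamming(representative, value) <= max_distance:
                names.append(name)
                break
        else:
            groups.append((value, [name]))
    return [names for _, names in groups]


# Tiny 3x2 grids stand in for resized images; file names are made up.
hashes = {
    "a.jpg": dhash([[10, 20, 30], [30, 20, 10]]),
    "b.jpg": dhash([[12, 22, 29], [31, 19, 11]]),  # slightly altered copy of a.jpg
    "c.jpg": dhash([[30, 20, 10], [10, 20, 30]]),  # mirrored gradients
}
print(group_by_hash(hashes, max_distance=1))
```

Because only the relative brightness of neighbouring pixels is hashed, small changes in brightness or compression tend to leave the hash unchanged, which is why near-duplicate images end up in the same group.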