forked from JosephLai241/URS

Universal Reddit Scraper - Scrape Subreddits, Redditors, and submission comments. A command-line tool written in Python (PRAW).


skiwheelr/URS

 
 


FOR SKIWHEELR ADDONS, READ THE FOLLOWING

  1. You should do this on a Mac or Linux machine.

  2. Follow the URS setup below for the Credentials.py file and remove the .txt extension.

  3. Run pip3 install -r requirements.txt

  4. Enter "chmod u+x scrapeall" in the URS/urs directory.

  5. Also pip3 install the packages praw and jq, and brew install parallel (GNU).

  6. Add the tickers you want to search for in tickers (one ticker per line).

  7. From the URS/urs folder, enter PATH=$PATH:$(pwd) so you can run the app with one word.

  8. Run the app by typing "scrapeall l" or "scrapeall c". Use 'l' to simply print the discovered links, or 'c' to scan them as well (outside of the Reddit API).

  9. Make sure to clean up after you are done by entering the following from the URS/urs folder:

  • rm -rf ../scrapes/* all.json links combilinks

Never push your Credentials.py file; it stays in gitignore. Please do not push your scraped data either, as it is HUGE.

 __  __  _ __   ____  
/\ \/\ \/\`'__\/',__\ 
\ \ \_\ \ \ \//\__, `\
 \ \____/\ \_\\/\____/
  \/___/  \/_/ \/___/... Universal Reddit Scraper 

Email Say Thanks!

Introduction

This is a universal Reddit scraper that can scrape Subreddits, Redditors, and comments from submissions.

URS is written in Python and uses PRAW, the Python Reddit API Wrapper, to access the official Reddit API.

Run pip install -r requirements.txt to get all project dependencies.

You will need your own Reddit account and API credentials for PRAW. See the Getting Started section for more information.

NOTE: PRAW is currently supported on Python 3.5+. This project was tested with Python 3.8.2.

Whether you are using URS for enterprise or personal use, I am very interested in hearing about your use cases and how it has helped you achieve a goal. Please send me an email or leave a note by clicking on the Email or Say Thanks! badge. I look forward to hearing from you!

Contributing

I am currently looking for contributors who have experience with:

  • Unit testing using pytest
  • Packaging Python projects and deploying to PyPI
  • Deploying documentation to ReadTheDocs

If you are interested in contributing, please send me an email at the address listed in the email badge!


Version 3.0.0 is most likely the last major iteration of URS, but I will continue to build upon it as I learn more about computer science.

You can suggest new features or changes by going to the Issues tab and filling out the Feature Request template. If there are good suggestions and a good reason for adding a feature, I will consider adding it.

You are also more than welcome to create a pull request, adding additional features, improving runtime, or streamlining existing code. If the pull request is approved, I will merge the pull request into the master branch and credit you for contributing to this project.

Make sure you follow the contributing guidelines when creating a pull request. See the Contributing document for more information.

URS Overview

Scrape speeds vary with the number of results returned for Subreddit or Redditor scraping, or with the submission's popularity (total number of comments) for submission comments scraping. They are also affected by your internet connection speed.

All exported files are saved within the scrapes/ directory and stored in a sub-directory labeled with the date. These directories are automatically created when you run URS.
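As a rough illustration of the directory layout described above, here is a minimal sketch of creating a dated sub-directory under scrapes/. The function name `dated_scrape_dir` and the ISO date label are assumptions made for this example; the label format URS actually uses may differ.

```python
import os
import tempfile
from datetime import date


def dated_scrape_dir(base_dir: str, day: date) -> str:
    """Create (if needed) and return base_dir/scrapes/<date>/."""
    # Assumption: the sub-directory label is an ISO date such as 2020-06-27.
    # The README only says the directory is "labeled with the date".
    path = os.path.join(base_dir, "scrapes", day.isoformat())
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return path


if __name__ == "__main__":
    # Use a throwaway directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        print(dated_scrape_dir(tmp, date(2020, 6, 27)))
```

Passing `exist_ok=True` means a second scrape on the same day reuses the existing directory instead of failing.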

Getting Started

It is very quick and easy to get Reddit API credentials. Refer to my guide to get your credentials, then update the API dictionary located in Credentials.py.
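For orientation, the sketch below shows what such an API dictionary might look like and how it could be handed to PRAW. The key names are assumptions chosen to match the keyword arguments of `praw.Reddit`, and the helpers `missing_credentials` and `login` are hypothetical; check Credentials.py itself for the real layout.

```python
# Hypothetical layout for the API dictionary in Credentials.py. The key names
# are assumptions that mirror the keyword arguments accepted by praw.Reddit.
API = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "user_agent": "YOUR_USER_AGENT",
    "username": "YOUR_REDDIT_USERNAME",
    "password": "YOUR_REDDIT_PASSWORD",
}


def missing_credentials(api: dict) -> list:
    """Return the keys that are empty or still hold a YOUR_... placeholder."""
    return [key for key, value in api.items()
            if not value or value.startswith("YOUR_")]


def login(api: dict):
    """Build a PRAW client from the dictionary (requires `pip install praw`)."""
    missing = missing_credentials(api)
    if missing:
        raise ValueError("Fill in these credentials first: " + ", ".join(missing))
    import praw  # imported lazily so the check above works without PRAW installed
    return praw.Reddit(**api)
```

Checking for leftover placeholders before logging in gives a clearer error than the authentication failure Reddit would otherwise return.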

A Table of All Subreddit, Redditor, and Submission Comments Attributes

These attributes are included in each scrape.

Subreddits    | Redditors                     | Submission Comments
Title         | Name                          | Parent ID
Flair         | Fullname                      | Comment ID
Date Created  | ID                            | Author
Upvotes       | Date Created                  | Date Created
Upvote Ratio  | Comment Karma                 | Upvotes
ID            | Link Karma                    | Text
Is Locked?    | Is Employee?                  | Edited?
NSFW?         | Is Friend?                    | Is Submitter?
Is Spoiler?   | Is Mod?                       | Stickied?
Stickied?     | Is Gold?                      |
URL           | *Submissions                  |
Comment Count | *Comments                     |
Text          | *Hot                          |
              | *New                          |
              | *Controversial                |
              | *Top                          |
              | *Upvoted (may be forbidden)   |
              | *Downvoted (may be forbidden) |
              | *Gilded                       |
              | *Gildings (may be forbidden)  |
              | *Hidden (may be forbidden)    |
              | *Saved (may be forbidden)     |

*Includes additional attributes; see Redditors section for more information.

Subreddits

Subreddit Demo GIF

*This GIF is uncut.

$ ./Urs.py -r SUBREDDIT [H|N|C|T|R|S] N_RESULTS_OR_KEYWORDS --FILE_FORMAT

You can specify Subreddits, the submission category, and how many results are returned from each scrape. I have also added a search option where you can search for keywords within a Subreddit.

These are the submission categories:

  • Hot
  • New
  • Controversial
  • Top
  • Rising
  • Search

Time filters may be applied to some categories. Here is a table of the categories on which you can apply a time filter as well as the valid time filters.

Categories    | Time Filters
Controversial | All (default)
Search        | Day
Top           | Hour
              | Month
              | Week
              | Year

Specify the time filter after the number of results returned or keywords you want to search for:

$ ./Urs.py -r SUBREDDIT [C|T|S] N_RESULTS_OR_KEYWORDS OPTIONAL_TIME_FILTER --FILE_FORMAT

If no time filter is specified, the default time filter all is applied. The Subreddit settings table will display None for categories that do not have the additional time filter option.
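The defaulting rules above can be sketched as a small helper. This is an illustration only, not URS's actual code: the name `resolve_time_filter`, the lowercase filter spellings, and the choice to raise an error when a filter is given for an unsupported category are all assumptions.

```python
# Categories that accept a time filter and the valid filters, taken from the
# table above. Lowercase spellings are an assumption for this sketch.
TIME_FILTER_CATEGORIES = {"C", "S", "T"}  # Controversial, Search, Top
TIME_FILTERS = {"all", "day", "hour", "month", "week", "year"}


def resolve_time_filter(category, time_filter=None):
    """Return the time filter to apply, or None if the category has none."""
    category = category.upper()
    if category not in TIME_FILTER_CATEGORIES:
        if time_filter is not None:
            raise ValueError(f"Category {category} does not accept a time filter")
        return None  # shown as None in the Subreddit settings table
    if time_filter is None:
        return "all"  # default when no time filter is specified
    time_filter = time_filter.lower()
    if time_filter not in TIME_FILTERS:
        raise ValueError(f"Invalid time filter: {time_filter}")
    return time_filter


if __name__ == "__main__":
    print(resolve_time_filter("T"))          # default for Top
    print(resolve_time_filter("H"))          # Hot has no time filter
    print(resolve_time_filter("S", "Week"))  # explicit filter for Search
```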

NOTE: Up to 100 results are returned if you search for something within a Subreddit. You will not be able to specify how many results to keep.

The file names will follow this format: "r-[SUBREDDIT]-[POST_CATEGORY]-[N_RESULTS]-result(s).[FILE_FORMAT]"

If you searched for keywords, file names are formatted like so: "r-[SUBREDDIT]-Search-'[KEYWORDS]'.[FILE_FORMAT]"

If you specified a time filter, -past-[TIME_FILTER] will be appended to the file name before the file format like so: "r-[SUBREDDIT]-[POST_CATEGORY]-[N_RESULTS]-result(s)-past-[TIME_FILTER].[FILE_FORMAT]" or "r-[SUBREDDIT]-Search-'[KEYWORDS]'-past-[TIME_FILTER].[FILE_FORMAT]"
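To make the naming scheme concrete, here is a minimal sketch that assembles Subreddit file names from the formats quoted above. The function `subreddit_filename` is hypothetical, and passing the full category name (for example "Top") is an assumption about what [POST_CATEGORY] expands to.

```python
def subreddit_filename(subreddit, category, n_results_or_keywords,
                       file_format, time_filter=None):
    """Build a Subreddit scrape file name following the documented formats."""
    if category == "Search":
        # Keyword searches quote the keywords instead of counting results.
        stem = f"r-{subreddit}-Search-'{n_results_or_keywords}'"
    else:
        stem = f"r-{subreddit}-{category}-{n_results_or_keywords}-result(s)"
    if time_filter is not None:
        # The time filter is appended just before the file extension.
        stem += f"-past-{time_filter}"
    return f"{stem}.{file_format}"


if __name__ == "__main__":
    print(subreddit_filename("askreddit", "Hot", 10, "csv"))
    print(subreddit_filename("askreddit", "Top", 25, "json", "year"))
    print(subreddit_filename("askreddit", "Search", "python tips", "json", "week"))
```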

Redditors

Redditor Demo GIF

*This GIF has been cut for demonstration purposes.

$ ./Urs.py -u USER N_RESULTS --FILE_FORMAT

Designed for JSON only.

You can also scrape Redditor profiles and specify how many results are returned.

Some Redditor attributes are sorted differently. Here is a table of how each is sorted.

Attribute Name | Sorted By/Time Filter
Comments       | Sorted By: New
Controversial  | Time Filter: All
Gilded         | Sorted By: New
Hot            | Determined by other Redditors' interactions
New            | Sorted By: New
Submissions    | Sorted By: New
Top            | Time Filter: All

Of these Redditor attributes, the following will include additional attributes:

Submissions, Hot, New, Controversial, Top, Upvoted, Downvoted, Gilded, Gildings, Hidden, and Saved | Comments
Title | Date Created
Date Created | Score
Upvotes | Text
Upvote Ratio | Parent ID
ID | Link ID
NSFW? | Edited?
Text | Stickied?
 | Replying to (title of submission or comment)
 | In Subreddit (Subreddit name)

NOTE: If you are not allowed to access a Redditor's lists, PRAW will raise a 403 HTTP Forbidden exception and the program will append "FORBIDDEN" underneath that section in the exported file.
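The fallback described in this note can be sketched roughly as follows. To keep the example runnable without Reddit credentials, it uses a local stand-in exception; with PRAW the exception to catch would be `prawcore.exceptions.Forbidden`. The helper names (`collect_section`, `public_comments`, `private_upvoted`) are hypothetical and not taken from URS.

```python
class Forbidden(Exception):
    """Stand-in for prawcore.exceptions.Forbidden (HTTP 403) in this sketch."""


def collect_section(fetch, limit):
    """Return up to `limit` items from fetch(limit), or "FORBIDDEN" on a 403."""
    try:
        # list() forces lazy listings to be fetched inside the try block.
        return list(fetch(limit))
    except Forbidden:
        return "FORBIDDEN"


def public_comments(limit):
    """Hypothetical fetcher for a list anyone may read."""
    return [f"comment {i}" for i in range(limit)]


def private_upvoted(limit):
    """Hypothetical fetcher for a list the Redditor keeps private."""
    raise Forbidden("received 403 HTTP response")


profile = {
    "comments": collect_section(public_comments, 2),
    "upvoted": collect_section(private_upvoted, 2),
}

if __name__ == "__main__":
    print(profile)
```

Wrapping each section separately means one forbidden list does not prevent the remaining sections from being exported.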

NOTE: The number of results returned is applied to all attributes. I have not implemented code to allow users to specify a different number of results for individual attributes.

The file names will follow this format: "u-[USERNAME]-[N_RESULTS]-result(s).[FILE_FORMAT]"

Submission Comments

Structured Comments Demo GIF Raw Comments Demo GIF

*These GIFs have been cut for demonstration purposes.

$ ./Urs.py -c URL N_RESULTS --FILE_FORMAT

Designed for JSON only.

You can also scrape comments from submissions and specify the number of results returned.

Comments are sorted by "Best", which is the default sorting option when you visit a submission.

There are two ways you can scrape comments: structured or raw. This is determined by the number you pass into N_RESULTS:

Scrape Type | N_RESULTS
Structured  | N_RESULTS >= 1
Raw         | N_RESULTS = 0

Structured scrapes resemble comment threads on Reddit and will include down to third-level comment replies.

Raw scrapes do not resemble comment threads, but return all comments on a submission in level order: all top-level comments are listed first, followed by all second-level comments, then third, etc.
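The difference between the two scrape types can be illustrated on a toy comment tree. This is a sketch under stated assumptions rather than URS's implementation: the nested-dictionary layout, the function names, the reading of "third-level" as three levels of nesting, and the idea that N_RESULTS limits the number of top-level comments are all assumptions.

```python
from collections import deque

# Toy thread: each comment is a dict with an id and a list of replies.
thread = [
    {"id": "a", "replies": [
        {"id": "a1", "replies": [
            {"id": "a1x", "replies": [
                {"id": "a1x-deep", "replies": []},
            ]},
        ]},
        {"id": "a2", "replies": []},
    ]},
    {"id": "b", "replies": [
        {"id": "b1", "replies": []},
    ]},
]


def raw_order(comments):
    """Level-order (breadth-first) walk: every comment, shallowest first."""
    queue = deque(comments)
    ordered = []
    while queue:
        comment = queue.popleft()
        ordered.append(comment["id"])
        queue.extend(comment["replies"])  # replies wait behind the current level
    return ordered


def structured(comments, n_results, max_depth=3):
    """Keep the first n_results top-level comments, nested max_depth levels deep."""
    def prune(comment, depth):
        replies = comment["replies"] if depth < max_depth else []
        return {"id": comment["id"],
                "replies": [prune(reply, depth + 1) for reply in replies]}
    return [prune(comment, 1) for comment in comments[:n_results]]


if __name__ == "__main__":
    print(raw_order(thread))
    print(structured(thread, 1))
```

The raw walk yields a flat list with all top-level comments first, while the structured version keeps the nesting but drops anything deeper than the third level.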

Of all scrapers included in this program, this usually takes the longest to execute. PRAW returns submission comments in level order, which means scrape speeds are proportional to the submission's popularity.

NOTE: You cannot specify the number of raw comments returned. The program will scrape all comments from the submission.

The file names will follow this format: "c-[POST_TITLE]-[N_RESULTS]-result(s).[FILE_FORMAT]"

Exporting

URS supports exporting to either CSV or JSON.

Here are my recommendations for scrape exports.

Scraper         | File Format
Subreddit/Basic | CSV or JSON
Redditor        | JSON
Comments        | JSON

Subreddit scrapes will work well with either format.

JSON is the more practical option for Redditor and submission comments scraping, which is why I have designed these scrapers to work best in this format.

It is much easier to read the scrape results since Redditor scraping returns attributes that include additional submission or comment attributes.

Comments scraping is especially easier to read because structured exports look similar to threads on Reddit. You can process all the information pertaining to a comment much quicker compared to CSV.

You can still export Redditor data and submission comments to CSV, but you will be disappointed with the results.
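A small experiment with Python's standard json and csv modules shows why nested data suits JSON better. The sample comment below is invented for illustration and does not reflect URS's exact export layout.

```python
import csv
import io
import json

# Invented sample: one comment with a single nested reply.
comment = {
    "author": "alice",
    "text": "Top-level comment",
    "replies": [
        {"author": "bob", "text": "A reply", "replies": []},
    ],
}

# JSON preserves the nesting, so the export still looks like a thread.
as_json = json.dumps(comment, indent=2)

# CSV has no notion of nesting: the whole replies list is written into one
# cell as the text of a Python list.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["author", "text", "replies"])
writer.writeheader()
writer.writerow(comment)
as_csv = buffer.getvalue()

if __name__ == "__main__":
    print(as_json)
    print(as_csv)
```

Loading the JSON back gives the original structure, whereas the CSV cell would need extra parsing to recover the replies.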

See the samples for scrapes run on June 27, 2020.

Some Linux Tips

  • You can further simplify running the program by making the program executable: sudo chmod +x Urs.py
  • Make sure the shebang at the top of Urs.py matches the location in which your Python3 is installed. You can use which python and then python --version to check. The default shebang is #!/usr/bin/python.
  • Now you will only have to prepend ./ to run URS.
    • ./Urs.py ...
  • Troubleshooting
    • You will have to set the fileformat to UNIX if you run URS with ./ and are greeted with a bad interpreter error. I did this using Vim.

      $ vim Urs.py
      :set fileformat=unix
      :wq!
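If Vim is not at hand, the same conversion can be sketched in Python. The Vim command above rewrites Windows-style CRLF line endings as Unix LF, and the functions below do likewise; their names are hypothetical and they are not part of URS.

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Replace Windows CRLF line endings with Unix LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> bool:
    """Rewrite the file at `path` in place; return True if it was changed."""
    with open(path, "rb") as handle:
        original = handle.read()
    converted = to_unix_line_endings(original)
    if converted != original:
        with open(path, "wb") as handle:
            handle.write(converted)
    return converted != original


if __name__ == "__main__":
    sample = b"#!/usr/bin/python\r\nprint('hello')\r\n"
    print(to_unix_line_endings(sample))
```

A stray carriage return at the end of the shebang line makes the shell look for an interpreter whose name ends in that character, which is what produces the bad interpreter error.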
      

Contributors

Date             | User             | Contribution
March 11, 2020   | ThereGoesMySanity | Created a pull request adding 2FA information to README
October 6, 2020  | LukeDSchenk      | Created a pull request fixing the "[Errno 36] File name too long" issue, which made it impossible to save comment scrapes with long titles
October 10, 2020 | IceBerge421      | Created a pull request fixing a cloning error occurring on Windows machines due to an illegal filename character, ", found in two scrape samples

Releases

Release Date Version Changelog
May 25, 2019 URS v1.0
  • Its inception.
July 29, 2019 URS v2.0
  • Now includes CLI support!
December 28, 2019 URS v3.0 (beta)
  • Added JSON export.
  • Added Redditor Scraping.
  • Comments scraping is still under construction.
December 31, 2019 URS v3.0 (Official)
  • Comments scraping is now working!
  • Added additional exception handling for creating filenames.
  • Minor code reformatting.
  • Simplified verbose output.
  • Added an additional submission attribute when scraping Redditors.
  • Happy New Year!
January 15, 2020 URS v3.0 (Final Release)
June 22, 2020 URS v3.1.0
  • Major code refactor. Applied OOP concepts to existing code and rewrote methods in an attempt to improve readability, maintenance, and scalability.
  • New in 3.1.0:
    • Scrapes will now be exported to the scrapes/ directory within a subdirectory corresponding to the date of the scrape. These directories are automatically created for you when you run URS.
    • Added log decorators that record what is happening during each scrape, which scrapes were run, and any errors that might arise during runtime in the log file scrapes.log. The log is stored in the same subdirectory corresponding to the date of the scrape.
    • Replaced bulky titles with minimalist titles for a cleaner look.
    • Added color to terminal output.
  • Improved naming convention for scripts.
  • Integrated Travis CI and Codecov.
  • Updated community documents located in the .github/ directory: BUG_REPORT, CONTRIBUTING, FEATURE_REQUEST, PULL_REQUEST_TEMPLATE, and STYLE_GUIDE
  • Numerous changes to Readme. The most significant change was splitting and storing walkthroughs in docs/.
June 27, 2020 URS v3.1.1
  • Added time filters for Subreddit categories (Controversial, Search, Top).
  • Updated README to reflect new changes.
  • Updated style guide. Made minor formatting changes to scripts to reflect new rules.
  • Performed DRY code review.
