Scrapers and parsers for the prisontalk.com forums. The approach is to download the HTML pages first and parse them after the fact. Each parser has ‘test pages’ in the examples folder which cover the known errors, differences, and edge cases between pages.
IMPORTANT: PLEASE MAKE SURE TO SEE NOTE BELOW REGARDING THE CREATION OF USERNAMES
The notebooks in this repository provide the following features and extract the following data:
Prison Talk Thread Scraper
- Log in/out
- Get list of forums.
- Get list of threads in forums.
- Download HTML pages of posts.
- Automatic randomized breaks (i.e. throttling, so the traffic looks more like a human)
- Logging progress
- Sending email progress reports
- Sending critical failures as text messages
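
The download-then-parse approach with randomized breaks could be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: `download_pages`, the `fetch` callable, the `page_00000.html` file naming, and the default pause range are all assumptions made for the example. In practice `fetch` would wrap whatever (possibly logged-in) HTTP session the scraper uses.

```python
import logging
import random
import time
from pathlib import Path
from typing import Callable, Iterable, List, Optional

log = logging.getLogger(__name__)


def download_pages(
    urls: Iterable[str],
    fetch: Callable[[str], str],          # hypothetical: returns the HTML for a URL
    out_dir: Path,
    min_s: float = 2.0,                   # assumed pause range, in seconds
    max_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> List[Path]:
    """Save each page to disk, pausing a random interval after every download."""
    rng = rng or random.Random()
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    for i, url in enumerate(urls):
        target = out_dir / f"page_{i:05d}.html"
        if target.exists():
            # Already downloaded on an earlier run: keep it and do not pause.
            saved.append(target)
            continue
        target.write_text(fetch(url), encoding="utf-8")
        saved.append(target)
        log.info("saved %s -> %s", url, target.name)
        # Randomized break between requests (throttling).
        sleep(rng.uniform(min_s, max_s))
    return saved
```

Because the raw HTML is kept on disk, a parser can be fixed and re-run later without hitting the site again, and an interrupted run can resume without re-downloading pages it already has.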
Prison Talk Forum List Parser
- Forum number
- Forum name
- Number of threads
- Number of posts
- Description
Prison Talk Thread Parser
- Thread ID
- Thread title
- Thread replies
- Number of views
Prison Talk Archive Parser/Scraper
Prisontalk.com has an archive version which is very simple to parse. It is used to parse the actual content of each thread, which is then joined with other information that can only be obtained from the live site.
- Date
- Post ID
- User
- Post body
Prison Talk Thank You Scraper
Prison Talk has a gratitude mechanism in which users can ‘thank’ someone for their post. This is one piece of information the live site contains which is not available on the archive.
- Date of thanks
- The post ID of the thanked post
- Thanker username
- Thanker user ID
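
Since the archive posts and the thanks from the live site both carry a post ID, joining them could look something like the sketch below. The function name `attach_thanks` and the dictionary keys (`post_id`, `thanker`, and so on) are hypothetical; the actual column names in the notebooks may differ.

```python
from collections import defaultdict
from typing import Any, Dict, List

Record = Dict[str, Any]


def attach_thanks(posts: List[Record], thanks: List[Record]) -> List[Record]:
    """Return a copy of each post with the list of thanks it received."""
    # Group thanks by the ID of the post that was thanked.
    by_post: Dict[Any, List[Record]] = defaultdict(list)
    for t in thanks:
        by_post[t["post_id"]].append(t)
    # Posts without thanks get an empty list; thanks whose post is
    # missing from the archive are simply dropped.
    return [{**p, "thanks": by_post.get(p["post_id"], [])} for p in posts]


# Example with made-up records:
posts = [
    {"post_id": 1, "user": "a", "body": "first post"},
    {"post_id": 2, "user": "b", "body": "second post"},
]
thanks = [
    {"post_id": 1, "thanker": "c", "thanker_id": 10},
    {"post_id": 1, "thanker": "d", "thanker_id": 11},
]
merged = attach_thanks(posts, thanks)
```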
Prison Talk Members List Parser
- Member ID
- Username
- Role (e.g. admin)
- Join date
- Number of posts
- Last visit
- Profile pic
- Birthday
Prison Talk User Profile Parser (requires login)
- Username
- Member ID
- About Me
- Last Post
- Number of thanks received/given
- Number of friends
- Last visit
- Join date
OTHER: Notebooks from a smaller version of this project, which perform topic modeling on the data from the archive. They can’t yet be simply ‘plugged in’ with the rest of the data, but changes will be made so that they can be used in the future.
Prison Talk Topic Modeling
Prison Talk Document Processing
There are some circumstances (only one, if I recall) in which you may need to log in in order to scrape the pages. You can find this out by simply using the site. For reasons I’ll get to below, if you do not have to log in to the site (even when simply exploring it), don’t.
So look in the code for places to add your username and password where necessary, but note (and I cannot stress this enough):
DO NOT UNDER ANY CIRCUMSTANCES DO THE FOLLOWING:
- Create more than one username from the same IP address.
- Use any of the same or similar identifying information when creating a username and password.
- Log in with more than one username from the same IP address.
This will be detected, and they will request that the usernames be merged. Repeated offenses will probably get you banned. Please see the ‘cookies’ section in Prison Talk Scrapers.ipynb for more information.