This software navigates to a given URL, collects all the links on the page, records their "coordinates" (their getBoundingClientRect positions), and saves this data alongside a screenshot of the page.
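For orientation, here is a minimal sketch of that core idea in puppeteer (illustrative only, not the repo's exact implementation; the URL, selector, and output filenames are placeholders):

```js
// Minimal sketch: collect each link's href and bounding box, then save a screenshot.
const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.bing.com/search?q=example', { waitUntil: 'networkidle2' });

  // In the page context, grab every <a> element's href, text, and getBoundingClientRect().
  const links = await page.$$eval('a', anchors =>
    anchors.map(a => {
      const { x, y, width, height } = a.getBoundingClientRect();
      return { href: a.href, text: a.innerText, x, y, width, height };
    })
  );

  fs.writeFileSync('links.json', JSON.stringify(links, null, 2));
  await page.screenshot({ path: 'page.png', fullPage: true });
  await browser.close();
})();
```

collect.js is the real entry point and layers query handling, device emulation, and location spoofing on top of this idea.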
This coordinate data can be used to examine the "spatial incidence rate" of certain domains or content types (e.g. "How often do Stack Overflow links appear in the top half of a SERP?", "How often do Wikipedia links appear in the right half of a SERP?"). You could also use these coordinates to generate a "ranked list". However, for traditional ranking analyses, you may wish to examine software that includes platform-specific parsing.
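For example, a "top half" incidence check against data saved as above might look like the following sketch (the field names, file name, and viewport height are assumptions, not necessarily the repo's exact output schema):

```js
// Sketch of a "spatial incidence rate" check, assuming links were saved as JSON
// records with href/y/height fields and coordinates captured without scrolling.
const fs = require('fs');

const viewportHeight = 812; // e.g. iPhone X viewport height; swap in page height for full-page analyses
const links = JSON.parse(fs.readFileSync('links.json', 'utf8'));

const inTopHalf = link => link.y + link.height / 2 < viewportHeight / 2;
const soLinks = links.filter(l => l.href.includes('stackoverflow.com'));

const rate = soLinks.length ? soLinks.filter(inTopHalf).length / soLinks.length : 0;
console.log(`Share of Stack Overflow links in the top half: ${rate.toFixed(2)}`);
```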
While written with the goal of studying Search Engine Results Pages (SERPs), in theory the software works for any website. See examples below for scraping Google and Bing's homepages.
It uses the open source puppeteer library to automate headless browsing.
The basic concept of using puppeteer for SERP scraping is based on NikolaiT's library se-scraper.
Key differences from se-scraper:
- This repo contains a separate, more minimal implementation of the link coordinate collection without the additional scraping features from se-scraper (e.g. use of puppeteer-cluster, specific parsing rules for Google News, etc.).
- This repo focuses on spatial analysis, not ranking analyses. The results are links and their coordinates, not ranks. While this has some advantages, there are also limitations to using a spatial approach.
- This repo is currently maintained as a side-project by a grad student, and may not be updated as frequently as other similar packages. Contributions and feedback are welcome!
- node and npm (most recently run with node 10.15.3 and npm 6.4.1)
- python3 distribution (anaconda recommended)
To play with results_notebook.py, you may want to use a Jupyter-compatible tool, e.g. JupyterLab or VS Code's notebook feature (https://code.visualstudio.com/docs/python/jupyter-support).
To install the relevant node packages into a local node_modules folder, navigate to this folder (e.g. cd LinkCoordMin) and run:
npm install
A critical part of studying SERPs is generating relevant search queries. This is a huge topic, so it has a separate README!
See README in query_selection_code/
The collect.js script runs SERP collection.
There are a variety of named command line args you can pass. Check out collect.js to see the options directly, or use collect.js -h. You can also see examples below.
See EXAMPLE_RUN.sh to see how you can run 4 scripts in sequence to programmatically generate queries and save SERP data for these queries.
To run a script that:
- emulates iPhone X using puppeteer's Devices API (--device=iphonex)
- searches the Google search engine, by visiting https://www.google.com/search?q= (--platform=google)
- makes "covid_stems" queries (--queryCat=covid_stems)
- from the search_queries/prepped/covid_stems/0.txt file (--queryFile=0)
- from the uw location (University of Washington lat/long/zip) (--geoName=uw)
- to dir test (--outDir=test)
node collect.js --device=iphonex --platform=google --queryCat=covid_stems --queryFile=0 --geoName=uw --outDir=test
For Bing and no location spoofing:
node collect.js --device=iphonex --platform=bing --queryCat=covid_stems --queryFile=0 --geoName=None --outDir=output/test
For Bing on Chrome/Windows and a single test query (q = 'covid'):
node collect.js --device=chromewindows --platform=bing --queryCat=test --queryFile=0 --geoName=None --outDir=output/test0
To run Google and Bing at the same time (using & to run them in parallel):
node collect.js --device=chromewindows --platform=google --queryCat=covid_stems --queryFile=0 --geoName=None --outDir=output/covidout_mar20 & node collect.js --device=chromewindows --platform=bing --queryCat=covid_stems --queryFile=0 --geoName=None --outDir=output/covidout_mar20 & wait
This software can collect data for websites other than SERPs as well!
To scrape Reddit, we just create a queryCat called reddit. The software will look at search_queries/reddit/0.txt and visit any websites listed there.
node collect.js --device=chromewindows --platform=reddit --queryCat=reddit --queryFile=0 --geoName=None --outDir=output/reddit
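The query file here is just a plain text file listing the pages to visit; for example, search_queries/reddit/0.txt might look something like this (hypothetical contents, assuming one URL per line; check your own query files for the exact format):

```
https://www.reddit.com/
https://www.reddit.com/r/news/
https://www.reddit.com/r/AskReddit/
```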
Similarly, to visit search engine homepages:
node collect.js --device=chromewindows --platform=se --queryCat=homepages --queryFile=0 --geoName=None --outDir=output/reddit
Note that --sleepMin and --sleepMax default to 15 and 30 (seconds) respectively. You may wish to make these larger for longer jobs to avoid being rate limited (see discussion in the se-scraper repo).
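For example, for a long-running job you might pass larger values (the specific numbers here are just illustrative):
node collect.js --device=chromewindows --platform=google --queryCat=covid_stems --queryFile=0 --geoName=None --outDir=output/slow_run --sleepMin=45 --sleepMax=90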
- See covid.py for a script that collects a variety of COVID-19 related data.
- This script is a useful template for running a bunch of tasks at once, or setting up regular data collection (a Node sketch of the same idea is shown below).
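If you prefer to batch runs from Node instead of Python, a loop over a few collect.js configurations might look like this sketch (the platforms, queryCats, and output dirs below are illustrative):

```js
// Sketch of batching several collect.js runs (covid.py does the analogous thing in Python).
const { execSync } = require('child_process');

const jobs = [
  { platform: 'google', queryCat: 'covid_stems' },
  { platform: 'bing', queryCat: 'covid_stems' },
];

for (const job of jobs) {
  const cmd = `node collect.js --device=chromewindows --platform=${job.platform} ` +
    `--queryCat=${job.queryCat} --queryFile=0 --geoName=None ` +
    `--outDir=output/${job.platform}_${job.queryCat}`;
  console.log('Running:', cmd);
  execSync(cmd, { stdio: 'inherit' }); // run each collection sequentially
}
```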
- See WikipediaSERP.html for a worked example.
- See results_notebook.py for details. If you're not using an Anaconda environment, you may need to pip install dependencies like pandas, matplotlib, etc.
results_notebook.py is formatted for use with VS Code's interactive Jupyter notebook features. You can alternatively use the results_notebook.ipynb version (updated semi-regularly) or just run results_notebook.py as a Python script.
e.g. set SAVE_PLOTS to True, then run python results_notebook.py > my_results.txt
- Location spoofing is inconsistent and the feature most likely to break. If performing any location-specific analyses, consider doing extra manual validation for data quality!
- Bing mobile pages only load the top results (appears to be 4-6 items). The bottom half of the page is left with placeholder images, i.e. it hasn't loaded the full page yet. When this issue first arose, the "scrollDown" function seemed to fix it (it issues scroll actions until the bottom is reached; see the sketch after this list).
- Reddit sometimes has issues loading
- DuckDuckGo has some hard-to-replicate bugs when location spoofing is enabled.
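For reference, a scroll-to-bottom helper along those lines might look something like the following sketch (illustrative only; the repo's actual scrollDown function may differ):

```js
// Illustrative scroll-until-bottom helper for a puppeteer page.
async function scrollToBottom(page, step = 250, delayMs = 100) {
  await page.evaluate(async ({ step, delayMs }) => {
    await new Promise(resolve => {
      const timer = setInterval(() => {
        window.scrollBy(0, step);
        // Stop once we've reached (or passed) the bottom of the document.
        if (window.scrollY + window.innerHeight >= document.body.scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, delayMs);
    });
  }, { step, delayMs });
}
```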
- Pass --headless=0
- This is very useful for debugging: you can watch the web browser in real time!
- If you are interested in helping to debug any issues with the software (including new issues that may arise as SERPs change), consider using headful mode and watching the software "in action".
- Run node tests/testStealth.js to see how puppeteer-extra-stealth is doing. This library is meant to help puppeteer scripts avoid detection, i.e. so websites don't detect the script.