-
-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing Issue #728 #1131
base: main
Are you sure you want to change the base?
Fixing Issue #728 #1131
Conversation
WalkthroughA new spider class, Changes
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
I did most of the style changes for flake8. There's just the issue of having a line too long in the test for testing links. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 12
🧹 Outside diff range and nitpick comments (9)
tests/test_chi_northeastern_il_university.py (5)
27-31
: Consider expanding title tests for better coverage.While the current test checks the titles of the first two parsed items, it might be beneficial to test more items or use a parameterized test to cover a wider range of cases. This would help ensure that all types of titles are correctly parsed.
34-47
: Consider expanding tests for better coverage of parsed items.The current tests for description, start time, end time, and time notes only check the first parsed item. To ensure comprehensive coverage, consider:
- Testing these attributes for multiple items.
- Using parameterized tests to cover various scenarios.
- Adding edge cases, such as items with unusual time formats or descriptions.
This would help catch potential parsing issues across different types of entries.
50-58
: Enhance robustness of id, status, and classification tests.
For the
test_id
function:
- Consider using a dynamic date based on the frozen time instead of a hardcoded date to make the test more resilient to changes in test data.
- Example:
f"chi_northeastern_il_university/{parsed_items[0]['start'].strftime('%Y%m%d%H%M')}/x/board_meeting"
For all three tests:
- Consider testing multiple items or using parameterized tests to ensure comprehensive coverage across different types of entries.
These changes would make the tests more robust and less prone to failing due to changes in test data.
Also applies to: 97-98
72-94
: Consider expanding source and links tests for better coverage.While the current tests for source and links are thorough for the first item, consider:
- Testing the source for multiple items to ensure consistency across all parsed entries.
- Using parameterized tests for links to cover various scenarios (e.g., items with different numbers of links, different link types).
- Adding edge cases, such as items with no links or unusual link formats.
This would help ensure that the parsing logic handles various cases correctly across all items.
1-103
: Overall: Well-structured test suite with room for expanded coverage.This test file for the
ChiNortheasternIlUniversitySpider
is well-organized and covers all major attributes of the parsed items. The use of pytest and freezegun demonstrates good testing practices, ensuring consistent and reliable tests.Areas for improvement:
- Expand coverage for tests that currently only check the first item (e.g., description, start/end times, source, links).
- Utilize more parameterized tests to cover various scenarios and edge cases.
- Consider adding tests for error handling and edge cases in the spider's parsing logic.
These enhancements would further strengthen the test suite and help catch potential issues across a wider range of scenarios.
city_scrapers/spiders/chi_northeastern_il_university.py (1)
52-54
: Remove unused methodgetMeetingDetails
The method
getMeetingDetails
is not called anywhere in the code. Removing unused methods helps keep the codebase clean and maintainable.Consider removing the method:
-def getMeetingDetails(self, response): - print(response.text)test2.html (3)
164-175
: Missing hyperlinks in navigation headersIn the "Academics" navigation menu (lines 164-222), some primary list items like "Taking Classes," "Academic Units," and "Opportunities" lack hyperlinks. Users might expect these headers to be clickable.
Consider adding hyperlinks to these headers or restructuring the menu to improve user experience.
Also applies to: 179-195, 199-222
475-485
: Optimize social media image alt attributesWhile alt attributes are provided for social media icons, consider making them more descriptive for better accessibility. For example, instead of "Facebook Icon," use "Follow us on Facebook."
Update the alt attributes:
- <img alt="Facebook Icon" src="/sites/default/files/2021-10/facebook_logo_white.png"> + <img alt="Follow us on Facebook" src="/sites/default/files/2021-10/facebook_logo_white.png">Apply similar changes to other social media icons.
978-980
: Format address for better readability and accessibilityThe address is presented using
<br />
tags (lines 978-980). This can be problematic for screen readers.Consider using proper block elements like
<p>
or<address>
for semantic markup.- <p style="text-align: center;"><span>Northeastern Illinois University</span><br /> - 5500 North St. Louis Avenue<br /> - Chicago, Illinois 60625-4699<br /> - (773) 583-4050</p> + <address style="text-align: center;"> + <span>Northeastern Illinois University</span><br /> + 5500 North St. Louis Avenue<br /> + Chicago, Illinois 60625-4699<br /> + (773) 583-4050 + </address>
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- city_scrapers/spiders/chi_northeastern_il_university.py (1 hunks)
- test2.html (1 hunks)
- tests/files/chi_northeastern_il_university.html (1 hunks)
- tests/test_chi_northeastern_il_university.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/files/chi_northeastern_il_university.html
🧰 Additional context used
🪛 Ruff
city_scrapers/spiders/chi_northeastern_il_university.py
56-56: Use
item != ""
instead ofnot item == ""
Replace with
!=
operator(SIM201)
🔇 Additional comments (3)
tests/test_chi_northeastern_il_university.py (3)
1-25
: LGTM: Imports and test setup are well-structured.The imports, test response setup, spider instantiation, and time freezing are all implemented correctly. This provides a solid foundation for the subsequent tests.
61-69
: LGTM: Comprehensive location testing.The
test_location
function provides good coverage by testing multiple items and checking both the name and address of the locations. This approach helps ensure that different location formats are correctly parsed.
101-103
: LGTM: Excellent use of parameterized testing for 'all_day' attribute.The
test_all_day
function uses pytest's parameterized testing feature effectively to check the 'all_day' attribute across all parsed items. This approach ensures comprehensive coverage and helps catch any inconsistencies in the parsing of this attribute.
head = meeting.css("h4.accordion::text").get().split() | ||
if len(head) >= 3: | ||
date = " ".join(head[:3]) | ||
title = " ".join(head[3:]) if len(head) > 3 else "" | ||
else: | ||
date = head | ||
title = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Improve date and title parsing logic
The current logic for parsing date
and title
from head
may fail if the format of head
changes or doesn't meet expectations. This can lead to incorrect data extraction.
Consider using more robust parsing by checking the content of head
:
def parse(self, response):
for meeting in response.css("div.board-meeting-materials-row.views-row"):
head = meeting.css("h4.accordion::text").get().split()
- if len(head) >= 3:
- date = " ".join(head[:3])
- title = " ".join(head[3:]) if len(head) > 3 else ""
- else:
- date = head
- title = ""
+ date_parts = []
+ title_parts = []
+ for part in head:
+ if re.match(r'\w+ \d{1,2}, \d{4}', ' '.join(date_parts + [part])):
+ date_parts.append(part)
+ else:
+ title_parts.append(part)
+ date = ' '.join(date_parts)
+ title = ' '.join(title_parts) if title_parts else ""
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
head = meeting.css("h4.accordion::text").get().split() | |
if len(head) >= 3: | |
date = " ".join(head[:3]) | |
title = " ".join(head[3:]) if len(head) > 3 else "" | |
else: | |
date = head | |
title = "" | |
head = meeting.css("h4.accordion::text").get().split() | |
date_parts = [] | |
title_parts = [] | |
for part in head: | |
if re.match(r'\w+ \d{1,2}, \d{4}', ' '.join(date_parts + [part])): | |
date_parts.append(part) | |
else: | |
title_parts.append(part) | |
date = ' '.join(date_parts) | |
title = ' '.join(title_parts) if title_parts else "" |
time = re.findall(pattern, parse)[-1] | ||
midDay = re.search(replacementPattern, time).group(0) | ||
trueTime = ( | ||
time.replace(midDay, " AM").strip() | ||
if "a" in midDay | ||
else time.replace(midDay, " PM").strip() | ||
) | ||
fullDate = date + " " + trueTime | ||
return datetime.strptime(fullDate, "%B %d, %Y %I:%M %p") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error handling for empty regex results in _parse_end
In the _parse_end
method, if parse
is None
or the regex search doesn't find any matches, accessing [-1]
on the result of re.findall
will raise an IndexError
.
Ensure that parse
is not None
and that re.findall
returns a non-empty list before proceeding:
def _parse_end(self, date, parse):
+ if not parse:
+ # Handle the case where parse is None or empty
+ return None
pattern = re.compile(
r"\d{1,2}:\d{1,2}.[a-z]{0,1}\.{0,1}[a-z]{0,1}\.{0,1}", re.MULTILINE
)
replacementPattern = re.compile("[^0-9:].*")
times = re.findall(pattern, parse)
+ if not times:
+ # Handle the absence of time strings
+ return None
- time = times[-1]
+ time = times[-1]
Committable suggestion was skipped due to low confidence.
res = requests.get(agenda) | ||
details = extract_text(BytesIO(res.content)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid using requests
within Scrapy spiders
Using requests.get
at lines 32-33 is not recommended within Scrapy spiders, as it bypasses Scrapy's built-in features like asynchronous request handling, caching, and adherence to robots.txt
. Instead, utilize Scrapy's Request
or response.follow
methods to make HTTP requests.
Refactor the code to use scrapy.Request
and handle the agenda PDF download asynchronously. Here's how you can modify the code:
-import requests
+from scrapy import Request
...
- res = requests.get(agenda)
- details = extract_text(BytesIO(res.content))
+ yield Request(
+ url=agenda,
+ callback=self.parse_agenda,
+ meta={'date': date, 'title': title, 'links': links}
+ )
# Then, define the parse_agenda method to handle the response
+ def parse_agenda(self, response):
+ details = extract_text(BytesIO(response.body))
+ meeting = Meeting(
+ title=self._parse_title(response.meta['title']),
+ description="",
+ classification=self._parse_classification(response.meta['title']),
+ start=self._parse_start(response.meta['date'], details),
+ end=self._parse_end(response.meta['date'], details),
+ all_day=self._parse_all_day(),
+ time_notes="",
+ location=self._parse_location(details),
+ links=response.meta['links'],
+ source=self._parse_source(response),
+ )
+ meeting["status"] = self._get_status(meeting)
+ meeting["id"] = self._get_id(meeting)
+ yield meeting
Committable suggestion was skipped due to low confidence.
pattern = re.compile(r"(\d\d\d\d.*\n?)(?=\s*Meeting)", re.MULTILINE) | ||
match = re.search(pattern, item) | ||
location = match.group(1).strip().split("|") | ||
return { | ||
"address": location[0].strip() + ", " + location[1].strip(), | ||
"name": location[2].strip(), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error handling for regex search in _parse_location
In _parse_location
, re.search(pattern, item)
may return None
if no match is found. Attempting to access match.group(1)
when match
is None
will raise an AttributeError
.
Include a check to handle cases where match
is None
:
def _parse_location(self, item):
pattern = re.compile(r"(\d\d\d\d.*\n?)(?=\s*Meeting)", re.MULTILINE)
match = re.search(pattern, item)
+ if not match:
+ # Handle the absence of a matching location string
+ return {"address": "", "name": ""}
location = match.group(1).strip().split("|")
Committable suggestion was skipped due to low confidence.
<div class="mega-nav-version-2-link-div" id="mega-nav-version-2-link-financial-aid-div"> | ||
<a class="mega-nav-version-2-link" id="mega-nav-version-2-link-financial-aid" href="https://admissions.neiu.edu/"> | ||
Admissions | ||
</a> | ||
<div class="mega-nav-version-2-link-dropdown" id="mega-nav-version-2-link-dropdown-financial-aid"> | ||
<p class="mega-nav-version-2-dropdown-heading">Learn About NEIU</p> | ||
<div class="mega-nav-version-2-dropdown-row"> | ||
<div class="mega-nav-version-2-dropdown-col"> | ||
<ul> | ||
<li><a href="https://admissions.neiu.edu/visit">Visit</a></li> | ||
<li><a href="/academics/programs">Our Programs</a></li> | ||
<li><a href="https://admissions.neiu.edu/admission-requirements">Requirements</a></li> | ||
<li><a href="https://admissions.neiu.edu/admitted-students">Admitted Students</a></li> | ||
<li><a href="https://admissions.neiu.edu/your-success">Your Success</a></li> | ||
<li><a href="/mosaic">Profiles</a></li> | ||
</ul> | ||
|
||
</div> | ||
</div> | ||
</div> | ||
</div> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate id
attributes in navigation menu
The id
attribute mega-nav-version-2-link-financial-aid-div
is used multiple times in the navigation menu (lines 90 and 112). IDs must be unique within an HTML document to ensure proper functioning of scripts and styles.
Apply this diff to make the IDs unique:
- <div class="mega-nav-version-2-link-div" id="mega-nav-version-2-link-financial-aid-div">
+ <div class="mega-nav-version-2-link-div" id="mega-nav-version-2-link-financial-aid-div-2">
Ensure to update any corresponding references in your CSS and JavaScript code.
Also applies to: 112-153
I'll start checking what the ai reviewer suggests at a later date. |
Summary
Issue: #728
Checklist
I know the issue is "claimed" and assigned to someone else but it's been 4 years
and I can't find the spider in the most recent repo so I decided to do it.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation