This is a scraper for the Corona/COVID-19 dashboard for Berlin, as issued by the Senatsverwaltung für Gesundheit, Pflege und Gleichstellung (Senate Department for Health, Care and Equality) and the Landesamt für Gesundheit und Soziales (Regional Office for Health and Social Affairs). The dashboard includes daily case numbers by district and age groups, as well as the "Corona traffic light"-indicators (incidence of new infections per week, ICU occupancy rate, relative change in incidence).
- Change in the 7-day incidence was dropped again on September 2nd, 2021. Instead, the 7-day indicence for hospitalisation was introduced as a new traffic light indicator on that day. Also, the color scheme for the incidence traffic light was adjusted: where before it was yellow at >=20% and red at >=40%, it is now yellow at >=35% and red at >=100%. This of course makes it pointless to compare the color values before and after 2021-09-02.
- The basic reproduction number R was included until July 22nd, 2021.
After that, R was dropped because it was no longer deemed a useful indicator.
Instead, the relative change in incidence is now the third indicator in the corona traffic light.
R is still included in the output data (as
0.0
), since some apps might expect to find it there. - Starting with February 15, 2021, the absolute numbers and percentages for administered vaccinations are included in the dashboard.
- Starting with November 11, 2020, the change in the 7-day incidence is also included in the traffic light indicators.
- As of August 31, 2020, the dashboard replaces the previously used daily press releases containing the same data. There were two separate press releases each day, one with the case numbers (see here for the last one) and one with the traffic light indicators (see here for the last one of those).
The output of the scraper are timelines of data extracted from the individual press releases.
Note that there is no data for 2021-05-06, because on that day, the reporting was changed from same-day-evening to next-day-noon. See this note for more information.
The timeline data generated by the case number scraper is a JSON file in data/target/berlin_corona_cases.json, structured as follows:
[
{
"date": "2020-09-23",
"source": "https://www.berlin.de/corona/lagebericht/desktop/corona.html",
"counts_per_district": {
"lor_01": {
"case_count": 2141,
"indicence": 555.0,
"recovered": 1894
},
"lor_02": {
"case_count": 1359,
"indicence": 468.0,
"recovered": 1176
},
...
"lor_12": {
"case_count": 974,
"indicence": 365.6,
"recovered": 897
}
},
"counts_per_age_group": {
"0-4": {
"case_count": 358,
"incidence": 188.7
},
"5-9": {
"case_count": 406,
"incidence": 236.5
},
...
"90+": {
"case_count": 156,
"incidence": 500.6
},
"unknown": {
"case_count": 15,
"indidence": "n.a."
}
}
},
{
"date": "2020-09-22",
"source": "https://www.berlin.de/corona/lagebericht/desktop/corona.html",
"counts_per_district": {
"lor_01": {
"case_count": 2116,
"indicence": 548.5,
"recovered": 1877
},
"lor_02": {
"case_count": 1330,
"indicence": 458.0,
"recovered": 1154
},
...
"lor_12": {
"case_count": 967,
"indicence": 363.0,
"recovered": 890
}
},
"counts_per_age_group": {
"0-4": {
"case_count": 356,
"incidence": 187.6
},
"5-9": {
"case_count": 399,
"incidence": 232.4
},
...
"90+": {
"case_count": 155,
"incidence": 497.4
},
"unknown": {
"case_count": 15,
"indidence": "n.a."
}
}
},
...
]
The structure of the data is a JSON array with objects for each day.
For each day, the source
is specified (where was the data scraped from – this used to be a particular press release, now it is always the dashboard), the date
(of the data), the counts_per_district
and the counts_per_age_group
.
For the days scraped from individual press releases, there is also a pr_date
attribute (when was the press release issued), because pr_date
and date
are not always the same day.
The counts_per_district
objects are structured with a key for each district, which in turn contain the actual numbers for the total case_count
, incidence
and number of recovered
cases.
The district keys are their LOR codes (see the dataset Lebensweltlich orientierte Räume (LOR) in Berlin for a complete definition of each LOR code):
{
"lor_01": "Mitte",
"lor_02": "Friedrichshain-Kreuzberg",
"lor_03": "Pankow",
"lor_04": "Charlottenburg-Wilmersdorf",
"lor_05": "Spandau",
"lor_06": "Steglitz-Zehlendorf",
"lor_07": "Tempelhof-Schöneberg",
"lor_08": "Neukölln",
"lor_09": "Treptow-Köpenick",
"lor_10": "Marzahn-Hellersdorf",
"lor_11": "Lichtenberg",
"lor_12": "Reinickendorf"
}
The counts_per_age_group
objects are similarly structured, with a key for each age group, which in turn contain the case_count
and incidence
(no recovered
).
The age group 80+
was split into 80-89
and 90+
beginning May 11th (2020-05-11).
There is a special unknown
age group for which the incidence
is always n.a.
Some of the earlier press releases had a slightly different format, or were only available as screen shots (true story), so the old press release scraper did not work for them. Rather than writing special code for extracting these one- or two-off cases, I manually extracted them and put them in data/manual/manually_extracted.json. When creating the complete timeline, this manually extracted data was then merged with the newly scraped data.
The traffic light and vaccination data generated by the scraper is a JSON file located in data/target/berlin_corona_traffic_light.json. This data contains a timeline of how the traffic light indicators changed over time.
Starting February 15, 2021, the Corona dashboard includes data on vaccination. This has been added to the traffic light JSON file. While it is not strictly speaking one of the traffic light indicators, it is still an important indicator for the overall situation regarding the pandemic in Berlin.
There is a second JSON file in data/target/berlin_corona_traffic_light.latest.json which always contains the latest traffic light indicators.
The structure is as follows:
[
{
"source": "https://www.berlin.de/corona/lagebericht/desktop/corona.html",
"pr_date": "2021-09-02",
"indicators": {
"basic_reproduction_number": {
"color": "",
"value": 0.0
},
"incidence_new_infections": {
"color": "yellow",
"value": 83.2
},
"icu_occupancy_rate": {
"color": "yellow",
"value": 5.4
},
"change_incidence": {
"color": "",
"value": 0.0
},
"incidence_hospitalisation": {
"color": "green",
"value": 1.3
}
},
"vaccination": {
"total_administered": 4503235,
"percentage_one_dose": 65.2,
"percentage_two_doses": 60.4
}
},
...
{
"source": "https://www.berlin.de/sen/gpg/service/presse/2020/pressemitteilung.982682.php",
"pr_date": "2020-08-30",
"indicators": {
"basic_reproduction_number": {
"value": 1.1,
"color": "green"
},
"incidence_new_infections": {
"value": 12.5,
"color": "green"
},
"icu_occupancy_rate": {
"value": 1.5,
"color": "green"
}
}
},
...
]
The structure of the data is a JSON array with objects for day.
Each day specifies the source
(where was the data scraped from – this used to be a particular press release, now it is always the dashboard), the pr_date
(date when this particular set of indicators was announced – this used to be the date of the press release), an indicators
object and a vaccination
object (starting 2021-02-15).
indicators
in turn contains the indicators incidence_new_infections
(incidence of new infections per 100,000 inhabitants per week) and icu_occupancy_rate
(the ICU occupancy rate in %: which percentage of the available ICU capacity is currently being used).
On 2020-11-11 another indicator was introduced: the change in 7-day incidence ("Veränderung der 7-Tage-Inzidenz").
This indicator is included as change_incidence
(the number shows the change in percent).
change_incidence
was dropped again on 2021-09-02 (still included as 0.0
for backwards-compatibility) and replaced with a new indicator incidence_hospitalisation
.
basic_reproduction_number
(basic reproduction number R) was recorded until 2021-07-22.
After that, it is only included as 0.0
, in case applications rely on it to be there.
Each indicator has a numeric value
and a traffic light color
-code (one of [green
, yellow
, red
]).
For the exact meaning of color codes please refer to the corona dashboard.
Note that the meaning of the color codes was adjusted on 2021-09-02.
vaccination
shows the total number of administered doses of COVID-19 vaccinations, as well as the percentage of the population that has received one or two doses, respectively.
The scraper runs automatically every day, several times starting midday (around the time when the dashboard is usually updated). Previously, the scraper ran in the afternoon. This changed on 2021-05-07, when the update times for the dashboard moved to noon the following day.
To run the scraper at the specified times, I have defined a workflow in .github/workflows/scraper.yml for GitHub Actions, GitHub's continuous integration framework. The workflow
- sets up a virtual machine,
- checks out the repository,
- runs the scraper and
- commits and pushes the updated data if there are changes.
Setting up the workflow was surprisingly easy, so I'd definitely recommend this if you want to regularly run a scraper and don't want to push the button yourself every day!
@jaimergp and later also @graste recommended doing this, and I'm very grateful for the inspiration! I had no idea gh-actions included a cron-based trigger that makes this possible ...
@simonw's blog post at https://simonwillison.net/2020/Oct/9/git-scraping/ is a very good starting point if you want to learn how to do git-scraping.
If you want to run the scraper yourself manually (maybe you want to improve it), you can. Here is how:
- Ruby 2.7.1 (most likely works with other versions too, I haven't tried)
- the Nokogiri gem
- the jq JSON processor (jq is not a hard requirement anymore – it's only used to extract the current day for data/target/berlin_corona_traffic_light.latest.json).
- First, make sure you have Ruby installed.
- If you have the Bundler tool, you can install the gem (Ruby library) dependencies like this:
$ bundler install
# ... output from bundler ...
- If you don't have bundler (you should get it), just install the dependencies individually with
gem
. Since there is currently really only one non-standard gem dependency, this is no less convenient:
$ gem install nokogiri
# ... output from gem ...
- Finally, download this repository to a place of your choosing, or use
git clone
.
There is a Makefile that orchestrates the scraping.
The targets should be easy to understand (each one has an echo
statement that verbosely says what it does).
To create the complete timeline for both case numbers and traffic light indicators, use the all
target.
You should get output like this:
$ make all
deleting temp folder ...
creating temp directory ...
running corona dashboard parser ...
I, [2020-09-24T22:17:27.647621 #99403] INFO -- : reading current case number file data/target/berlin_corona_cases.json ...
I, [2020-09-24T22:17:27.664712 #99403] INFO -- : reading current traffic light file data/target/berlin_corona_traffic_light.json ...
I, [2020-09-24T22:17:27.666040 #99403] INFO -- : loading and parsing dashboard from https://www.berlin.de/corona/lagebericht/desktop/corona.html ...
I, [2020-09-24T22:17:30.703220 #99403] INFO -- : getting publication date ...
I, [2020-09-24T22:17:30.712117 #99403] INFO -- : publication date is 2020-09-24 ...
I, [2020-09-24T22:17:30.712160 #99403] INFO -- : extracting case number data ...
I, [2020-09-24T22:17:30.733268 #99403] INFO -- : extracting traffic light data ...
copying data from data/temp/berlin_corona_cases.json to data/target/berlin_corona_cases.json ...
copying data from data/temp/berlin_corona_traffic_light.json to data/target/berlin_corona_traffic_light.json ...
extracting latest set of traffic light indicators from data/target/berlin_corona_traffic_light.json ...
writing to data/target/berlin_corona_traffic_light.latest.json ...
write current date ...
update README.md with current date
Initially, I had written a Python-based scraper for the daily press releases, using the Scrapy Web-scraping framework. When SenGPG stopped publishing those press releases, the scraper was no longer functional. For a few days I "scraped" manually, until I got a new scraper for the dashboard running. The new scraper is Ruby-based and uses Nokogiri. This doesn't mean I now think Scrapy is bad – far from it, it's great! It's just that for the task of scraping all individual Corona press releases, starting from a paged index of press releases, Scrapy has some pretty handy functionality. Also, I just wanted to try it out. For simply scraping the same dashboard page every day, Scrapy seemed overkill, and I have more experience using Ruby and Nokogiri. It was just easier for me personally.
If you're looking for the old Scrapy-based scraper, you can still find it in release 0.2.4.
- virus logo by FontAwesome under CC BY 4.0.
All software in this repository is published under the MIT License.
All data in this repository (in particular the .json
files) is published under CC BY 3.0 DE.
I do not make any claims that the data in data/target/berlin_corona_cases.json is correct! If you find bugs in the code or in the data, please let me know by opening an issue here.
2020, Knud Möller
Repository: https://github.com/knudmoeller/berlin_corona_cases
Last changed: 2023-08-24