Python module to get the yearly UTC document registries, UTC meeting minutes and UTC actions, since 2000.
This module has various functions for interacting with UTC document registry pages since 2000, including:
- retrieval of the document registry pages and massaging the content of each page to derive the yearly registry as a list of lists (a list of rows, each a list of cell values);
- searching for text in the subject field of the registry for a specific year or all years;
- retrieving all of the UTC meeting minutes pages from all years since 2000;
- extracting motion, consensus and action-item details from the minutes of a given UTC meeting or all UTC meetings (2002 or later);
- searching for text (regex patterns) in UTC minutes pages.
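As a rough illustration of the "list of lists" registry shape and the subject-field search described above, here is a minimal sketch. The function name `search_subjects`, the column layout, and the sample rows are all invented for illustration; they are not the module's actual API or real registry data.

```python
import re

# Assumed column layout: [doc number, subject, source, date].
# The real registry rows may be laid out differently.
SUBJECT_COL = 1

def search_subjects(rows, pattern, subject_col=SUBJECT_COL):
    """Return the rows whose subject cell matches the regex (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [
        row for row in rows
        if len(row) > subject_col and rx.search(row[subject_col])
    ]

# Made-up sample registry: a list of rows, each a list of cell values.
sample_registry = [
    ["DOC-001", "Proposal to encode a sample script", "A. Author", "2000-01-05"],
    ["DOC-002", "Comments on line breaking", "B. Author", "2000-01-12"],
    ["DOC-003", "Revised sample script proposal", "A. Author", "2000-02-01"],
]

matches = search_subjects(sample_registry, r"sample script")
for row in matches:
    print(row[0], row[1])
```

Searching all years would then amount to running the same filter over each year's list of rows and concatenating the results.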
To avoid repeating page retrievals on each use, or repeating other slow operations like processing the raw HTML pages, HTML page contents and other results are stored locally using the Python pickle module. If the .pickle file for pages or other content isn't present, the slower operations will be run and a new .pickle file will be generated. When the module is loaded, the registry page for the latest year will be retrieved to update the local cache.
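The caching behaviour described above follows a common "load if present, otherwise build and save" pattern. The sketch below shows that pattern with the standard pickle module; the helper name `load_or_build`, the file name, and the stand-in `slow_build` function are hypothetical and are not taken from the module itself.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_build(cache_path, build):
    """Return the cached object if the pickle file exists; otherwise build and cache it."""
    path = Path(cache_path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = build()  # the slow operation (e.g. fetching or parsing pages)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def slow_build():
    """Stand-in for a slow step; records each call so repeated work is visible."""
    calls.append(1)
    return {2000: "<html>placeholder page content</html>"}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "pages.pickle"  # hypothetical file name
    first = load_or_build(cache_file, slow_build)   # builds and writes the file
    second = load_or_build(cache_file, slow_build)  # loads from the file

print(len(calls))
```

Deleting the .pickle file forces the slow path to run again on the next call, which matches the regeneration behaviour described above.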
The module has a hard-coded list of URLs for the yearly UTC document registry pages. (Actually, it's a dictionary: {year: url}.) That will need to be maintained year by year to add additional years.
There are functions to update certain data, such as the .pickle file with the raw HTML pages. These functions only update for the most recent year (since the UTC doc register for the current year is live and frequently updated), or for recent years that are missing from the data (as the list of per-year URLs is updated).
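One way to picture the update rule above (refresh the most recent year, plus any years listed in the URL dictionary but missing from the cached data) is the sketch below. The names `years_to_refresh` and `REGISTRY_URLS` and the placeholder URLs are assumptions made for illustration; the module's real dictionary and function names may differ.

```python
def years_to_refresh(registry_urls, cached_pages):
    """Return the sorted years to fetch: years missing from the cache plus the latest year."""
    if not registry_urls:
        return []
    latest = max(registry_urls)  # the current year's register is live, so always refresh it
    missing = {year for year in registry_urls if year not in cached_pages}
    return sorted(missing | {latest})

# Hypothetical {year: url} dictionary with placeholder URLs.
REGISTRY_URLS = {
    2021: "https://example.invalid/registry-2021.html",
    2022: "https://example.invalid/registry-2022.html",
    2023: "https://example.invalid/registry-2023.html",
    2024: "https://example.invalid/registry-2024.html",
}

# Pretend the cache already holds every year except 2023.
cached = {2021: "<html>", 2022: "<html>", 2024: "<html>"}

to_fetch = years_to_refresh(REGISTRY_URLS, cached)
print(to_fetch)
```

Older years that are already cached are left alone, since their registry pages are no longer changing.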
This module relies on some packages that are not typically bundled with Python distributions:
- requests: provides high-level HTTP support, used here to get pages
- BeautifulSoup: provides support for parsing HTML content
- lxml: low-level XML and HTML parsing support, utilized here in conjunction with BeautifulSoup
Dependencies are captured in the requirements.txt file and can be installed using the following command line:
python -m pip install -r requirements.txt
Creating and activating a virtual environment before installing dependencies is recommended. For example, on Windows:
python -m venv venv
venv\Scripts\activate
If a bs4.FeatureNotFound error is encountered regarding lxml, try uninstalling then re-installing lxml:
pip uninstall lxml
pip install lxml
This module makes it easy to get a compilation of actions from UTC minutes, from meeting 90 through the most recent meeting minutes in the doc registry. Here are the steps:
- In a terminal, navigate into the UTC_Actions subfolder from the project root.
- Activate the virtual environment (if created).
- Launch a Python REPL.
- Execute the following:
>>> import utc_actions
>>> from utc_actions import *
>>> writeToFileTaggedActionsFromAllMinutes("UTC-actions_90-187.txt")
In that last line, the file name assumes that meeting 187 is the most recent; adjust as appropriate.
The following lines illustrate how to get a compilation of only decisions, action items, or recorded notes:
writeToFileTaggedActionsFromAllMinutes("UTC-decisions_90-187.txt","decision")
writeToFileTaggedActionsFromAllMinutes("UTC-actionitems_90-187.txt","ai")
writeToFileTaggedActionsFromAllMinutes("UTC-notes_90-187.txt","note")
It's possible to extract details for just a specific meeting or range of meetings, but that would require some extra coding. It's easiest (and fast) just to get details for all meetings and then use only the portion of the .txt file that you want.