I have been learning and competing in ballroom dancing. Some organizers use software called "S3" for scoring. It can generate HTML pages so that organizers can publish competition results (scores and rankings) online. I wrote a Python 2 script to scrape the rankings from those pages.
S3 also generates a raw data file with a .ska extension. It is a machine-readable format. Unfortunately, it records only the raw scores and does not list the calculated rankings. To determine the ranking for each event from that file, I would have to implement an algorithm to process the scores. I could do that, but I thought scraping the generated HTML pages would be easier.
The main difficulty was the lack of CSS classes and IDs. Most HTML parsers can select elements by class or ID. Unfortunately, the generated pages use neither, so it took me quite some time to work out how to locate the elements that hold the data. The table-in-table layout makes it worse.
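To illustrate the kind of positional selection this forces, here is a minimal sketch, not the actual script. The markup and the couple names are made up for the example, the function name `extract_rows` is hypothetical, and the code is written for Python 3 with Beautiful Soup 4 rather than the Python 2 I used. It assumes the data sits in the innermost table, which may not hold for every S3 page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an outer layout table wrapping an inner results table,
# with no class or id attributes anywhere.
HTML = """
<table>
  <tr><td>
    <table>
      <tr><td>Rank</td><td>Couple</td></tr>
      <tr><td>1</td><td>Alice / Bob</td></tr>
      <tr><td>2</td><td>Carol / Dave</td></tr>
    </table>
  </td></tr>
</table>
"""


def extract_rows(html):
    """Return the cell text of every row found in the innermost tables."""
    soup = BeautifulSoup(html, "html.parser")
    # Without classes or IDs, pick tables by structure instead:
    # an innermost table is one that contains no other table.
    inner_tables = [t for t in soup.find_all("table") if t.find("table") is None]
    rows = []
    for table in inner_tables:
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                rows.append(cells)
    return rows


rows = extract_rows(HTML)
for row in rows:
    print(row)
```

Filtering for tables that contain no other table avoids reading the outer layout rows, whose text would otherwise repeat everything nested inside them.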
I use Python 2 and Beautiful Soup 4. Python is an excellent tool for text processing, and Beautiful Soup is a popular choice for web scraping among Pythonistas. It is powerful and very simple to use.
The script loops through each event listed on the homepage (e.g. http://bit.ly/2PEGpo0). For each event, it derives the URL of the "Global results" page from the event's URL, relying on S3's URL naming convention. The ranking is then extracted from the "Global results" page (e.g. http://bit.ly/2PEILDh).
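The loop and the URL derivation can be sketched roughly as follows. This is only an illustration: the `_results` suffix, the `example.com` URLs and both function names are placeholders I made up, not S3's real naming convention, and the code targets Python 3 (in Python 2 the same functions live in the `urlparse` module).

```python
import posixpath
from urllib.parse import urljoin, urlsplit, urlunsplit


def global_results_url(event_url, suffix="_results"):
    """Derive a results-page URL from an event URL.

    The suffix is a hypothetical placeholder; substitute whatever
    convention the real pages follow.
    """
    parts = urlsplit(event_url)
    directory, filename = posixpath.split(parts.path)
    stem, ext = posixpath.splitext(filename)
    new_path = posixpath.join(directory, stem + suffix + ext)
    return urlunsplit(
        (parts.scheme, parts.netloc, new_path, parts.query, parts.fragment)
    )


def event_result_urls(homepage_url, event_hrefs):
    """Resolve each event link against the homepage, then derive its results URL."""
    return [global_results_url(urljoin(homepage_url, href)) for href in event_hrefs]


# The hrefs would normally come from the <a> tags on the homepage.
urls = event_result_urls(
    "http://example.com/comp/index.htm", ["event1.htm", "event2.htm"]
)
for url in urls:
    print(url)
```

Resolving links with `urljoin` first means relative links on the homepage are handled the same way as absolute ones.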
Suggestions for further improvement:
- Turn it into a library (package), so that it can be integrated into applications.
- Read the input URL from a command-line argument instead of hardcoding it.
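As a rough idea of what the command-line change could look like, here is a small sketch using the standard `argparse` module. The function name `parse_args`, the argument name `url` and the example URL are my own choices for illustration, not part of the existing script.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None makes argparse read sys.argv."""
    parser = argparse.ArgumentParser(
        description="Scrape rankings from an S3-generated results site."
    )
    parser.add_argument("url", help="URL of the competition homepage")
    return parser.parse_args(argv)


# An explicit argument list is passed here so the example runs anywhere;
# a real script would call parse_args() with no arguments.
args = parse_args(["http://example.com/comp/index.htm"])
print(args.url)
```

Keeping the parsing in a function that accepts an argument list also helps with the first suggestion, since the same module can then be imported by an application without touching the command line.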