Project: M.net listings aggregator

2020-08-23

Goal

At some point this summer I noticed that I was spending too much time browsing listings on Muusikoiden.net’s tori (a Finnish marketplace for used music gear). Of course, as a programmer, the first thing that comes to mind is to automate things, so I did. The goal was to develop a program that scrapes new listings from the site and sends a daily summary of them via email.

Development process

I started by looking at what web scraping libraries were available for Haskell. I also took a look at Python’s Beautiful Soup, just in case it would blow me away with an elegant API design and be an excuse to get more into Python. Beautiful Soup’s documentation is huge, which discouraged me somewhat. It surely does a lot more than the scalpel package that I ultimately chose, but for a project with a very limited scope the latter seemed more appealing. In hindsight, a Python project would have been easier to deploy.

Scalpel’s API is very declarative, so sometimes it was hard to follow why it scraped the things it did. I had to stare at the examples in the package’s documentation for a couple of hours before I got the hang of it. Muusikoiden.net’s HTML structure is from the early 2000s, so the layout is heavily based on HTML tables, and that surely didn’t help either. Still, I’m pleased with how the web scraping code turned out: modular, easily composable and pretty succinct.
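
To give an idea of what that declarative style looks like, here is a minimal sketch of scraping rows out of a table with scalpel. It is not the project’s actual code: the markup in sampleHtml, the class names listing and title, and the Listing type are all made up for illustration, and the real site’s HTML is messier than this.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Text.HTML.Scalpel

-- Hypothetical record for one scraped listing.
data Listing = Listing
  { listingTitle :: String
  , listingLink  :: String
  } deriving (Show, Eq)

-- Made-up, simplified table markup (an assumption, not the site's real HTML).
sampleHtml :: String
sampleHtml = concat
  [ "<table>"
  , "<tr class=\"listing\"><td><a class=\"title\" href=\"/tori/1\">Fender Jazz Bass</a></td></tr>"
  , "<tr class=\"listing\"><td><a class=\"title\" href=\"/tori/2\">Korg MS-20</a></td></tr>"
  , "</table>"
  ]

-- Scraper for a single row: the link text and its href attribute.
listing :: Scraper String Listing
listing = do
  title <- text ("a" @: [hasClass "title"])
  link  <- attr "href" ("a" @: [hasClass "title"])
  pure (Listing title link)

-- Run the row scraper once for every matching table row.
listings :: Scraper String [Listing]
listings = chroots ("tr" @: [hasClass "listing"]) listing

main :: IO ()
main = print (scrapeStringLike sampleHtml listings)
```

The small scrapers compose: listing knows nothing about tables, and chroots simply applies it inside each row it finds.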

With the web scraping part out of the way, I implemented the logic for filtering out already seen listings and for rendering the scraped data back to HTML. I used blaze-html for the latter; it was easy to use and integrated nicely. Out of curiosity I also tried out Clay, which is a CSS preprocessor implemented as an EDSL in Haskell. Its type-safe way of writing CSS was nice, but having to recompile every time to see the changes wasn’t very developer-friendly. Maybe Clay’s advantages show better in larger projects than mine. I didn’t put too much effort into making the layout beautiful. I’m all about functionality when making things for myself 😇.
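
The filtering of already seen listings can be sketched roughly like this, assuming every listing carries a numeric ID and the IDs seen so far are kept in a set. The names Listing, ListingId and filterNew are hypothetical, not taken from the project.

```haskell
module Main where

import qualified Data.Set as Set

-- Assumption: each listing is identified by its listing number.
type ListingId = Int

data Listing = Listing
  { listingId    :: ListingId
  , listingTitle :: String
  } deriving (Show, Eq)

-- Keep only listings whose ID has not been seen before, and return
-- the seen set extended with the IDs of those new listings.
filterNew :: Set.Set ListingId -> [Listing] -> ([Listing], Set.Set ListingId)
filterNew seen scraped = (fresh, seen')
  where
    fresh = filter (\l -> listingId l `Set.notMember` seen) scraped
    seen' = Set.union seen (Set.fromList (map listingId fresh))

main :: IO ()
main = do
  let seen           = Set.fromList [101, 102]
      scraped        = [Listing 102 "Korg MS-20", Listing 103 "Fender Jazz Bass"]
      (fresh, seen') = filterNew seen scraped
  mapM_ print fresh        -- only listing 103 is new
  print (Set.toList seen')
```

Returning the updated set alongside the new listings keeps the function pure; where the set is persisted between runs is a separate concern.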

Sending email was as easy as it should be. Most of the work was just defining the configuration. For that I used Dhall, which is described as a programmable configuration language, though I didn’t really exploit the programmable features. I used it because it integrates really well with Haskell: I didn’t have to hand-roll any logic for dealing with configuration files. Instead, I just had to describe the configuration file’s structure and data types.
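
As a rough illustration of what describing the structure means with the dhall package: you declare a record type, derive Generic, and let FromDhall do the decoding. The field names and values below are invented for this sketch and are not the project’s real configuration.

```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Text (Text)
import Dhall (FromDhall, auto, input)
import GHC.Generics (Generic)
import Numeric.Natural (Natural)

-- Hypothetical configuration record; the field names are assumptions.
data Config = Config
  { smtpHost  :: Text
  , smtpPort  :: Natural
  , recipient :: Text
  } deriving (Generic, Show, Eq)

-- The decoder is derived from the record's shape via Generic.
instance FromDhall Config

-- Inline Dhall expression standing in for a configuration file.
sampleConfig :: Text
sampleConfig =
  "{ smtpHost = \"smtp.example.com\", smtpPort = 587, recipient = \"me@example.com\" }"

main :: IO ()
main = do
  config <- input auto sampleConfig :: IO Config
  print config
```

Passing a path such as "./config.dhall" to input instead of the inline expression would read the same record from a file, since a Dhall expression can itself be an import.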

The program was coming together and I began thinking about deploying it. Initially I thought that the program would run on my Raspberry Pi and be invoked daily by a cron job. But it turned out that Haskell compiler support for ARM architectures was very limited, and getting Stack (the Haskell build tool I use) to work on it was beyond me. This is the only point in the development process where I regretted choosing Haskell.

It was clear that I couldn’t host the program myself, so I started investigating how to use Google Cloud’s infrastructure to make it work. I ended up refactoring the program into a web app so that it could be deployed in a Docker container on Cloud Run and then invoked periodically by a request from Cloud Scheduler. Cloud Run is serverless, so I couldn’t depend on just dumping the listing numbers in a file and had to use a database for persistence. I chose Redis because it has a really simple API. The silver lining of all this is that the program no longer depends on my hardware, so uptime will probably be better (Raspberry Pis aren’t known for their reliability).

Final thoughts

It took a while to develop this project and the process wasn’t without setbacks, but I’m pretty confident that the time investment will pay off eventually. Nowadays I don’t really browse Muusikoiden.net anymore. I just wait for the summary to arrive in my inbox. I also feel that coding this project in Haskell strengthened my grasp of monads and my ability to figure out libraries.

Source code on GitHub

Some useless statistics:

---------------------------------------------------------------
 Language      Files    Lines     Code     Comments     Blanks
---------------------------------------------------------------
 Haskell          13      465      310          111         44
 Markdown          1      107      107            0          0
 Cabal             1       87       76            5          6
 Dockerfile        1       54       27           14         13
 YAML              2       69        7           57          5
---------------------------------------------------------------
 Total            18      782      527          187         68
---------------------------------------------------------------
Updated 2020-08-23