Scraping gender distribution data on Skoob
About a month ago I published a data analysis article with the demographics of all the books I have read, and today I’ll dive deeper into how that was accomplished!
Searching for an existing API
When I noticed that every Skoob book page had the distribution of readers by gender, I immediately knew I had to mine it to see how it correlated to my reading activity. The first obvious step was to check whether Skoob had an official, public, documented API, but it didn't1. My second-best alternative would be if someone had already made an unofficial API, which is when I found Fernando Arruda's Skoob API. The project documentation said it could fetch user bookshelves and individual book data, so I thought: "this is going to be a walk in the park!".
Studying the Book object returned by the API, I noticed it had a lot of data about the book, but not what I wanted (the breakdown of male and female readers), so I had to dig deeper to see how it was fetching the information. I'm not a TypeScript person, so I was lucky to notice an open issue questioning why the API needed cookies. There I found out that the project was using a sort of internal Skoob API, but Fernando had already mapped everything it returned into his own objects, so that wouldn't be my way out.
Writing my own script with Python Requests
Half frustrated by not having ready-to-use code and half excited about having to write my own, I quickly created a Python script and said: "time to get my poor Python `Requests` knowledge to work!". For the unaware, it is a very powerful package that allows you to make `get` and `post` requests to a page. It is very versatile and I was very confident that I would be navigating the pages in no time, using something like:
```python
import requests

# Log in by posting the credentials to Skoob's login form, keeping the
# session cookies around for subsequent requests
payload = {'email': email, 'password': password}
session = requests.Session()
session_request = session.post("https://www.skoob.com.br/login/", data=payload)
```
I ran the script and got it to work2 🥳. Now it was just a matter of navigating to the bookshelf to start collecting the book pages and then their data, but I just couldn't make any progress 🫠. I started fiddling around with Chrome DevTools to inspect the page's HTML and saw it was sprinkled with `ng` tags, which Angular uses to build a dynamic page3. This meant my code would never be able to get the actual page data: it assumed the HTML returned by a request was the fully loaded page, but the data was only filled in later by JavaScript running in the browser. Here is how the script looked until I abandoned it.
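In hindsight, a quick way to see the problem is to inspect the raw HTML that `Requests` brings back: the page template is there, but the rendered book data is not. The sketch below assumes the `session` from the earlier snippet; the bookshelf URL is a placeholder, and `.livro-capa` is the book-cover class I only relied on later with Selenium:

```python
# Sketch of the check: the server response contains the Angular template,
# not the data rendered client-side. The URL below is a placeholder.
bookshelf_url = "https://www.skoob.com.br/..."  # placeholder for my bookshelf page
response = session.get(bookshelf_url)

print("ng-" in response.text)         # Angular directives show up in the template
print("livro-capa" in response.text)  # likely False: covers are rendered by JavaScript
```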
Writing my own script with Python Selenium
I started googling around to see whether `Requests` would still be able to accomplish this task and all I got back was "use `Selenium`". Using it is not so different from `Requests`, in the sense that you are still doing scattered `get` calls, but there are three key differences:
- It has built-in functionalities to wait for a given element to appear on the page before moving on.
- Every time you search for its examples on the web your first result will be a code snippet in Java, and it took me a while and several Python exceptions to realize that 🫣.
- It actually runs in the foreground: you see the browser pop up and navigate on its own (the basic driver setup is sketched right after this list), as if you had recorded a series of steps to be executed automatically, which is somewhat rewarding after you're done coding, despite being slower4.
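For reference, the later snippets assume a `driver` that has already been created and is logged in. A minimal sketch of that setup, assuming Chrome and the Selenium 4 API (the login selectors are placeholders, not Skoob's real form fields), would be something like:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Start a Chrome-driven browser window; Selenium 4 resolves the matching
# chromedriver automatically via Selenium Manager.
driver = webdriver.Chrome()

# Open Skoob's login page and fill in the credentials.
# The selectors below are placeholders, not Skoob's actual form fields.
driver.get("https://www.skoob.com.br/login/")
driver.find_element(By.NAME, "email").send_keys(email)
driver.find_element(By.NAME, "password").send_keys(password)
driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
```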
Point #1 was a game changer, as now I could seek out parts of the HTML page that were only there after it finished loading and use those as my checkpoints before retrieving the data. Here are a few examples:
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait until being redirected to the user home page after logging in
WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, "livro-perfil-status02")))
# Wait until the first book cover appears, which means pagination has completed
WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".livro-capa")))
```
Note that it allows one to locate elements using different HTML traits, which was very important for me, as I could not always rely on an element having an ID. The same technique can also be used to find data within the page once it has finished loading, instead of having to manually parse it with regular expressions or similar:
```python
# Construct the list of book pages from the covers on the current shelf page
book_links = driver.find_elements(By.CSS_SELECTOR, ".livro-capa")
read_book_pages.extend([elem.find_element(By.TAG_NAME, "a").get_attribute('href') for elem in book_links])
# Find the next page button to click
next_page_button = driver.find_element(By.XPATH, "//*[contains(text(), '›')]")
```
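Putting those pieces together, the shelf crawling ends up being a loop of this shape. This is a simplified sketch of the idea rather than my exact code; in particular, the stopping condition and the wait after each click are assumptions:

```python
from selenium.common.exceptions import NoSuchElementException

# Collect book page links shelf page by shelf page, clicking "next" until
# there is no next-page button left. Simplified sketch, not the exact script.
read_book_pages = []
while True:
    book_links = driver.find_elements(By.CSS_SELECTOR, ".livro-capa")
    read_book_pages.extend([elem.find_element(By.TAG_NAME, "a").get_attribute('href') for elem in book_links])
    try:
        next_page_button = driver.find_element(By.XPATH, "//*[contains(text(), '›')]")
    except NoSuchElementException:
        break  # no more pages to visit
    next_page_button.click()
    # In practice this wait may need to be smarter (e.g. waiting for the old
    # covers to go stale), since the previous page's covers may still be present.
    WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".livro-capa")))
```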
With that, it was just a matter of running the script and waiting for the magic to happen 🪄! There were a lot of bits that could be improved: places where I couldn't put a proper wait hook and instead relied on the script "sleeping", some hardcoded IDs, and so on. But I didn't want a fully generic script to share, I just wanted to get my data.
After finishing all the scraping, I started playing around with `Matplotlib` in the same script to plot the histogram and the markdown table I used on the post. For the histogram, it probably would have been faster if I had used Excel or Google Sheets, but I like spending the time fighting with (and learning about) matplotlib. At this point, since I didn't want to fetch the data over and over while I adjusted the histogram axes, colors and so on, I added some `JSON` caches of the data to the script, so I could fast-forward the slow part of running `Selenium`5.
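The caching itself is nothing fancy; a sketch of the idea is below (the file name, the data shape and the `scrape_books_with_selenium` helper are illustrative, not what my script actually used):

```python
import json
import os

CACHE_FILE = "skoob_books.json"  # illustrative file name

def load_or_scrape_books():
    # Reuse previously scraped data if a cache exists, so tweaking the plots
    # doesn't require re-running the slow Selenium navigation.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, encoding="utf-8") as f:
            return json.load(f)
    books = scrape_books_with_selenium()  # hypothetical wrapper around the slow part
    with open(CACHE_FILE, "w", encoding="utf-8") as f:
        json.dump(books, f, ensure_ascii=False, indent=2)
    return books
```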
The final code is not pretty, but it worked, gave me some very good insights and a fun coding afternoon. I'm looking forward to my next opportunity to use `Selenium` and am glad to have added it to my belt of tools.
1. There is no date on that FAQ entry. According to the Internet Archive's Wayback Machine, the page used to say "We have, but it is not published yet" in January 2016, and at some point between May 2022 and December 2023 it changed to "We don't, but we're checking if we can have one" 🤔.
2. Not immediately, you know, otherwise it wouldn't have built excitement.
3. Luckily I had a very brief contact with Angular when we ran the Curriculo para Elas project during the COVID-19 pandemic, which helped me quickly realize what it was.
4. It has the bonus of getting a puzzled look from your wife as you stare at the screen with a self-driving web browser 🤣.
5. The best solution likely would have been to do the scraping and the plotting in different scripts, but I was lazy.