Book Success Prediction by Awards

Table of contents

  1. Introduction
  2. Imports
  3. Data acquisition
    3.1 Scraping challenges
    3.2 Scraping clean data
    3.3 Authentication process
    3.4 Authentication class
    3.5 Scraping Process
    3.6 Book Spider Class
    3.7 Scraping route creation
    3.8 Genre spider
  4. Scraping and threading
    4.1 First crawl
    4.2 Concatenating data
    4.3 Total data scraped
  5. Data cleaning
    5.1 Corrupted data cleaning
    5.2 Replace missing data - original title
    5.3 None values - discussion and strategy
  6. Pre outliers cleaning EDA
    6.1 Genre distribution
    6.2 Mean rating by genre
    6.3 Language distribution
    6.4 Edition count to rating
    6.5 Rating to award
    6.6 Pages count to books count
  7. Dealing with outliers
    7.1 Outliers detection
    7.2 Outliers cleaning
    7.3 Outliers cleaning results
  8. EDA after outliers cleaning
    8.1 Thoughts of the results
    8.2 Aggregation metrics
    8.3 Original title correlation with awards
    8.4 Awards count per genre
    8.5 Awards percentage by genre
  9. Machine learning preparation
  10. Machine learning - Decision tree
    10.1 Single decision tree
    10.2 First prediction
    10.3 New dimension - The ace in the sleeve
    10.4 Depth optimization
  11. Machine learning - Random forest
    11.1 Overfitting?
    11.2 Model improvement
    11.3 Adjusting features
    11.4 Grid search many forests
    11.5 F-score accuracy addition
    11.6 Random states tests
  12. Conclusion and credits

Introduction

Today the movie industry is one of the biggest industries in the world.
It is full of actors, producers, films and even plenty of rating websites such as IMDb, Rotten Tomatoes and Metacritic.
As the film industry grew, we felt that something had been forgotten here and thought to ourselves: what about the book industry and those great books which, in time, became the scripts of many successful movies?
This thought led us to ask ourselves: assuming we were a huge book publisher and a writer came to us with a book, how could we know whether this book will be successful?
Also, if we were the authors of the book, could we ever know whether the book will win the audience's sympathy or even reach the cinema theatres?
To answer these questions we set a goal for our research: to see whether we can build a model that predicts, from a list of book features, if a book is so successful that it will also be awarded.

Import needed packages

Data Acquisition

In order to create a prediction model we first needed to gather relevant data.

Considering our options for data acquisition sources, we decided to look for the biggest library catalog and reading list websites, scrape the data we thought would be helpful, and save it as a DataFrame.

The top options we found were:

After a focused search of these websites for relevant data and a large community, we decided to move on with Goodreads due to the scale of its community and the number of features provided with each book.

We started looking for a way to scrape data from Goodreads and immediately faced the issue of having to authenticate and make all network requests to their servers under the same authenticated session.

Since we used authentication credentials, we stored the credentials in a local .env file and loaded the variables by name instead of hardcoding them.
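The credential loading can be sketched with a minimal stdlib-only loader (the python-dotenv package does the same job; the key names below are illustrative, not necessarily the ones we used):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader (stdlib only); python-dotenv offers the same idea."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and malformed lines.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Don't override variables already set in the real environment.
            os.environ.setdefault(key.strip(), value.strip())

load_env()
EMAIL = os.getenv("GOODREADS_EMAIL")       # hypothetical key name
PASSWORD = os.getenv("GOODREADS_PASSWORD") # hypothetical key name
```

This way the credentials never appear in the notebook itself, and the .env file can be excluded from version control.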

Scraping challenges

While making our first attempts to scrape the data from Goodreads, we found the following issues:

We dealt with these issues by creating a safety method for each challenge.

Below are some util functions that help scrape the data without harming the 3rd-party website, along with a retry mechanism in case of failure.
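A minimal sketch of such a utility, assuming a requests-style session object with a `get` method; the delay values and retry counts are illustrative, not the ones we actually tuned:

```python
import time
import random

def polite_get(session, url, retries=3, base_delay=1.0):
    """Fetch a URL with a small random delay and exponential-backoff retries,
    so we never hammer the third-party site. Sketch only."""
    for attempt in range(retries):
        # Throttle every request with a slightly randomized delay.
        time.sleep(base_delay + random.random())
        try:
            response = session.get(url, timeout=10)
            if response.status_code == 200:
                return response
        except Exception:
            pass  # network hiccup: fall through to the backoff and retry
        # Back off exponentially before the next attempt.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Failed to fetch {url} after {retries} attempts")
```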

Scraping clean data

While scraping we found that some of the features came back defective, such as:

In order to resolve these issues we created a converter which extracts precisely the data we wanted.
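A converter of this kind can be sketched with two small regex helpers; the input formats shown are illustrative examples of the kind of defects we saw:

```python
import re

def extract_int(raw):
    """Pull a plain integer out of scraped text like '1,402,381 ratings'.
    Illustrative converter; the notebook's actual one may handle more cases."""
    match = re.search(r"[\d,]+", raw)
    return int(match.group().replace(",", "")) if match else None

def extract_float(raw):
    """Pull a decimal value such as the '4.35' in 'avg rating 4.35'."""
    match = re.search(r"\d+\.\d+", raw)
    return float(match.group()) if match else None
```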

Authentication process

At the beginning we were trying to understand what was needed from us in order to authenticate to the Goodreads servers.
We looked at the requests sent to the server while using the website as a normal user and logging in to an account.
We could see that besides the username and password sent in the request, there were 2 more fields:

After a deep investigation of the HTML tree we finally found these fields, both under hidden-type input tags, presumably placed there to prevent the "casual" user from finding them.
Once we found these fields we were able to create a class to automate the authentication process.
Note: these days the class will not work due to a massive change to the website made on the 19th of April 2022, which included changes to the authentication service - you can still see the old website here as of the 18th of April 2022.

But why do we need authentication in a free to use website?

Apparently the website was free to use, but it was not functioning properly while the user was not signed in.
Here is a good example that shows how the shelves page does not show the right books while the user is not signed in (we will talk about shelves later on).

Authentication Class

Used to store persistent authentication data inside a requests session, to allow authenticated scraping.
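The hidden-field extraction step at the heart of the class can be sketched with the stdlib HTML parser (the actual class wraps a requests.Session and posts the login form; the field name in the usage below is hypothetical):

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collect the hidden <input> fields from a sign-in form -- the two
    extra values expected alongside the username and password."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden":
            self.fields[attrs.get("name")] = attrs.get("value", "")

def hidden_form_fields(html):
    """Return {name: value} for every hidden input in the given HTML."""
    parser = HiddenInputCollector()
    parser.feed(html)
    return parser.fields
```

Once these fields are extracted from the login page, they are simply sent back to the server together with the credentials inside the same session.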

Authenticating...

Scraping Process

Once we finished creating both the authentication class and the utility functions, we were right at the spot where we felt confident enough to start building the spider classes which would scrape the data for us automatically.

At first glance at the website we thought to ourselves: how exactly do we want to scrape the data?
So we went to the website and started exploring it in order to find the best route for our scraping process.

We could clearly see that our best option for getting maximum data was the book's page, which contains a huge amount of data.
Right at first sight we could mark data which we knew we were going to use.
The elements we saw right away:

At this phase we asked ourselves 2 questions:

  1. What is the exact data we wish to scrape?
  2. How can we easily access a large number of books which will represent different types of books?
To move forward we first needed to know exactly what we want to scrape for each book.
We could easily get from the book's page the following list:

But this was not enough information and we were not satisfied with that, so we continued looking at more books, and that's where we found that some books had a field called "original name". We had a feeling that this field might reveal a good signal, where a book did not succeed under one name but did under another, therefore we added this field to the list.
Yet we knew this was still not enough data, so we explored pages connected to the book's URL page to see what data is available from each book page.
The first thing we wanted was a connection to the author; we knew that a book's success has to be somehow related to the author who wrote the book.
Luckily for us - the name of the author was also a link to the author's page.

The exploration in the author's page then began.
While we were exploring the author's page we found the following data, which we decided could be helpful in creating a good prediction method.

We now felt more confident about the data we had, but then we recalled one field that had caught our interest on the book page: the "Want to Read" button.
We wondered, is there any way we could get the number of people interested in reading the book?
Back on the book's page we pressed the "Want to Read" button, and that revealed a new field of interest to us - "Currently reading".

But still, we did not know how to get the number of users who want to read the book or are currently reading it.
So we kept exploring the book's page up to the point where we pressed the "see top shelves..." hyperlink; there we could finally get the numbers we wanted, since both "to read" and "currently reading" were actually tags on the shelves.

Great!

We now have all the data we want to scrape and we know the route from a book's page to the other pages we can get the data from.
That answers both of the questions we were asking ourselves.
This is where we decided to create the first spider that scrapes the data we agreed on.
We inspected each and every element we were looking for in the HTML tree of each page, and the result led to this class:


Book Spider Class

The book spider class is a tool which receives the book's URL and the authenticated session we are using, and scrapes the data from the book's page as well as from the author's page and the top shelves page (issuing additional HTTP requests for the other pages linked from the book's page when needed).

Reminder of the list of data and where each item is scraped from:

  • title - book page
  • original title - book page
  • rating - book page
  • ratings count - book page
  • reviews count - book page
  • book format - book page
  • pages count - book page
  • main language - book page
  • editions count - book page
  • has awards - book page
  • top tagged genre 1 - book page
  • top tagged genre 2 - book page
  • top tagged genre 3 - book page
  • to read count - books top shelves
  • currently reading count - books top shelves
  • author name - author page
  • author followers count - author page
  • author books count - author page
  • author avg rating - author page
  • author ratings count - author page
  • author reviews count - author page

Creating the route for successful scraping

Now all that was left was to find a way to access a large number of different types of books.
So we went on and thought: what could be the best way to get different types of books? Genres, of course!
Great, we know how to get a great scale of books, so we decided to check what the genres page looks like when exploring all the genres.

This is the page we found there.

Inside a genre we could see access to book links, but we were not satisfied with the way book links are accessed from the genres page - the coverWrapper class was too generic and gave us no access to a large number of books.

Exploration of the genre page led to this button:

This link provided us with the solution to all of our problems.
This page's URL is built from easy-to-explore API links:

Example for a complete URL - https://www.goodreads.com/shelf/show/horror?page=2
Moreover, accessing a book's page was easy from this page. Last but not least, we knew exactly how many books to expect on each page, since each shelf page holds up to 50 books.
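Building the paginated shelf URLs is then a one-liner (using the horror example above; the 20-page cap reflects the 1000-books-per-genre boundary we set):

```python
def shelf_page_url(genre, page):
    """Build the paginated shelf URL, e.g. horror page 2 ->
    https://www.goodreads.com/shelf/show/horror?page=2"""
    return f"https://www.goodreads.com/shelf/show/{genre}?page={page}"

# Each shelf page holds up to 50 books, so 20 pages cover a 1000-book cap.
urls = [shelf_page_url("horror", p) for p in range(1, 21)]
```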

Genre Spider Class

Once we found the right way to start scraping the data, what was left was to create a genre spider which would handle browsing inside genres and help us get the books easily.
We built a genre spider which receives a genre from a list of genres we decided on; the spider handles the mechanism for fetching pages by genre and uses book scrapers to store the books' data.

Scraping - Let's start scraping

But before we started scraping there was something we wanted to improve in our scraping model.
Because scraping is a lot of IO and different genres are independent of each other, we decided to use multi-threading in order to get better network utilization.
We did not want to overload the site and harm other users' experience with over-requesting; however, we wanted to decrease our scraping time by scraping multiple genres at the same time.
But what is the right number of threads that keeps a balance between faster scraping and not over-requesting the website, which would also delay the server's response time?
The best practice was trial and error; here are the trials we did:

In conclusion we ran 5 threads which were scraping 5 different genres at the same time.
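The threading setup can be sketched with concurrent.futures; `scrape_genre` here is a stand-in for the real per-genre scraping routine that drives a genre spider:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real per-genre scrape routine (hypothetical body).
def scrape_genre(genre):
    return f"{genre}: done"

genres = ["horror", "fantasy", "romance", "travel", "history",
          "science", "poetry"]

# 5 workers -> at most 5 genres scraped concurrently, per the trials above;
# extra genres simply wait for a free worker.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(scrape_genre, genres))
```

`pool.map` preserves the input order, so the results line up with the genre list even though the work runs concurrently.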

First Crawl Conclusion

As can be seen in the logs above, although at the start we set a boundary of at most 1000 books per genre, it seems like some of the genres have only about 6 pages.
You can see it below, as the travel books row count is below 1000 (below 20 pages).

First Crawl Conclusion - Continue

In addition, after further investigating our scrape target website, we saw that all genres have at most 25 pages (1250 books) to browse. Therefore, we are going to run the scrape flow again so we fetch every book we are able to (within the selected genres), but this time we change the genreSpider class to support an offset, so it won't re-read the same books, and we will append the books to the existing parquet file instead of overwriting it... other than that, same approach. Note: the original class has been updated.

Now, let's concat all those genres dataframes into one single books dataframe
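The concatenation itself is a single pandas call; the tiny per-genre frames below are stand-ins for the real ones read back from the parquet files:

```python
import pandas as pd

# Hypothetical per-genre frames; in the notebook each one was read back
# from the parquet file written during scraping.
horror = pd.DataFrame({"title": ["It"], "top_tagged_genre_1": ["horror"]})
fantasy = pd.DataFrame({"title": ["The Hobbit"], "top_tagged_genre_1": ["fantasy"]})

# ignore_index gives the combined frame one clean 0..n-1 index.
books = pd.concat([horror, fantasy], ignore_index=True)
```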

Let's check how much data we scraped before we start cleaning it (if needed).

Great!

We finally scraped the data we were looking for; we are now ready for the next step.

Data Cleaning

After the scraping process we could easily tell that some data is duplicated.
First, Let's clean these duplicate books.

As can be seen, we got rid of 559 rows of duplicated books just by using a built-in pandas function called duplicated.
For more information see documentation: Here

Clean Corrupted Data

Now that we know duplicates are gone, let's take a look at our DataFrame and see what defects in the data do we have.

As can be seen, the author_books_count max of 192,863 is too high; let's check the cause.
We approached this issue by looking at exactly who this author with a huge number of books is.

The result was Anonymous. Pretty suspicious, but we still didn't want to remove books which have no known author before knowing how much it would affect our statistics.
Let's look at the data of the books marked with the Anonymous author.

As can be seen, there are a lot of columns with suspicious values which may harm our outlier detection,
e.g. the median for author_books_count should actually be very low compared to that.

We were curious to see how many of the 17,228 books we have right now are actually marked with the Anonymous author.
Let's count how many books we have which are marked with the Anonymous author.

Not many books: 33 out of 17,228, not even 1% of the books we own.
Let's remove them.
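The removal can be sketched as a simple boolean filter (toy rows for illustration):

```python
import pandas as pd

# Toy stand-in for the scraped books frame.
books = pd.DataFrame({
    "title": ["Beowulf", "Dracula", "Emma"],
    "author_name": ["Anonymous", "Bram Stoker", "Jane Austen"],
})

# Keep only rows whose author is not the "Anonymous" placeholder.
books = books[books["author_name"] != "Anonymous"].reset_index(drop=True)
```

The same filter works for the Unknown-author cycle later on, with the placeholder string swapped.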

Good, no more Anonymous authors. Let's re-check our DataFrame info.

Again we found that the top books count is too high, standing at 162,128.
That led us to repeat the flow we did for the Anonymous author, but this time for the Unknown author.

Even fewer than the previous cycle: 4 books in total.
Let's remove them.

Let's take another look at the books count info.

Success

We finally got to the point where the number of books related to the author is reasonable; moreover, the author is one of the most popular authors who ever lived (William Shakespeare).
Note: we did notice the enormous editions count for our new top books count author, but it is a legitimate outlier.
Next we thought to ourselves that the original title field might be a place where a lot of "defective" data hides.
Let's take a look there.

Replace missing Original title

The numbers spoke for themselves: we had around 2k books without an original title.
So maybe we want to save those 2k books which just don't have original_title.
Logically, as we know our data, books with a null original_title are just books whose title never changed.
Instead of keeping the data that way, we can:
1. Fill original_title with the current title if the original title is missing.
2. Create a new flag column, is_title_changed, which may perform well in our later ML model.
Let's do it!
First we add the title to all books with a missing original_title.

Now let's add the new is_title_changed flag mentioned.
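Both steps together can be sketched as (toy rows for illustration):

```python
import pandas as pd

books = pd.DataFrame({
    "title": ["The Hobbit", "Harry Potter and the Sorcerer's Stone"],
    "original_title": [None, "Harry Potter and the Philosopher's Stone"],
})

# 1. A null original_title means the title never changed -> copy it over.
books["original_title"] = books["original_title"].fillna(books["title"])

# 2. Flag whether the published title differs from the original one.
books["is_title_changed"] = books["title"] != books["original_title"]
```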

As can be seen, only 4,576 books out of the list had their title changed.
This time what really caught our eye in the info was the main_language count, which we could see was more than just slightly lower than the title count.

The numbers were not adding up

Because of the main_language NA count we faced, we raised the following question:
just how much of the data we own has no NA values at all?
Let's check how many total books are there without any NA values.

Out of 17,191 books we had, only 16,205 books are NA "free".

Facing NA values strategy

Logically we knew not all NA values must be removed, but we also knew we should keep our DataFrame rows aligned.
We also knew our goal was to end up somewhere around 16,205 books, since that was the number of books without any NA values.
Our strategy was to divide and conquer the NA data column by column.
We started by cleaning the main language, which we already knew had NA values.

As can be seen, there are a lot of different languages (21, None excluded) which are widely used across the world, so we decided that because:
1. There are few missing languages
2. Filling null values with just the most frequent language is unreliable
we have chosen to drop the rows with a missing main language.

The next NA values we dropped were top_tagged_genre_2 and top_tagged_genre_3.
As can be seen, there are just 4 books which are missing the second and third tags (out of 16k).
Let's drop them.

As we were thinking that book_format and pages_count are important and we don't want to drop them,
we were also thinking that the following replacements would still keep our data reliable:

  1. median/mean for pages_count - a very common statistical approach.
  2. top value for book_format - since it is a string column, we thought the most common value is a reliable option.

Mean or Median?

In order to decide which metric (mean or median) to use for replacing the NA values of pages_count, we wanted to base our decision on the distribution of the metric.

As can be seen, the distribution shows us that there are many outliers (e.g. some books having 14k pages while most have fewer than 1k), so we are going to use the median (the mean might be too high). And as can be seen, pages_count has the wrong dtype, float instead of int; let's fix that too.

Let's fix the nulls of book_format using the top value, as mentioned before.
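Both fills can be sketched as (toy rows, including a deliberate 14k-page outlier):

```python
import pandas as pd

books = pd.DataFrame({
    "pages_count": [320.0, None, 180.0, 14000.0],
    "book_format": ["Paperback", "Paperback", None, "Hardcover"],
})

# Median is robust to the huge page-count outliers, so use it for the fill,
# then restore the integer dtype the column should have had.
books["pages_count"] = (
    books["pages_count"].fillna(books["pages_count"].median()).astype(int)
)

# For the categorical format, fall back to the most frequent ("top") value.
books["book_format"] = books["book_format"].fillna(books["book_format"].mode()[0])
```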

Great

We don't have any missing values anymore

However, we didn't finish our cleaning session yet; we may still have some outliers (we surely do, as we have seen in the histogram for pages_count) to be cleaned.

Pre-outliers cleaning EDA

At this point we knew we had to get rid of outliers; we thought it would be best to check the outliers and show them by doing some EDA.
Also, the EDA shows us the current state of the data.

Genres distribution

As can be seen, most of the books are either Fantasy or Nonfiction books.

Mean rating by genre

As can be seen, some of the genres have an average rating higher than the others.
Maybe it could help us later in building a model.

Language distribution

Not to our surprise, by far the most used language is English; maybe it's related to the fact that it's the most widely used language in the world.

Edition count to rating

The idea behind this graph was to check for possible correlation between the two.

let's take a closer look at the graph.

As can be seen, most of the books which have awards have fewer than 250 editions (which makes sense).

Rating to award

Another EDA step which we thought would help us build a prediction model later on.

As can be seen, although every book has a rating of at least 2.42, below a rating of 3 there is not even a chance of getting an award.
This information is great for us.

Pages count to books count

As part of checking some more ideas of what can lead to outliers and help the prediction model, we decided to check the books' pages count, and then we saw this.

This kind of data is exactly why we need to remove outliers.
This leads us to show all the outliers we might have, based on the data of each column individually.
Anyway, because of the data's skew, we will probably choose the median over the mean (which may be very high because of the outliers).

Let's have a look at both.

It seems like the median is closer to the peak area than the mean is; however, let's zoom in even further.

Now, as can be seen, the median really does seem to be more representative (even with the number of records we have).

Dealing with outliers

In this section, we are going to:
  1. Detect outliers in all of our numeric data and present them using boxplots.
  2. Use the IQR method, replacing the values of numeric columns which fall below the 25th percentile or above the 75th percentile with the median.

Outliers detection

Outliers Cleaning

At first we wanted to replace the values under Q1 and above Q3 with the median.
Yet this approach would have been unfair to great authors such as J.K. Rowling and William Shakespeare, who earned their audiences fairly; by replacing their values with medians we would have ruined the data and might also have harmed our future prediction model.
So, in order to keep our data as realistic as we can and at the same time remove outliers, we decided to use capping instead of medians; that way the great authors will not be downgraded too much.
Another approach we wanted to try was z-score with the mean; the reason was that since the z-score model keeps around 99% of the population as non-outliers (in case of a normal distribution), it would detect "real" outliers.
So, in order to decide which approach's data we want to use for the prediction model, we decided to try them both and compare the data's skew after each.
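The IQR capping can be sketched as follows; we use the standard 1.5 x IQR fences here, which may differ from the exact bounds used in the notebook:

```python
import pandas as pd

def cap_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the bounds instead of
    replacing them with the median, so legitimate extremes are only trimmed."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
```

`clip` leaves the in-range values untouched, which is exactly why the great authors keep most of their lead instead of collapsing to the median.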

As expected, capped IQR seems to be the best when it comes to skew.

Now we can proceed.

Let's visualize again the problematic pages_count metric after outliers cleaning

Outliers cleaning results

After the clean-up, here is the columns' data without outliers.

Data cleaning finished

We are done with the cleaning process; let's save it.

EDA after outliers cleaning

Now that the data is clean, it's time to see how much we have accomplished so far and show all the stats we have based on the data acquired.

Thoughts

Based on what we saw in the description of the DataFrame, we were happy with the results.
All the numbers added up to reasonable values, and the percentile values for each data column were exactly how we wanted them.
Also, just by looking at the DataFrame as it is, we started having guesses about how to predict whether a book can be awarded or not.
Overall we were very satisfied with the results.

Metrics aggregation by getting awarded

We wanted to see, for several aggregated metrics, how many books have awards and how many do not.
Thus we decided to display these metrics.
The aggregation model is the ratio of:
sum of awarded books based on the metric / total awarded books.
Note: we had to exclude rating and author average rating from the list, due to the aggregation model's inability to meaningfully sum a bounded value.

Changing titles effectiveness

The question of how effective is it to change the name of the book was interesting to us from the beginning of this research.
Now we can finally check the correlation between changing the title and getting awards.

As can be seen, there is no correlation between these two; not all that glitters is gold.

Count of awarded books per genre

Percentage of awarded books per genre

Machine Learning model - Preparation

Finally, time to make some predictions.
The method we wanted to use for our machine learning was a decision tree.
The reasons we chose this method were:

Yet, we still had to make some data modifications before creating a machine learning model that can predict whether a book will be awarded.
First, we planned to eliminate columns that are not relevant to the prediction model:

After that, we want to be able to use the boolean-based features in our DataFrame as value-based columns.
These booleans are:

So, in order to make them numeric, we decided that a True value will be represented as 1 and a False value as 0.

Last but not least, we had to apply label encodings to all object-type columns so they fit the form expected by the machine learning model.
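Both transformations can be sketched as follows; `pd.factorize` stands in for sklearn's LabelEncoder, and the rows are toy examples:

```python
import pandas as pd

books = pd.DataFrame({
    "has_awards": [True, False, True],
    "main_language": ["English", "Spanish", "English"],
})

# Booleans become 1/0 so the model sees plain numbers.
books["has_awards"] = books["has_awards"].astype(int)

# Label-encode every remaining object column: factorize assigns one
# integer code per distinct category, in first-seen order.
for col in books.select_dtypes(include="object"):
    books[col] = pd.factorize(books[col])[0]
```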

Terrific

Our DataFrame is now ready for the machine learning model.
Let's save the DataFrame and create the prediction model.

Machine learning model - Single decision tree

Our decision tree model is based on 3 important utilities:

  1. The sklearn.tree module, which includes decision-tree-based models for classification.
  2. From sklearn.model_selection we used train_test_split - a quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and applies them to the input data in a single call for splitting (and optionally subsampling) data in a one-liner.
    We wrapped it with our own data splitter based on the features, labels and random state of our choosing.
  3. A rendering function which draws the decision tree.

Here you can see the splitting function and the rendering function.

First model prediction attempt

And here you can see the decision tree model with the prediction and accuracy of the model's training and testing, based on the rating feature only.

Seems like rating is not enough, as suspected

In the next section we are going to improve our prediction using both more features and improved params.

The ace in the sleeve

That's better, but it seems like maybe we can create a new column which may help us.

We are going to count how many awarded books each author has and use it as a hint which may indicate a successful writing style.
The idea was that if an author was given awards for previous books, the same author might get another award, and that information indicates a writing style that may be successful (as in other fields such as movies, where films by certain directors are highly likely to be awarded).

Using the same aggregation graph we used before, you can now see that this "new" feature we created has a lot of meaning for the prediction model.
Let's take a look at our new DataFrame.

Time to check whether the new author awards count we added helps us with the prediction or not.

Great improvement

Our "Ace" did help with improving the test group prediction accuracy by 10%, which is highly significant.

Depth optimization

Using grid search we wanted to know if we can further improve our prediction accuracy.
The tests were done using GridSearchCV from sklearn's model_selection module.
The results were:

Note that we decided to search the range of 2-12 for depth and 5-50 for min samples split, due to the possibility of overfitting if we tried any higher values.
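A sketch of such a search over the stated ranges, on stand-in data rather than the real book features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; the notebook searches over the real book features.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

param_grid = {
    "max_depth": list(range(2, 13)),             # 2-12
    "min_samples_split": list(range(5, 51, 5)),  # 5-50
}
# Cross-validated exhaustive search over every parameter combination.
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=3)
search.fit(X, y)
best = search.best_params_
```

`best_params_` then holds the depth and split values used for the next prediction attempt.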

Let's try to predict with the suggested max_depth

Slight improvement

We got ourselves better training accuracy, but the test prediction did not improve by a noticeable percentage.

Prediction model - Random forest of decision trees

Understanding that one tree was just not enough to predict whether a book will get an award,
we decided to get better prediction results using a random forest model.
Reasons to use this model:

100% of training accuracy, are we overfitting our model?

At this point we were afraid our model was overfitting and that our training methodology needed to be changed.
So we went investigating to understand: is the model overfitted?
After some investigation we found that we had not necessarily overfitted our model.
Here is where we got the information which later allowed us to understand that we are not overfitting.

But can we further improve the model?

In order to answer that question, we decided to "eliminate" the features which are the least efficient for the prediction.
So we gathered the weights of each feature in the prediction.

The result is that is_title_changed was actually the least effective feature.
Let's remove the least effective feature (is_title_changed).

Less is more?

Removing is_title_changed came up as a great idea which led to better accuracy.
But we still want to improve the model.
We can improve it a bit by capturing people's satisfaction: the ratio of how many readers left a review to how many read the book, multiplied by the author's average rating.
Let's create another "Ace" of our own.
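Assuming the formula described above (reviews over ratings, weighted by the author's average rating; the column name is our own), the new feature can be sketched as:

```python
import pandas as pd

# Toy rows; the real frame holds the cleaned book features.
books = pd.DataFrame({
    "reviews_count": [500, 40],
    "ratings_count": [10000, 4000],
    "author_avg_rating": [4.2, 3.5],
})

# Share of raters who also wrote a review, weighted by how well the
# author's books are rated overall (assumed formula).
books["community_satisfaction"] = (
    books["reviews_count"] / books["ratings_count"]
) * books["author_avg_rating"]
```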

We were right

We can use community satisfaction to our advantage; it seems the community has a part in award-giving decisions.
But was this new "Ace" an important part of the equation of the prediction?
Let's check the new feature's weight.

The new feature has a higher weight in the equation than 2 other features, which means our feature is acceptable in our prediction model.

Did we finally get the best results we could?

In order to check that, we went on and created another test over random forests with many configurations, to see which configuration gets the best results.
The configuration options were as follows:

Reached the top?

So now we know the feature combination that gives the best prediction, but were we just lucky with the train/test split selection, or does our model actually work?
In order to make sure our model works, we decided to do the following:

  1. F-score - To determine the test's accuracy.
  2. Testing if there is another dataset randomness which will get similar results.

Proving our test results are valid by adding the F-score results of our tests
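Computing the F-score with sklearn can be sketched as follows, again on stand-in data rather than the real awarded/not-awarded labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stand-in data; the notebook scores the real book features and labels.
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
# F1 combines precision and recall, so it complements plain accuracy.
score = f1_score(y_test, model.predict(X_test))
```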

Checking list of random states of datasets

Conclusion

Predicting whether a book will get an award requires familiarity with the dataset (e.g. knowing who the author is may help distinguish a legitimate record from an outlier and trigger the appropriate action) and creativity in using existing data to create new features that can improve the model's prediction accuracy (e.g. using knowledge of the author to create a new dimension of data related to the number of awards the author has gathered).
However, after observing the results of so many predictions (decision trees), we think that our model can predict with high accuracy.

Credits:
Some of the functions used were taken from the Campus IL course (Intro to Data Science), e.g.:

  • visualizing the decision tree
  • detecting outliers using basic IQR
  • detecting outliers using z-score
Many thanks for sharing the knowledge.