LightFM Dataset Helper - Python package

March 06, 2020

LightFM Dataset helper

A lightweight Python package that helps prepare dataframes (loaded from CSV files, for example) for the LightFM module, so that training is easier to set up.

Install

Simply install from the Python Package Index (PyPI):

pip install lightfm-dataset-helper

or install manually from a released wheel.

Example

Import the module:

from lightfm_dataset_helper.lightfm_dataset_helper import DatasetHelper

Preparing the Dataframe and the required info

Loading the CSV files

# using pandas to load csv files
import pandas as pd

def read_csv(filename):
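    # on_bad_lines="skip" (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
    return pd.read_csv(filename, sep=";", on_bad_lines="skip", encoding="latin-1", low_memory=False)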

books = read_csv("Data/BX-Books.csv")
users = read_csv("Data/BX-Users.csv")
ratings = read_csv("Data/BX-Book-Ratings.csv")

Column definitions

items_column = "ISBN"
user_column = "User-ID"
ratings_column = "Book-Rating"

items_feature_columns = [
    "Book-Title",
    "Book-Author",
    "Year-Of-Publication",
    "Publisher",
]

user_features_columns = ["Location", "Age"]
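Because the helper is driven entirely by these column names, a quick check that each name actually exists in its dataframe can catch typos before training. The sketch below uses a small stand-in dataframe, and `missing_columns` is a hypothetical function written for this illustration; neither is part of lightfm-dataset-helper.

```python
import pandas as pd

def missing_columns(dataframe, required):
    """Return the required column names that are absent from the dataframe."""
    return [column for column in required if column not in dataframe.columns]

# Stand-in data using BX-Users column names (the values are made up).
toy_users = pd.DataFrame({"User-ID": [1, 2], "Location": ["a", "b"]})

# "Age" is deliberately left out to show what a failed check returns.
absent = missing_columns(toy_users, ["User-ID", "Location", "Age"])
print(absent)
```

An empty list means every configured column is present; the same check could be run on the books and ratings dataframes with their own column lists.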

Optional: testing on a small amount of data (500 ratings)

# cut the data down to 500 ratings to save time, keeping only the books and users those ratings reference
Test_amount = 500
ratings = ratings[:Test_amount]
books = books[books[items_column].isin(ratings[items_column])]
users = users[users[user_column].isin(ratings[user_column])]
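To see what the `isin` filtering above does, here is a minimal sketch on made-up data. The toy dataframes and their values are assumptions for illustration only; the column names follow the Book-Crossing files.

```python
import pandas as pd

# Made-up stand-ins for the ratings, books and users tables.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [5, 0, 8, 10],
})
toy_books = pd.DataFrame({"ISBN": ["A", "B", "C", "D"]})
toy_users = pd.DataFrame({"User-ID": [1, 2, 3, 4]})

# Keep only the first two ratings, then keep only the books and users
# that those two ratings refer to.
sample = toy_ratings[:2]
kept_books = toy_books[toy_books["ISBN"].isin(sample["ISBN"])]
kept_users = toy_users[toy_users["User-ID"].isin(sample["User-ID"])]

print(kept_books["ISBN"].tolist())
print(kept_users["User-ID"].tolist())
```

Filtering the item and user tables against the sampled ratings keeps the three dataframes aligned with each other, which is the purpose of the two `isin` lines in the original snippet.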

Creating the helper instance

Feed the dataframes and column definitions to the helper:

dataset_helper_instance = DatasetHelper(
    users_dataframe=users,
    items_dataframe=books,
    interactions_dataframe=ratings,
    item_id_column=items_column,
    items_feature_columns=items_feature_columns,
    user_id_column=user_column,
    user_features_columns=user_features_columns,
    interaction_column=ratings_column,
    clean_unknown_interactions=True,
)

Run the routine. You can also run the steps separately, one by one; the routine function simply streamlines the flow.

dataset_helper_instance.routine()

After running the routine, we can feed the dataset to the LightFM class:

from lightfm import LightFM

model = LightFM(no_components=24, loss="warp", k=15)
model.fit(
    interactions=dataset_helper_instance.interactions,
    sample_weight=dataset_helper_instance.weights,
    item_features=dataset_helper_instance.item_features_list,
    user_features=dataset_helper_instance.user_features_list,
    verbose=True,
    epochs=10,
    num_threads=20,
)

The model is fitted successfully; with verbose=True the output is:

Epoch 0
Epoch 1
.
.
.
Epoch 8
Epoch 9

Used Dataset

The example uses the Book-Crossing books dataset.

The Book-Crossing dataset comprises 3 tables.

BX-Users
Contains the users. Note that user IDs (`User-ID`) have been anonymized and map to integers. Demographic data is provided (`Location`, `Age`) if available. Otherwise, these fields contain NULL-values.
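Since `Location` and `Age` may be missing, it can help to count the gaps before using them as user features. The sketch below runs on made-up rows; the placeholder values ("unknown" and -1) are assumptions chosen for illustration, and the source does not say whether the helper needs this step.

```python
import pandas as pd

# Made-up rows using BX-Users column names, with some missing values.
toy_users = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "Location": ["x, y, z", None, "a, b, c"],
    "Age": [34.0, None, None],
})

# Count the missing entries in each demographic column.
missing_per_column = toy_users[["Location", "Age"]].isna().sum()
print(missing_per_column)

# Replace missing values with explicit placeholders (an illustrative choice).
cleaned = toy_users.copy()
cleaned["Location"] = cleaned["Location"].fillna("unknown")
cleaned["Age"] = cleaned["Age"].fillna(-1).astype(int)
```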

BX-Books
Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. These URLs point to the Amazon web site.

BX-Book-Ratings
Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
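The explicit/implicit distinction described above can be made visible with a simple boolean mask. This is a minimal sketch on made-up ratings; whether to train on explicit ratings, implicit ones, or both is a modelling decision the source does not address.

```python
import pandas as pd

# Made-up ratings using BX-Book-Ratings column names.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [0, 7, 10, 0],
})

# A rating of 0 marks an implicit interaction; 1-10 is an explicit score.
is_implicit = toy_ratings["Book-Rating"] == 0
implicit = toy_ratings[is_implicit]
explicit = toy_ratings[~is_implicit]

print(len(implicit), len(explicit))
```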

MIT license

GitHub repo: link