Convert IMDB data to MySql tables using C++ – Part 1

OVERVIEW

IMDB , being the world’s best known and largest movie and TV website, does not provide any API to use their data. However , they provide datasets of their entire database in TSV files (tab separated values). They contain all the data pertaining to movies, TV shows and their cast and ratings.

The dataset can be downloaded from https://datasets.imdbws.com/ The files can be converted into various formats and databases for any work to be done on them. The explanation of the format and fields are given here: https://www.imdb.com/interfaces/ . This data dump is updated daily.

Please note that IMDB forbids any commercial usage of this data. It can be only used for personal and non-commercial use-cases.

In this blog series, we look at how to convert the tsv files into a normalized Mysql database.

In Part 1 we will only capture all the raw data from the tsv files into mysql tables. In the next part we will look at adding indexes and optimizing the database to remove duplication and redundancy.

WHY C++

Processing the TSV files is not complex and can be done in any high level language like Python or Java or PHP. However, because the size of the files go into hundreds of Mb and a couple of Gb, it can take very long to process all the files. I used C++ because thats probably the fastest performance one can get . Even on an old Intel Core2Duo machine, each of the tsv files did not take longer than 10-15 minutes at the most.

OBJECTIVE OF PART 1

What we are doing here, is process each tsv file and create an sql file which contains INSERT statements for each tsv row. Then, each of these sql files is processed directly in MySQL. This is faster than trying to do a direct Mysql insert using C++ code.

The code uses wxWidgets library for some of the functions. At some point in the future, I aim to make a GUI app to process and retrieve the mysql data so that is the reason why wxWidgets is part of the equation.

CODE SETUP

I am using this on Ubuntu but it can be easily run on Windows as well with minimal changes since I am using STL and wxWidgets which is cross-compatible with Unix, Windows and MacOS.

To set up wxWidgets on Windows, take a look at http://truelogic.org/wordpress/2021/12/06/setting-up-wxwidgets-environment-for-c-in-windows-10/

To set up wxWidgets in Ubuntu or Debian Linux, take a look at http://truelogic.org/wordpress/2021/10/31/setting-up-wxwidgets-environment-for-c-in-ubuntu-linux/

MYSQL TABLES

The mysql dump for the table structures is given below. This can be used to create the database

-- Adminer 4.7.7 MySQL dump

SET NAMES utf8;
SET time_zone = '+00:00';
SET foreign_key_checks = 0;
SET sql_mode = 'NO_AUTO_VALUE_ON_ZERO';

DROP TABLE IF EXISTS `name_basics`;
CREATE TABLE `name_basics` (
  `ID` bigint(20) NOT NULL AUTO_INCREMENT,
  `nconst` varchar(20) DEFAULT NULL,
  `primaryName` varchar(255) DEFAULT NULL,
  `birthYear` varchar(10) DEFAULT NULL,
  `deathYear` varchar(10) DEFAULT NULL,
  `primaryProfession` varchar(255) DEFAULT NULL,
  `knownForTitles` text,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


DROP TABLE IF EXISTS `title_akas`;
CREATE TABLE `title_akas` (
  `ID` bigint(20) NOT NULL AUTO_INCREMENT,
  `titleid` varchar(20) NOT NULL,
  `ordering` int(11) NOT NULL,
  `title` varchar(255) CHARACTER SET utf8 NOT NULL,
  `region` varchar(5) DEFAULT NULL,
  `language` varchar(25) DEFAULT NULL,
  `types` varchar(15) DEFAULT NULL,
  `attributes` varchar(50) DEFAULT NULL,
  `isOriginalTitle` smallint(6) DEFAULT NULL,
  PRIMARY KEY (`ID`),
  KEY `titleid_ordering` (`titleid`,`ordering`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


DROP TABLE IF EXISTS `title_basics`;
CREATE TABLE `title_basics` (
  `ID` bigint(20) NOT NULL AUTO_INCREMENT,
  `tconst` varchar(20) DEFAULT NULL,
  `titletype` varchar(15) DEFAULT NULL,
  `primaryTitle` varchar(255) DEFAULT NULL,
  `originalTitle` varchar(255) DEFAULT NULL,
  `isAdult` smallint(6) DEFAULT NULL,
  `startYear` varchar(10) DEFAULT NULL,
  `endYear` varchar(10) DEFAULT NULL,
  `runtimeMinutes` varchar(5) DEFAULT NULL,
  `genres` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


DROP TABLE IF EXISTS `title_crew`;
CREATE TABLE `title_crew` (
  `ID` bigint(20) NOT NULL AUTO_INCREMENT,
  `tconst` varchar(20) DEFAULT NULL,
  `directors` varchar(1000) DEFAULT NULL,
  `writers` varchar(1000) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


DROP TABLE IF EXISTS `title_episode`;
CREATE TABLE `title_episode` (
  `ID` bigint(20) NOT NULL AUTO_INCREMENT,
  `tconst` varchar(20) DEFAULT NULL,
  `parenttconst` varchar(20) DEFAULT NULL,
  `seasonNumber` varchar(5) DEFAULT NULL,
  `episodeNumber` varchar(5) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


DROP TABLE IF EXISTS `title_principals`;
CREATE TABLE `title_principals` (
  `ID` bigint(20) NOT NULL AUTO_INCREMENT,
  `tconst` varchar(20) DEFAULT NULL,
  `ordering` varchar(5) DEFAULT NULL,
  `nconst` varchar(20) DEFAULT NULL,
  `category` varchar(50) DEFAULT NULL,
  `job` varchar(255) DEFAULT NULL,
  `charac` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


DROP TABLE IF EXISTS `title_ratings`;
CREATE TABLE `title_ratings` (
  `ID` bigint(20) NOT NULL AUTO_INCREMENT,
  `tconst` varchar(20) DEFAULT NULL,
  `averageRating` decimal(5,2) DEFAULT NULL,
  `numVotes` mediumtext,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


-- 2021-12-16 14:05:28

SOURCE CODE

read-tsv.h
#include <wx/wxprec.h>
#ifndef WX_PRECOMP
	#include <wx/wx.h>
#endif
#include <wx/textfile.h>
#include <iostream>
#include <fstream>

class TSVReader {

	private:
	   wxString mPath;
	   wxString mTable;
	   wxString mFieldListSQL;
	   wxString getFieldList(wxString tablename);
	   wxString mSQLFileName;
	   std::ofstream mFileOut;
	   
	public:
	   TSVReader(wxString path, wxString table);
	   void readData();
	   void processLine(std::wstring, int lineCount);

};

// for each tsv file, define mysql table name, 
// field count and name of fields. 
// The field info ignores the ID column which is an autoincremented PK 
//
const static wxString TABLE_TITLE_AKAS = wxT("title_akas");
const static int TITLE_AKAS_FIELD_COUNT = 8;
const static wxString TITLE_AKAS_FIELDS[] = {wxT("titleid"), wxT("ordering"), wxT("title"), wxT("region"),
			wxT("language"), wxT("types"), wxT("attributes"), wxT("isOriginalTitle")};


const static wxString TABLE_TITLE_BASICS = wxT("title_basics");
const static int TITLE_BASICS_FIELD_COUNT = 9;
const static wxString TITLE_BASICS_FIELDS[] = {wxT("tconst"), wxT("titleType"), wxT("primaryTitle"), wxT("originalTitle"), wxT("isAdult"),
			wxT("startYear"), wxT("endYear"), wxT("runtimeMinutes"), wxT("genres")};


const static wxString TABLE_TITLE_CREW = wxT("title_crew");
const static int TITLE_CREW_FIELD_COUNT = 3;
const static wxString TITLE_CREW_FIELDS[] = {wxT("tconst"), wxT("directors"), wxT("writers")};

const static wxString TABLE_TITLE_EPISODE = wxT("title_episode");
const static int TITLE_EPISODE_FIELD_COUNT = 4;
const static wxString TITLE_EPISODE_FIELDS[] = {wxT("tconst"), wxT("parenttconst"), wxT("seasonNumber"), wxT("episodeNumber")};

const static wxString TABLE_TITLE_PRINCIPALS = wxT("title_principals");
const static int TITLE_PRINCIPALS_FIELD_COUNT = 6;
const static wxString TITLE_PRINCIPALS_FIELDS[] = {wxT("tconst"), wxT("ordering"), wxT("nconst"), wxT("category"), wxT("job"), wxT("charac")};

const static wxString TABLE_TITLE_RATINGS = wxT("title_ratings");
const static int TITLE_RATINGS_FIELD_COUNT = 3;
const static wxString TITLE_RATINGS_FIELDS[] = {wxT("tconst"), wxT("averageRating"), wxT("numVotes")};

const static wxString TABLE_NAME_BASICS = wxT("name_basics");
const static int NAME_BASICS_FIELD_COUNT = 6;
const static wxString NAME_BASICS_FIELDS[] = {wxT("nconst"), wxT("primaryName"), wxT("birthYear"), wxT("deathYear"), wxT("primaryProfession"),
			wxT("knownForTitles")};



read-tsv.cpp
#include <codecvt>
#include <sstream>
#include <wx/tokenzr.h>
#include "read-tsv.h"

/**
 * Ctor
 * @param path - path to tsv file
 * @param table - name of mysql table
 */
TSVReader::TSVReader(wxString path, wxString table) {
	mPath = path;
	mTable = table;
	mFieldListSQL = getFieldList(mTable);
	mSQLFileName = table.append(".sql");

	//delete sql file if it already exists
	if (wxFileExists(mSQLFileName))
		wxRemoveFile(mSQLFileName);
}

/**
 * Get the INSERT sql fragment with list of mysql field names based on the table name
 * @param tablename - name of mysql table
 * @return retVal - SQL insert statement prefix
 */
wxString TSVReader::getFieldList(wxString tablename) {
	wxString retVal = wxT("(");
	if (mTable == TABLE_TITLE_AKAS) {
		for(int i = 0; i < TITLE_AKAS_FIELD_COUNT; i++) {
			retVal.Append(wxT("`")).Append(TITLE_AKAS_FIELDS[i]).Append(wxT("`,"));
		}
	}
	else if (mTable == TABLE_TITLE_BASICS) {
		for(int i = 0; i < TITLE_BASICS_FIELD_COUNT; i++) {
			retVal.Append(wxT("`")).Append(TITLE_BASICS_FIELDS[i]).Append(wxT("`,"));
		}
	}
	else if (mTable == TABLE_TITLE_CREW) {
		for(int i = 0; i < TITLE_CREW_FIELD_COUNT; i++) {
			retVal.Append(wxT("`")).Append(TITLE_CREW_FIELDS[i]).Append(wxT("`,"));
		}
	}
	else if (mTable == TABLE_TITLE_EPISODE) {
		for(int i = 0; i < TITLE_EPISODE_FIELD_COUNT; i++) {
			retVal.Append(wxT("`")).Append(TITLE_EPISODE_FIELDS[i]).Append(wxT("`,"));
		}
	}
	else if (mTable == TABLE_TITLE_PRINCIPALS) {
		for(int i = 0; i < TITLE_PRINCIPALS_FIELD_COUNT; i++) {
			retVal.Append(wxT("`")).Append(TITLE_PRINCIPALS_FIELDS[i]).Append(wxT("`,"));
		}
	}
	else if (mTable == TABLE_TITLE_RATINGS) {
		for(int i = 0; i < TITLE_RATINGS_FIELD_COUNT; i++) {
			retVal.Append(wxT("`")).Append(TITLE_RATINGS_FIELDS[i]).Append(wxT("`,"));
		}
	}
	else if (mTable == TABLE_NAME_BASICS) {
		for(int i = 0; i < NAME_BASICS_FIELD_COUNT; i++) {
			retVal.Append(wxT("`")).Append(NAME_BASICS_FIELDS[i]).Append(wxT("`,"));
		}
	}


	// remove trailing comma
	if (retVal != wxT("("))
		retVal = retVal.Left(retVal.Len()-1);
	retVal.Append(wxT(") "));
	return retVal;
}

/**
 * Read data line by line from tsv file. Each line is then processed to generate an INSERT sql statement. 
 * These statements are appended to an sql file.
 */
void TSVReader::readData() {

	std::wifstream infile(mPath);
	infile.imbue(std::locale(infile.getloc(),
	       new std::codecvt_utf8<wchar_t, 0x10ffff>));
	std::wstring line;

	mFileOut.open(mSQLFileName, std::ios_base::app);
	
	long lineCount = 0;
	wxPuts(wxT("Reading ") + mPath);
	bool isTitleRow = true; // first row is column titles

	while (getline(infile, line)) {
		if (isTitleRow) {
			isTitleRow = false;
			continue;
		}
		processLine(line, lineCount);
		lineCount++;

	}
	mFileOut.close();
	wxPuts(wxString::Format(wxT("%ld lines"), lineCount));
}

/**
 * Convert each line in tsv file to an INSERT sql statement
 * @param line - tsv line 
 * @param lineCount - counter of lines read from tsv file
 */
void TSVReader::processLine(std::wstring line, int lineCount) {
	wxString sql = wxT("insert into ");
	sql.Append(mTable).Append(mFieldListSQL).Append(wxT("values ("));
	wxString token = wxT("");
	wxStringTokenizer tokenizer(line, "\t");
	while (tokenizer.HasMoreTokens()) {
		token = tokenizer.GetNextToken();
		token.Replace(wxT("'"), wxT("\\;"));
		sql.Append("'").Append(token).Append("',");
	}
	// remove trailing comma
	if (token != wxT(""))
	     sql = sql.Left(sql.Len()-1);
	sql.Append(wxT(");"));
	mFileOut << sql << std::endl;

}

main.cpp
#include "read-tsv.h"

///
// First arg is the tsv filename, Second arg is the mysql table name
///
int main(int argc, char **argv) {
	wxInitialize();
	wxString filename = wxT("");
	wxString table = wxT("");
	if (argc > 2) {
		filename = argv[1];
		table = argv[2];
	} else {
	   wxPuts("No tsv file and/or tablename specified");
	   return -1;
	}
	wxPuts("Reading file.." + filename);
	TSVReader tsv(filename, table);
	tsv.readData();
	wxPuts("Closed file");
	wxUninitialize();
}

To build the source under Windows change the sample makefile as provided in the Windows setup of wxWidgets blog post. For Ubuntu use

g++ main.cpp read-tsv.cpp ``wx-config --cxxflags --libs`` -o read-tsv

Use single backticks in the above command and not double backticks as shown above

EXECUTION

For each tsv file, run the command

./read-tsv <tsv filename> <table name>
Eg. ./read-tsv name.basics.tsv name_basics

Once you have run this for all the tsv files, you will end up with corresponding sql files for each table. Process each of these sql files in mysql . You can do either by indirection at the command prompt:

mysql --user=username --password=mypassword imdb_db < name_basics.sql

Or you can do it using the source command:

mysql --user=username --password=mypassword 
> use imdb_db;
> source 'name_basics.sql';

Note that if you run the sql file twice for some reason, be sure to truncate the table first, otherwise it will insert duplicate data.

The table screenshots with the imported data are shown below:

NEXT STEPS

In the next blog post, we will look into optimizing the database and cleaning it up a bit.

Be the first to comment

Leave a Reply

Your email address will not be published.


*