Working with original data in DataScience lectures

Hi,

I am trying to work with the real datasets. For example, in the Merge section I try to retrieve ratings.csv and books.csv directly from GitHub, rather than use the datasets prepared in the QEDS package.
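
(For reference, I believe pandas can read a CSV straight from a raw GitHub URL, so the file does not have to be saved by hand first. A minimal sketch, where the URL is only a placeholder for whichever raw.githubusercontent.com link exposes books.csv:)

import pandas as pd

# Placeholder URL: substitute the actual "Raw" link of books.csv from the
# repository you are downloading from.
RAW_URL = "https://raw.githubusercontent.com/<user>/<repo>/master/books.csv"

books = pd.read_csv(RAW_URL)
print(books.shape)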

I was able to do most of it, but it took me a while, and there are still some minor problems with the titles of 6 observations whose names contain a quotation mark in the middle or a double quotation mark (e.g. “”). However, my code ended up unnecessarily long (not to mention the Big-O issues: if the datasets were much bigger, what I did would be really slow) and awkward (I wrote some lines by hand for problematic titles). When I looked into the QEDS GitHub repository for the function that retrieves the data, it involves things that are beyond me at this point (for example, what is LOGGER?).
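
(From what I can tell, LOGGER there is probably just a standard-library logging object rather than anything QEDS-specific; a minimal sketch of that pattern, with a made-up message, in case I am reading it right:)

import logging

# A module-level logger, used to print progress/diagnostic messages while
# the data are downloaded and cleaned.
logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)

LOGGER.info("downloading books.csv ...") # appears as an INFO message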

Below is how I retrieved the dataset for books.csv only. It is more complicated than retrieving ratings.csv. Here are the problems:

  • The file is written as a CSV, but when I download it, it becomes *.txt for some reason unknown to me.

  • The text file is encoded differently than I expected, so I need to pass the keyword argument encoding="utf8" when opening it.

  • The quotation marks (") make things harder. When put into the DataFrame, they mess up the authors' names and the titles. It is even worse when a title has more than one " in the middle. (All three issues are handled in the sketch after this list.)
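
For reference, pandas seems to handle all three of these at once, provided the file follows standard CSV quoting rules. A minimal sketch, assuming the file is saved locally as books.txt (the *.txt extension does not matter to the parser) and that the header names the columns book_id, authors and title as in the QEDS version:

import pandas as pd

# pandas' CSV parser understands quoted fields, so commas and doubled
# quotation marks ("") inside the authors/title columns are handled for us.
books = pd.read_csv(
    "books.txt",     # the extension is irrelevant; the content is still CSV
    encoding="utf8", # same encoding fix as in the second bullet
    quotechar='"',   # the pandas default, spelled out for clarity
)[["book_id", "authors", "title"]]
print(books.head())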

If you could add an appendix explaining how you prepared each of the datasets used in the QEDS lectures, that would be great.

Thanks!

PS: I checked that the code below produces values almost identical to the QEDS version for the authors' names, but not for the titles. There are 6 observations whose titles are mismatched because of the double quotation marks, although they at least convey the same information.

import pandas as pd
fin = open('books.txt', encoding="utf8") # there is an error if the encoding is not specified
line = fin.readline() # read the header line
line = line.split(',') # the file is comma-separated, so .split() turns the line into a list of strings
line[-1] = line[-1].replace('\n','') # remove the newline character from the last column name
book_id = []
authors  = []
title = []
### Comma issue: any field that contains commas is wrapped in quotation marks ("), so after splitting it begins with one.
for line in fin:
    index3 = 10 # default column index of the title (used in the single-author branch below)
    line = line.split(',') 
    line[0] = int(line[0]) # line[0] is book_id; convert it to an integer
    book_id.append(line[0])
    if book_id[-1] == 9265: # debugging output for one of the problematic rows
        print(line)
    if '"' in line[7]: # line[7] is author's name(s).
        line[7] = line[7].replace('"','')
        # 100 is just a made-up upper bound, assuming no book has anywhere near 90 authors.
        for i in range(8,100):  # scan the columns from 8 onward
            if '"' in line[i]:
                line[i] = line[i].replace('"','')
                line[7] = ','.join(line[7:i+1])
                index = i+3 # this will be the column index for the title
                index2 = i+2 # this is the column index for the full (original) title
                break # break the for loop so that we don't actually go to 100.
        authors.append(line[7])
        if '"' in line[index2]: # check if full title may have commas
            if line[index2].count('"') == 1:
                for k in range(index2+1, index2+10): # This column is not used so we just shift it ahead
                    if '"' in line[k]: # see which index where the last comma end
                        indexk = k # record this index
                        break
                index = indexk + 1
        if '"' in line[index]:
            line[index] = line[index].strip('"')
            # hoping that the title does not contain too many commas
            for j in range(index, index+10): # using index + 100 here would cause an index-out-of-range error
                if '"' in line[j]:
                    line[j] = line[j].replace('"','')
                    line[index] = ','.join(line[index:j+1])
                    break
        title.append(line[index]) # This exhausts all cases with multiple authors and commas in title
    else:
        authors.append(line[7]) # this is the case where the author field has no commas (single author)
        if '"' in line[9]: # check if this full title may have comma(s) 
            if (line[9].count('"') == 1) and (book_id[-1] != 6581) and (book_id[-1] != 9265):
                for k in range(10, 20): # this column is not kept, so just skip past its quoted part
                    if '"' in line[k]: # find the column where the quoted field ends
                        indexk = k # record this index
                        break
                index3 = indexk + 1
            elif (book_id[-1] == 6582): # (book_id[-1] == 9610):
                for k in range(10, 20): # this column is not kept, so just skip past its quoted part
                    if '"' in line[k]: # find the column where the quoted field ends
                        indexk = k # record this index
                        break
                index3 = indexk + 1
            elif (book_id[-1] == 9265): 
                index3 = 12        
        if '"' in line[index3]: # line[10] is title's name for book written by 1 author
            if (line[index3].count('"') == 1) & (book_id[-1] != 9265) : 
                line[index3] = line[index3].replace('"','')
                for j in range(index3+1, index3+10):
                    if '"' in line[j]:
                        line[j] = line[j].replace('"','')
                        line[index3] = ','.join(line[index3:j+1])
                        break
            elif book_id[-1] == 9265:
                line[index3] = ','.join(line[index3:index3+3])
        title.append(line[index3].strip('"')) # strip any surrounding quotation marks
fin.close()

books = pd.DataFrame()
books['book_id'] = book_id
books['authors'] = authors
books['title'] = title
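
For the check described in the PS, a comparison along these lines could list the mismatched titles. It assumes the QEDS version of the table has already been loaded into a DataFrame called books_qeds (a hypothetical name) with the same book_id, authors and title columns:

# books_qeds is assumed to hold the QEDS version of the same table.
merged = books.merge(books_qeds, on="book_id", suffixes=("_mine", "_qeds"))
mismatched = merged[merged["title_mine"] != merged["title_qeds"]]
print(len(mismatched)) # expected to be the 6 problematic observations
print(mismatched[["book_id", "title_mine", "title_qeds"]])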