Game Prediction System Part 2

Data Retrieval

_Data Source and version control_


Crawler with urllib and BS4

  • Something is different from the last part's website: the page on WAN PLUS has changed into a static webpage, so we can simply use BS4 to get what we want.
  • The first step is to get each game series' order (its numeric ID). The order matters because URLs such as http://www.wanplus.com/event/173.html differ only in that number (173 and so on), so we must find every series' order. At first I wanted to collect all the orders by hand, but that turned out to be far too much work, so I handed the job to BS4, as follows:
Code 1:
# games_order = dict()
# games_order['IEM_Oakland_2016'] = 311
# games_order['NEST_2016'] = 298
# games_order['KeSPA_2016'] = 303
# games_order['S_2016'] = 177
# games_order['LPL_Regional_Qualifier_2016'] = 250
# games_order['Demacia_2016'] = 230
# games_order['LMS_Summer_2016'] = 221
# games_order['LCS_NA_Summer_2016'] = 217
# games_order['LCS_EU_Summer_2016'] = 216
# games_order['LPL_Summer_2016'] = 218
# games_order['LCK_Summer_2016'] = 215
# games_order['MSI_2016'] = 178
# games_order['IEM_Katowice_2016'] = 176
# games_order['LMS_Spring_2016'] = 173
# games_order['LCS_NA_Spring_2016'] = 169
# games_order['LCS_EU_Spring_2016'] = 170
# games_order['LPL_Spring_2016'] = 165
# games_order['LCK_Spring_2016'] = 172
# games_order['IEM_Cologne_2015'] = 164
# games_order['NEST_2015'] = 154
# games_order['Demacia_2015'] = 158
# games_order['IEM_San_Jose_2015'] = 152
# games_order['KeSPA_2015'] = 157
# games_order['S_2015'] = 136
# ......
  • _These are some of the series orders I had found by hand before._
Code 2:
import sqlite3

# open (or create) the local database and rebuild the Games table
conn = sqlite3.connect('lol.sqlite')
cur = conn.cursor()
cur.executescript('''
DROP TABLE IF EXISTS Games;

CREATE TABLE Games (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    duration TEXT,
    series TEXT,
    team1 TEXT,
    team2 TEXT,
    result1 INTEGER,
    result2 INTEGER,
    bo INTEGER
);
''')
  • _Use SQL to create the table._
import urllib.request
from bs4 import BeautifulSoup

new_urls = []
# base_url is the WAN PLUS event list page (defined earlier, not shown); pages 1-8 are crawled
for i in range(8):
    url = base_url + str(i + 1)
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.findAll('div', class_='text_t')
    for link in links:
        print(link.find('a').get('href'))
        # num = re.findall('[0-9]+', link.find('a').get('href'))
        new_urls.append('http://www.wanplus.com' + link.find('a').get('href'))

print(new_urls)
  • _In this code segment I use BS4 to find every order inside the a tag's href attribute. At first I wanted to use a regex to pull out the numbers, but then I noticed the href text is already the URL path, so I simply append it to the site root._
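  • _For reference, the abandoned regex approach would have looked roughly like this (just a sketch; the example href mirrors the /event/173.html pattern above):_

import re

href = '/event/173.html'               # example href from the event list page
order = re.findall('[0-9]+', href)     # -> ['173'], the series order
print(order[0])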
Code 3:
for url in new_urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('div', class_='caption-outer')
    duration = title.find('span').text
    series = title.contents[1].text   # series name (renamed from str to avoid shadowing the built-in)
    if series[0] == ' ':
        series = series[1:]
    print(series)
    bo = 0
    games = soup.findAll('div', class_='match-team')
    for game in games:
        team = game.findAll('span', class_='team-name')
        result = game.find('em', class_='team-vs').findAll('i')
        team_name = [i.text for i in team]
        game_result = [int(i.text) for i in result]
        # guess the best-of format from the final score line
        if max(game_result) == 1:
            if sum(game_result) == 1:
                bo = 1
            elif sum(game_result) == 2:
                bo = 2
        elif max(game_result) == 2:
            bo = 3
        elif max(game_result) == 3:
            bo = 5
        cur.execute('''INSERT OR IGNORE INTO Games (series, duration, team1, team2, result1,
            result2, bo) VALUES ( ?, ?, ?, ?, ?, ?, ? )''',
            (series, duration, team_name[0], team_name[1],
             game_result[0], game_result[1], bo))
        print(bo)

conn.commit()
  • _By locating 'match-team', 'team-name' and 'team-vs' we get the team information and the game information (duration and series name), so we can put them into our 'lol' database._
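  • _As a quick sanity check (a sketch, not part of the original crawler), we can read a few rows back out of lol.sqlite to confirm the crawl and the schema line up:_

import sqlite3

conn = sqlite3.connect('lol.sqlite')
cur = conn.cursor()
# print the first few games we just stored
cur.execute('SELECT series, team1, team2, result1, result2, bo FROM Games LIMIT 5')
for row in cur.fetchall():
    print(row)
conn.close()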
  • There are also some tips and notes:

    • To select by class name in BS4, write it like title = soup.find('div', class_='caption-outer'); for other lookup styles, read the documentation (see the sketch after this list)
    • Nothing in the database should be UNIQUE except id
    • Let BS4 find anything you don't want to collect by hand
    • Don't forget to commit the database connection after everything is done
    • The way I classify the BO (best-of) number is still a problem; I hope to resolve it later
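  • _To expand on the first tip, these lookups are equivalent in BS4 (a small sketch on toy markup; class is a reserved word in Python, which is why BS4 spells the keyword argument class_):_

from bs4 import BeautifulSoup

html = '<div class="caption-outer"><span>2016-01-16 - 2016-04-23</span></div>'   # toy markup
soup = BeautifulSoup(html, 'html.parser')

title = soup.find('div', class_='caption-outer')             # keyword form (class_ avoids the reserved word)
title = soup.find('div', attrs={'class': 'caption-outer'})   # attrs-dict form
title = soup.select_one('div.caption-outer')                 # CSS-selector form
print(title.find('span').text)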
  • That's all for collecting the game information. In the next part we may start the machine learning portion. Very excited!

  • _See you_