Game Prediction System Part 2

Data Retrieval

_Data Source and version control_


Crawler with urllib and BS4

  • Something is different from the last part's website: the page on WAN PLUS has changed into a static webpage, so we can simply use BS4 to get what we want.
  • The first step is to get each game series' order (its numeric ID). The order matters because URLs such as http://www.wanplus.com/event/173.html differ only in that number (173 and so on), so we must find every series' order. At first I wanted to collect all the orders by hand, but that turned out to be far too much work, so I handed the job to BS4, as follows:
Code 1:
# games_order = dict()
# games_order['IEM_Oakland_2016'] = 311
# games_order['NEST_2016'] = 298
# games_order['KeSPA_2016'] = 303
# games_order['S_2016'] = 177
# games_order['LPL_Regional_Qualifier_2016'] = 250
# games_order['Demacia_2016'] = 230
# games_order['LMS_Summer_2016'] = 221
# games_order['LCS_NA_Summer_2016'] = 217
# games_order['LCS_EU_Summer_2016'] = 216
# games_order['LPL_Summer_2016'] = 218
# games_order['LCK_Summer_2016'] = 215
# games_order['MSI_2016'] = 178
# games_order['IEM_Katowice_2016'] = 176
# games_order['LMS_Spring_2016'] = 173
# games_order['LCS_NA_Spring_2016'] = 169
# games_order['LCS_EU_Spring_2016'] = 170
# games_order['LPL_Spring_2016'] = 165
# games_order['LCK_Spring_2016'] = 172
# games_order['IEM_Cologne_2015'] = 164
# games_order['NEST_2015'] = 154
# games_order['Demacia_2015'] = 158
# games_order['IEM_San_Jose_2015'] = 152
# games_order['KeSPA_2015'] = 157
# games_order['S_2015'] = 136
# ......
  • _These are some of the series orders I had found by hand before._
Code 2:
import sqlite3

# open (or create) the local database and rebuild the Games table
conn = sqlite3.connect('lol.sqlite')
cur = conn.cursor()
cur.executescript('''
DROP TABLE IF EXISTS Games;

CREATE TABLE Games (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    duration TEXT,
    series TEXT,
    team1 TEXT,
    team2 TEXT,
    result1 INTEGER,
    result2 INTEGER,
    bo INTEGER
);
''')
  • _Use SQL to create the table._
import urllib.request
from bs4 import BeautifulSoup

new_urls = []
# base_url is the WAN PLUS event list page (defined earlier, not shown); pages 1-8 are crawled
for i in range(8):
    url = base_url + str(i + 1)
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.findAll('div', class_='text_t')
    for link in links:
        print(link.find('a').get('href'))
        # num = re.findall('[0-9]+', link.find('a').get('href'))
        new_urls.append('http://www.wanplus.com' + link.find('a').get('href'))

print(new_urls)
  • _In this code segment I use BS4 to find every order inside the a tag's href attribute. At first I wanted to use a regex to pull out the numbers, but then I noticed the href text is already the URL path, so I simply append it to the site root._
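  • _For reference, the abandoned regex approach would have looked roughly like this (just a sketch; the example href mirrors the /event/173.html pattern above):_

import re

href = '/event/173.html'               # example href from the event list page
order = re.findall('[0-9]+', href)     # -> ['173'], the series order
print(order[0])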
Code 3:
for url in new_urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('div', class_='caption-outer')
    duration = title.find('span').text
    series = title.contents[1].text   # series name (renamed from str to avoid shadowing the built-in)
    if series[0] == ' ':
        series = series[1:]
    print(series)
    bo = 0
    games = soup.findAll('div', class_='match-team')
    for game in games:
        team = game.findAll('span', class_='team-name')
        result = game.find('em', class_='team-vs').findAll('i')
        team_name = [i.text for i in team]
        game_result = [int(i.text) for i in result]
        # guess the best-of format from the final score line
        if max(game_result) == 1:
            if sum(game_result) == 1:
                bo = 1
            elif sum(game_result) == 2:
                bo = 2
        elif max(game_result) == 2:
            bo = 3
        elif max(game_result) == 3:
            bo = 5
        cur.execute('''INSERT OR IGNORE INTO Games (series, duration, team1, team2, result1,
            result2, bo) VALUES ( ?, ?, ?, ?, ?, ?, ? )''',
            (series, duration, team_name[0], team_name[1],
             game_result[0], game_result[1], bo))
        print(bo)

conn.commit()
  • _By locating 'match-team', 'team-name' and 'team-vs' we get the team information and the game information (duration and series name), so we can put them into our 'lol' database._
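  • _As a quick sanity check (a sketch, not part of the original crawler), we can read a few rows back out of lol.sqlite to confirm the crawl and the schema line up:_

import sqlite3

conn = sqlite3.connect('lol.sqlite')
cur = conn.cursor()
# print the first few games we just stored
cur.execute('SELECT series, team1, team2, result1, result2, bo FROM Games LIMIT 5')
for row in cur.fetchall():
    print(row)
conn.close()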
  • There are also some tips and notes:

    • To select by class name in BS4, write it like title = soup.find('div', class_='caption-outer'); for other lookup styles, read the documentation (see the sketch after this list)
    • Nothing in the database should be UNIQUE except id
    • Let BS4 find anything you don't want to collect by hand
    • Don't forget to commit the database connection after everything is done
    • The way I classify the BO (best-of) number is still a problem; I hope to resolve it later
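  • _To expand on the first tip, these lookups are equivalent in BS4 (a small sketch on toy markup; class is a reserved word in Python, which is why BS4 spells the keyword argument class_):_

from bs4 import BeautifulSoup

html = '<div class="caption-outer"><span>2016-01-16 - 2016-04-23</span></div>'   # toy markup
soup = BeautifulSoup(html, 'html.parser')

title = soup.find('div', class_='caption-outer')             # keyword form (class_ avoids the reserved word)
title = soup.find('div', attrs={'class': 'caption-outer'})   # attrs-dict form
title = soup.select_one('div.caption-outer')                 # CSS-selector form
print(title.find('span').text)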
  • That's all for collecting the game information. In the next part we may start the machine learning portion. Very excited!

  • _See you_