How to parse nested table from HTML link using BeautifulSoup in Python?
All,
I am trying to parse a table from this link: http://web1.ncaa.org/stats/StatsSrv/careersearch.
Please note: under "School/Sport Search", select School: All, Year: 2005-2006, Sport: Football, Division: I. The column I am trying to parse is the school names; clicking a school name opens a page with more information, and from that page I would like to parse the "Stadium Capacity" for each and every school. Is something like this possible? If yes, how? I am new to Python and BeautifulSoup, so an explanation would be great!
Note: there are 239 results.
To summarize: I would like to parse the school names along with their stadium capacities and convert the result into a pandas DataFrame.
import requests
from bs4 import BeautifulSoup
URL = "http://web1.ncaa.org/stats/StatsSrv/careerteam"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())
Tags: python-3.x, pandas, beautifulsoup, html-parsing, html-parser
asked Jan 19 at 3:09 – Data_is_Power
1 Answer
Is something like this possible?
Yes.
If yes, how?
There is a lot going on in the code below, but the main point is to figure out the POST requests the browser makes and then emulate them with Requests. You can find the requests being made through the "Network" tab of the browser's inspect tool.
First we make the 'search' POST request. This returns a left and a right table. Clicking an entry in the left table lists the schools in that area, and if we observe carefully, clicking an area link is itself a POST request (which we have to reproduce with Requests).
E.g. clicking on 'Air Force - Eastern Ill.' gives us a table containing the links to the schools in that area. We then have to follow each school link and find the capacity.
Since clicking each school link is yet another POST request, we emulate that too; it returns the school page, from which we scrape the school name and stadium capacity.
You can read Advanced Usage of Requests to learn about Session objects, and Making a Request to learn about making requests with Requests.
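As a minimal sketch of just that first step (the URL and form fields are the same ones used in the full script below), a requests.Session is used so cookies persist across the whole search -> area -> school chain of POSTs:

import requests

# A Session persists cookies between requests, so the server sees the
# search -> area -> school chain of POSTs as one continuous visit.
s = requests.Session()
URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"

# The same form fields the browser sends (visible in the Network tab):
# all schools ('X'), academic year 2005-06, football ('MFB'), Division I.
search_form = {'doWhat': 'teamSearch', 'searchOrg': 'X',
               'academicYear': 2006, 'searchSport': 'MFB', 'searchDiv': 1}
r = s.post(URL, data=search_form)
print(r.status_code)  # 200 means the results page came back

The full script builds on this: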
import requests
from bs4 import BeautifulSoup
import pandas as pd

end_list = []
s = requests.Session()
URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"

# Emulate the search form POST: all schools, 2005-06, football (MFB), Division I.
data = {'doWhat': 'teamSearch', 'searchOrg': 'X', 'academicYear': 2006,
        'searchSport': 'MFB', 'searchDiv': 1}
r = s.post(URL, data=data)
soup = BeautifulSoup(r.text, 'html.parser')

area_list = soup.find_all('table')[8].find_all('tr')
# One tr per area, plus one extra tr: 'Total Results of Search: 239'
area_count = len(area_list)

for idx in range(0, area_count):
    # Emulate clicking an area link (another POST).
    data = {
        'sortOn': 0,
        'doWhat': 'showIdx',
        'playerId': '',
        'coachId': '',
        'orgId': '',
        'academicYear': '',
        'division': '',
        'sportCode': '',
        'idx': idx
    }
    r = s.post(URL, data=data)
    soup = BeautifulSoup(r.text, 'html.parser')
    last_table = soup.find_all('table')[-1]  # the schools table is the last one
    for tr in last_table.find_all('tr'):
        link_td = tr.find('td', class_="text")
        try:
            # The school link's href is a JavaScript call; pull out its arguments.
            link_a = link_td.find('a')['href']
            data_params = link_a.split('(')[1][:-2].split(',')
            try:
                sports_code = data_params[2].replace("'", "").strip()
                division = int(data_params[3])
                player_coach_id = int(data_params[0])
                academic_year = int(data_params[1])
                org_id = int(data_params[4])
                # Emulate clicking the school link (a third POST).
                data = {
                    'sortOn': 0,
                    'doWhat': 'display',
                    'playerId': player_coach_id,
                    'coachId': player_coach_id,
                    'orgId': org_id,
                    'academicYear': academic_year,
                    'division': division,
                    'sportCode': sports_code,
                    'idx': ''
                }
                url = 'http://web1.ncaa.org/stats/StatsSrv/careerteam'
                r = s.post(url, data=data)
                soup2 = BeautifulSoup(r.text, 'html.parser')
                # School name and stadium capacity sit at fixed positions on the page.
                institution_name = soup2.find_all('table')[1].find_all('tr')[2].find_all('td')[1].text.strip()
                capacity = soup2.find_all('table')[4].find_all('tr')[2].find_all('td')[1].text.strip()
                end_list.append([institution_name, capacity])
            except IndexError:
                pass
        except AttributeError:
            # Rows without a school link have no td.text / anchor.
            pass

headers = ['School', 'Capacity']
df = pd.DataFrame(end_list, columns=headers)
print(df)
Output
School Capacity
0 Air Force 46,692
1 Akron 30,000
2 Alabama 101,821
3 Alabama A&M 21,000
4 Alabama St. 26,500
5 Albany (NY) 8,500
6 Alcorn 22,500
7 Appalachian St. 30,000
8 Arizona 55,675
9 Arizona St. 64,248
10 Ark.-Pine Bluff 14,500
11 Arkansas 72,000
12 Arkansas St. 30,708
13 Army West Point 38,000
14 Auburn 87,451
15 Austin Peay 10,000
16 BYU 63,470
17 Ball St. 22,500
18 Baylor 45,140
19 Bethune-Cookman 9,601
20 Boise St. 36,387
21 Boston College 44,500
22 Bowling Green 24,000
23 Brown 20,000
24 Bucknell 13,100
25 Buffalo 29,013
26 Butler 5,647
27 Cal Poly 11,075
28 California 62,467
29 Central Conn. St. 5,500
.. ... ...
209 UCLA 91,136
210 UConn 40,000
211 UNI 16,324
212 UNLV 36,800
213 UT Martin 7,500
214 UTEP 52,000
215 Utah 45,807
216 Utah St. 25,100
217 VMI 10,000
218 Valparaiso 5,000
219 Vanderbilt 40,350
220 Villanova 12,000
221 Virginia 61,500
222 Virginia Tech 65,632
223 Wagner 3,300
224 Wake Forest 31,500
225 Washington 70,138
226 Washington St. 32,740
227 Weber St. 17,500
228 West Virginia 60,000
229 Western Caro. 13,742
230 Western Ill. 16,368
231 Western Ky. 22,113
232 Western Mich. 30,200
233 William & Mary 12,400
234 Wisconsin 80,321
235 Wofford 13,000
236 Wyoming 29,181
237 Yale 64,269
238 Youngstown St. 20,630
[239 rows x 2 columns]
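One optional follow-up, using standard pandas calls that are not part of the original answer: the Capacity column is scraped as comma-formatted text, so you may want to convert it to integers, and given the long run time noted below it is worth caching the result to disk (the file name here is arbitrary):

# Capacity comes back as text like '101,821'; strip the commas and cast to int.
df['Capacity'] = df['Capacity'].str.replace(',', '', regex=False).astype(int)

# Save the table so the long scrape doesn't have to be repeated.
df.to_csv('stadium_capacities.csv', index=False)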
Note:
This will take a long time: we are scraping more than 239 pages, so be patient. It might take 15 minutes or longer.
answered Jan 19 at 6:12, edited Jan 19 at 6:30 – Bitto Bennichan
Would it be possible for you to resolve this query: stackoverflow.com/questions/54279547/… I cannot find Params under the Network data for this query
– Data_is_Power
Jan 20 at 23:27
@Data_is_Power I will try. I am not at home right now. I will let you know ASAP.
– Bitto Bennichan
Jan 20 at 23:29
Great. Thank you so much again! :)
– Data_is_Power
Jan 20 at 23:31