How to parse nested table from HTML link using BeautifulSoup in Python?



























All,



I am trying to parse a table from this link: http://web1.ncaa.org/stats/StatsSrv/careersearch.
Please note: for searching under "School/Sport Search", select School: All, Year: 2005-2006, Sport: Football, Division: I. The column I am trying to parse is the school names; if you click on a school name, more information is shown. From that linked table I would like to parse the "Stadium Capacity" for each and every school. My question is: is something like this possible? If yes, how? I am new to Python and BeautifulSoup, so if you can provide an explanation that would be great!



Note: there are 239 results.



To summarize: basically I would like to parse the school names along with their stadium capacities and convert the result into a pandas DataFrame.



import requests
from bs4 import BeautifulSoup

URL = "http://web1.ncaa.org/stats/StatsSrv/careerteam"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())




































      python-3.x pandas beautifulsoup html-parsing html-parser






      asked Jan 19 at 3:09









      Data_is_Power

          1 Answer






          My question is: is something like this possible?




          Yes.




          If yes, how?




          There is a lot going on in the code below, but the main point is to figure out the POST requests the browser makes and then emulate them using Requests. We can see each request being made through the "Network" tab of the browser's inspect tool.



          First we make the 'search' POST request. This returns a left and a right table. Clicking an entry in the left table gives us the schools in that area. If we observe carefully, clicking on an area link is itself a POST request, which we have to reproduce with Requests.



          E.g., clicking on 'Air Force - Eastern Ill.' gives us a table containing links to the schools in that area. We then have to follow each school link and extract the capacity.
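Those school links are javascript: hrefs whose arguments carry the IDs needed for the next POST. A minimal sketch of pulling the arguments out of such a href; the function name and values here are made up for illustration, only the split logic mirrors what the code below does:

```python
# Hypothetical href, modeled on the pages' javascript: links: everything
# between '(' and the trailing ');' is a comma-separated argument list.
href = "javascript:showTeamResults(30123, 2006, 'MFB', 1, 721);"

# Take the text after the first '(', drop the trailing ');',
# then split on commas.
data_params = href.split('(')[1][:-2].split(',')

player_coach_id = int(data_params[0])
academic_year = int(data_params[1])
sports_code = data_params[2].replace("'", "").strip()
division = int(data_params[3])
org_id = int(data_params[4])

print(player_coach_id, academic_year, sports_code, division, org_id)
```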



          Clicking on each school link is also a POST request we have to emulate; it returns the school page, from which we scrape the school name and stadium capacity.



          You can read Advanced Usage of Requests to learn about Session objects, and Making a request to learn about making requests with Requests.
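A Session keeps any cookies the server sets across requests, so the follow-up POSTs behave like clicks in the same browser tab. A minimal sketch of setting one up with the search payload used below; no request is actually sent here, and the helper name is just for illustration:

```python
import requests

SEARCH_URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"

def make_search_session():
    """Build a Session plus the 'search' form payload.

    The field names mirror the form fields used in the scraper below;
    session.post(SEARCH_URL, data=payload) would submit the search.
    """
    session = requests.Session()
    payload = {
        'doWhat': 'teamSearch',
        'searchOrg': 'X',       # presumably "all schools"
        'academicYear': 2006,   # the 2005-2006 season
        'searchSport': 'MFB',   # men's football
        'searchDiv': 1,         # Division I
    }
    return session, payload

session, payload = make_search_session()
print(sorted(payload))
```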



          import requests
          from bs4 import BeautifulSoup
          import pandas as pd

          end_list = []
          s = requests.Session()
          URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"
          data = {'doWhat': 'teamSearch', 'searchOrg': 'X', 'academicYear': 2006,
                  'searchSport': 'MFB', 'searchDiv': 1}
          r = s.post(URL, data=data)
          soup = BeautifulSoup(r.text, 'html.parser')
          area_list = soup.find_all('table')[8].find_all('tr')
          # number of areas plus one tr for 'Total Results of Search: 239'
          area_count = len(area_list)
          for idx in range(area_count):
              data = {
                  'sortOn': 0,
                  'doWhat': 'showIdx',
                  'playerId': '', 'coachId': '',
                  'orgId': '',
                  'academicYear': '',
                  'division': '',
                  'sportCode': '',
                  'idx': idx
              }
              r = s.post(URL, data=data)
              soup = BeautifulSoup(r.text, 'html.parser')
              last_table = soup.find_all('table')[-1]  # the school links are in the last table
              for tr in last_table.find_all('tr'):
                  link_td = tr.find('td', class_="text")
                  try:
                      link_a = link_td.find('a')['href']
                      # arguments of the javascript: href, between '(' and ');'
                      data_params = link_a.split('(')[1][:-2].split(',')
                      try:
                          sports_code = data_params[2].replace("'", "").strip()
                          division = int(data_params[3])
                          player_coach_id = int(data_params[0])
                          academic_year = int(data_params[1])
                          org_id = int(data_params[4])
                          data = {
                              'sortOn': 0,
                              'doWhat': 'display',
                              'playerId': player_coach_id,
                              'coachId': player_coach_id,
                              'orgId': org_id,
                              'academicYear': academic_year,
                              'division': division,
                              'sportCode': sports_code,
                              'idx': ''
                          }
                          url = 'http://web1.ncaa.org/stats/StatsSrv/careerteam'
                          r = s.post(url, data=data)
                          soup2 = BeautifulSoup(r.text, 'html.parser')
                          institution_name = soup2.find_all('table')[1].find_all('tr')[2].find_all('td')[1].text.strip()
                          capacity = soup2.find_all('table')[4].find_all('tr')[2].find_all('td')[1].text.strip()
                          end_list.append([institution_name, capacity])
                      except IndexError:
                          pass
                  except AttributeError:
                      pass

          headers = ['School', 'Capacity']
          df = pd.DataFrame(end_list, columns=headers)
          print(df)


          Output



                          School Capacity
          0 Air Force 46,692
          1 Akron 30,000
          2 Alabama 101,821
          3 Alabama A&M 21,000
          4 Alabama St. 26,500
          5 Albany (NY) 8,500
          6 Alcorn 22,500
          7 Appalachian St. 30,000
          8 Arizona 55,675
          9 Arizona St. 64,248
          10 Ark.-Pine Bluff 14,500
          11 Arkansas 72,000
          12 Arkansas St. 30,708
          13 Army West Point 38,000
          14 Auburn 87,451
          15 Austin Peay 10,000
          16 BYU 63,470
          17 Ball St. 22,500
          18 Baylor 45,140
          19 Bethune-Cookman 9,601
          20 Boise St. 36,387
          21 Boston College 44,500
          22 Bowling Green 24,000
          23 Brown 20,000
          24 Bucknell 13,100
          25 Buffalo 29,013
          26 Butler 5,647
          27 Cal Poly 11,075
          28 California 62,467
          29 Central Conn. St. 5,500
          .. ... ...
          209 UCLA 91,136
          210 UConn 40,000
          211 UNI 16,324
          212 UNLV 36,800
          213 UT Martin 7,500
          214 UTEP 52,000
          215 Utah 45,807
          216 Utah St. 25,100
          217 VMI 10,000
          218 Valparaiso 5,000
          219 Vanderbilt 40,350
          220 Villanova 12,000
          221 Virginia 61,500
          222 Virginia Tech 65,632
          223 Wagner 3,300
          224 Wake Forest 31,500
          225 Washington 70,138
          226 Washington St. 32,740
          227 Weber St. 17,500
          228 West Virginia 60,000
          229 Western Caro. 13,742
          230 Western Ill. 16,368
          231 Western Ky. 22,113
          232 Western Mich. 30,200
          233 William & Mary 12,400
          234 Wisconsin 80,321
          235 Wofford 13,000
          236 Wyoming 29,181
          237 Yale 64,269
          238 Youngstown St. 20,630

          [239 rows x 2 columns]
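The Capacity values come back as comma-formatted strings. If you want to sort or sum them, one way (column names as in the DataFrame above) is:

```python
import pandas as pd

# Two rows in the same shape as the scraped result.
df = pd.DataFrame([['Air Force', '46,692'], ['Akron', '30,000']],
                  columns=['School', 'Capacity'])

# Drop the thousands separators and convert to integers so the column
# can be sorted and aggregated numerically.
df['Capacity'] = df['Capacity'].str.replace(',', '', regex=False).astype(int)

print(df['Capacity'].sum())  # → 76692
```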


          Note:
          This will take a long time. We are scraping more than 239 pages, so be patient; it might take 15 minutes or longer.
































          • Would it be possible for you to resolve this query: stackoverflow.com/questions/54279547/… I cannot find Params under the Network data for this query

            – Data_is_Power
            Jan 20 at 23:27











          • @Data_is_Power I will try. I am not at home right now. I will let you know ASAP.

            – Bitto Bennichan
            Jan 20 at 23:29











          • Great. Thank you so much again! :)

            – Data_is_Power
            Jan 20 at 23:31











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54263740%2fhow-to-parse-nested-table-from-html-link-using-beautifulsoup-in-python%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1















          My question is Is something like this possible ?




          Yes.




          If yes,how ?




          There is a lot going in the code below. But the main point is to figure out the post requests being made by the browser and then emulate that using Requests. We can find out the request being made through the "network" tab in the inspect tool.



          First we make the 'search' post request. This gives a left and right table. Clicking on the left table gives us the schools in that area. But if we observe carefully clicking on the area link also is a post request (which we have to do using requests)



          Eg. Clicking on 'Air Force - Eastern Ill.' gives us a table containing the links of schools in that area. Then we have to go to that school link and figure out the capacity.



          Since clicking on each of the school link is also a post request we have to emulate and this returns the school page. From here we scrap the school name and capacity.



          You can read Advanced Usage of requests to know about Session objects, Making a request to read about making request with Requests.



          import requests
          from bs4 import BeautifulSoup
          import pandas as pd
          end_list=
          s = requests.Session()
          URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"
          data={'doWhat': 'teamSearch','searchOrg': 'X', 'academicYear': 2006, 'searchSport':'MFB','searchDiv': 1}
          r = s.post(URL,data=data)
          soup=BeautifulSoup(r.text,'html.parser')
          area_list=soup.find_all('table')[8].find_all('tr')
          area_count=len(area_list)#has no of areas + 1 tr 'Total Results of Search: 239'
          for idx in range(0,area_count):
          data={
          'sortOn': 0,
          'doWhat': 'showIdx',
          'playerId':'' ,'coachId': '',
          'orgId':'' ,
          'academicYear':'' ,
          'division':'' ,
          'sportCode':'' ,
          'idx': idx
          }
          r = s.post(URL,data=data)
          soup=BeautifulSoup(r.text,'html.parser')
          last_table=soup.find_all('table')[-1]#last table
          for tr in last_table.find_all('tr'):
          link_td=tr.find('td',class_="text")
          try:
          link_a=link_td.find('a')['href']
          data_params=link_a.split('(')[1][:-2].split(',')
          try:
          #print(data_params)
          sports_code=data_params[2].replace("'","").strip()
          division=int(data_params[3])
          player_coach_id=int(data_params[0])
          academic_year=int(data_params[1])
          org_id=int(data_params[4])
          #print(sports_code,division,player_coach_id,academic_year,org_id)
          data={
          'sortOn': 0,
          'doWhat': 'display',
          'playerId': player_coach_id,
          'coachId': player_coach_id,
          'orgId': org_id,
          'academicYear': academic_year,
          'division':division,
          'sportCode':sports_code,
          'idx':''
          }
          url='http://web1.ncaa.org/stats/StatsSrv/careerteam'
          r = s.post(url,data=data)
          soup2=BeautifulSoup(r.text,'html.parser')
          institution_name=soup2.find_all('table')[1].find_all('tr')[2].find_all('td')[1].text.strip()
          capacity=soup2.find_all('table')[4].find_all('tr')[2].find_all('td')[1].text.strip()
          #print([institution_name, capacity])
          end_list.append([institution_name, capacity])

          except IndexError:
          pass

          except AttributeError:
          pass
          #print(end_list)
          headers=['School','Capacity']
          df=pd.DataFrame(end_list, columns=headers)
          print(df)


          Output



                          School Capacity
          0 Air Force 46,692
          1 Akron 30,000
          2 Alabama 101,821
          3 Alabama A&M; 21,000
          4 Alabama St. 26,500
          5 Albany (NY) 8,500
          6 Alcorn 22,500
          7 Appalachian St. 30,000
          8 Arizona 55,675
          9 Arizona St. 64,248
          10 Ark.-Pine Bluff 14,500
          11 Arkansas 72,000
          12 Arkansas St. 30,708
          13 Army West Point 38,000
          14 Auburn 87,451
          15 Austin Peay 10,000
          16 BYU 63,470
          17 Ball St. 22,500
          18 Baylor 45,140
          19 Bethune-Cookman 9,601
          20 Boise St. 36,387
          21 Boston College 44,500
          22 Bowling Green 24,000
          23 Brown 20,000
          24 Bucknell 13,100
          25 Buffalo 29,013
          26 Butler 5,647
          27 Cal Poly 11,075
          28 California 62,467
          29 Central Conn. St. 5,500
          .. ... ...
          209 UCLA 91,136
          210 UConn 40,000
          211 UNI 16,324
          212 UNLV 36,800
          213 UT Martin 7,500
          214 UTEP 52,000
          215 Utah 45,807
          216 Utah St. 25,100
          217 VMI 10,000
          218 Valparaiso 5,000
          219 Vanderbilt 40,350
          220 Villanova 12,000
          221 Virginia 61,500
          222 Virginia Tech 65,632
          223 Wagner 3,300
          224 Wake Forest 31,500
          225 Washington 70,138
          226 Washington St. 32,740
          227 Weber St. 17,500
          228 West Virginia 60,000
          229 Western Caro. 13,742
          230 Western Ill. 16,368
          231 Western Ky. 22,113
          232 Western Mich. 30,200
          233 William & Mary 12,400
          234 Wisconsin 80,321
          235 Wofford 13,000
          236 Wyoming 29,181
          237 Yale 64,269
          238 Youngstown St. 20,630

          [239 rows x 2 columns]


          Note:
          This will take a long time. We are scrapping >239 pages. So be patient. Might take 15 mins or longer.






          share|improve this answer


























          • Would it be possible for you to resolve this query :stackoverflow.com/questions/54279547/… I CANNOT find Params under Network data for this query

            – Data_is_Power
            Jan 20 at 23:27











          • @Data_is_Power I will try. I am not at home right now. I will let you know ASAP.

            – Bitto Bennichan
            Jan 20 at 23:29











          • Great. Thank you so much again! :)

            – Data_is_Power
            Jan 20 at 23:31
















          1















          My question is Is something like this possible ?




          Yes.




          If yes,how ?




          There is a lot going in the code below. But the main point is to figure out the post requests being made by the browser and then emulate that using Requests. We can find out the request being made through the "network" tab in the inspect tool.



          First we make the 'search' post request. This gives a left and right table. Clicking on the left table gives us the schools in that area. But if we observe carefully clicking on the area link also is a post request (which we have to do using requests)



          Eg. Clicking on 'Air Force - Eastern Ill.' gives us a table containing the links of schools in that area. Then we have to go to that school link and figure out the capacity.



          Since clicking on each of the school link is also a post request we have to emulate and this returns the school page. From here we scrap the school name and capacity.



          You can read Advanced Usage of requests to know about Session objects, Making a request to read about making request with Requests.



          import requests
          from bs4 import BeautifulSoup
          import pandas as pd
          end_list=
          s = requests.Session()
          URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"
          data={'doWhat': 'teamSearch','searchOrg': 'X', 'academicYear': 2006, 'searchSport':'MFB','searchDiv': 1}
          r = s.post(URL,data=data)
          soup=BeautifulSoup(r.text,'html.parser')
          area_list=soup.find_all('table')[8].find_all('tr')
          area_count=len(area_list)#has no of areas + 1 tr 'Total Results of Search: 239'
          for idx in range(0,area_count):
          data={
          'sortOn': 0,
          'doWhat': 'showIdx',
          'playerId':'' ,'coachId': '',
          'orgId':'' ,
          'academicYear':'' ,
          'division':'' ,
          'sportCode':'' ,
          'idx': idx
          }
          r = s.post(URL,data=data)
          soup=BeautifulSoup(r.text,'html.parser')
          last_table=soup.find_all('table')[-1]#last table
          for tr in last_table.find_all('tr'):
          link_td=tr.find('td',class_="text")
          try:
          link_a=link_td.find('a')['href']
          data_params=link_a.split('(')[1][:-2].split(',')
          try:
          #print(data_params)
          sports_code=data_params[2].replace("'","").strip()
          division=int(data_params[3])
          player_coach_id=int(data_params[0])
          academic_year=int(data_params[1])
          org_id=int(data_params[4])
          #print(sports_code,division,player_coach_id,academic_year,org_id)
          data={
          'sortOn': 0,
          'doWhat': 'display',
          'playerId': player_coach_id,
          'coachId': player_coach_id,
          'orgId': org_id,
          'academicYear': academic_year,
          'division':division,
          'sportCode':sports_code,
          'idx':''
          }
          url='http://web1.ncaa.org/stats/StatsSrv/careerteam'
          r = s.post(url,data=data)
          soup2=BeautifulSoup(r.text,'html.parser')
          institution_name=soup2.find_all('table')[1].find_all('tr')[2].find_all('td')[1].text.strip()
          capacity=soup2.find_all('table')[4].find_all('tr')[2].find_all('td')[1].text.strip()
          #print([institution_name, capacity])
          end_list.append([institution_name, capacity])

          except IndexError:
          pass

          except AttributeError:
          pass
          #print(end_list)
          headers=['School','Capacity']
          df=pd.DataFrame(end_list, columns=headers)
          print(df)


          Output



                          School Capacity
          0 Air Force 46,692
          1 Akron 30,000
          2 Alabama 101,821
          3 Alabama A&M; 21,000
          4 Alabama St. 26,500
          5 Albany (NY) 8,500
          6 Alcorn 22,500
          7 Appalachian St. 30,000
          8 Arizona 55,675
          9 Arizona St. 64,248
          10 Ark.-Pine Bluff 14,500
          11 Arkansas 72,000
          12 Arkansas St. 30,708
          13 Army West Point 38,000
          14 Auburn 87,451
          15 Austin Peay 10,000
          16 BYU 63,470
          17 Ball St. 22,500
          18 Baylor 45,140
          19 Bethune-Cookman 9,601
          20 Boise St. 36,387
          21 Boston College 44,500
          22 Bowling Green 24,000
          23 Brown 20,000
          24 Bucknell 13,100
          25 Buffalo 29,013
          26 Butler 5,647
          27 Cal Poly 11,075
          28 California 62,467
          29 Central Conn. St. 5,500
          .. ... ...
          209 UCLA 91,136
          210 UConn 40,000
          211 UNI 16,324
          212 UNLV 36,800
          213 UT Martin 7,500
          214 UTEP 52,000
          215 Utah 45,807
          216 Utah St. 25,100
          217 VMI 10,000
          218 Valparaiso 5,000
          219 Vanderbilt 40,350
          220 Villanova 12,000
          221 Virginia 61,500
          222 Virginia Tech 65,632
          223 Wagner 3,300
          224 Wake Forest 31,500
          225 Washington 70,138
          226 Washington St. 32,740
          227 Weber St. 17,500
          228 West Virginia 60,000
          229 Western Caro. 13,742
          230 Western Ill. 16,368
          231 Western Ky. 22,113
          232 Western Mich. 30,200
          233 William & Mary 12,400
          234 Wisconsin 80,321
          235 Wofford 13,000
          236 Wyoming 29,181
          237 Yale 64,269
          238 Youngstown St. 20,630

          [239 rows x 2 columns]


          Note:
          This will take a long time. We are scrapping >239 pages. So be patient. Might take 15 mins or longer.






          share|improve this answer


























          • Would it be possible for you to resolve this query :stackoverflow.com/questions/54279547/… I CANNOT find Params under Network data for this query

            – Data_is_Power
            Jan 20 at 23:27











          • @Data_is_Power I will try. I am not at home right now. I will let you know ASAP.

            – Bitto Bennichan
            Jan 20 at 23:29











          • Great. Thank you so much again! :)

            – Data_is_Power
            Jan 20 at 23:31














          1












          1








          1








          My question is Is something like this possible ?




          Yes.




          If yes,how ?




          There is a lot going in the code below. But the main point is to figure out the post requests being made by the browser and then emulate that using Requests. We can find out the request being made through the "network" tab in the inspect tool.



          First we make the 'search' post request. This gives a left and right table. Clicking on the left table gives us the schools in that area. But if we observe carefully clicking on the area link also is a post request (which we have to do using requests)



          Eg. Clicking on 'Air Force - Eastern Ill.' gives us a table containing the links of schools in that area. Then we have to go to that school link and figure out the capacity.



          Since clicking on each of the school link is also a post request we have to emulate and this returns the school page. From here we scrap the school name and capacity.



          You can read Advanced Usage of requests to know about Session objects, Making a request to read about making request with Requests.



          import requests
          from bs4 import BeautifulSoup
          import pandas as pd
          end_list=
          s = requests.Session()
          URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"
          data={'doWhat': 'teamSearch','searchOrg': 'X', 'academicYear': 2006, 'searchSport':'MFB','searchDiv': 1}
          r = s.post(URL,data=data)
          soup=BeautifulSoup(r.text,'html.parser')
          area_list=soup.find_all('table')[8].find_all('tr')
          area_count=len(area_list)#has no of areas + 1 tr 'Total Results of Search: 239'
          for idx in range(0,area_count):
          data={
          'sortOn': 0,
          'doWhat': 'showIdx',
          'playerId':'' ,'coachId': '',
          'orgId':'' ,
          'academicYear':'' ,
          'division':'' ,
          'sportCode':'' ,
          'idx': idx
          }
          r = s.post(URL,data=data)
          soup=BeautifulSoup(r.text,'html.parser')
          last_table=soup.find_all('table')[-1]#last table
          for tr in last_table.find_all('tr'):
          link_td=tr.find('td',class_="text")
          try:
          link_a=link_td.find('a')['href']
          data_params=link_a.split('(')[1][:-2].split(',')
          try:
          #print(data_params)
          sports_code=data_params[2].replace("'","").strip()
          division=int(data_params[3])
          player_coach_id=int(data_params[0])
          academic_year=int(data_params[1])
          org_id=int(data_params[4])
          #print(sports_code,division,player_coach_id,academic_year,org_id)
          data={
          'sortOn': 0,
          'doWhat': 'display',
          'playerId': player_coach_id,
          'coachId': player_coach_id,
          'orgId': org_id,
          'academicYear': academic_year,
          'division':division,
          'sportCode':sports_code,
          'idx':''
          }
          url='http://web1.ncaa.org/stats/StatsSrv/careerteam'
          r = s.post(url,data=data)
          soup2=BeautifulSoup(r.text,'html.parser')
          institution_name=soup2.find_all('table')[1].find_all('tr')[2].find_all('td')[1].text.strip()
My question is: is something like this possible?




          Yes.




If yes, how?




There is a lot going on in the code below, but the main point is to figure out the POST requests the browser makes and then emulate them using Requests. We can see each request being made in the "Network" tab of the browser's developer tools.



First we make the 'search' POST request. This returns a left and a right table. Clicking an entry in the left table lists the schools in that area. If we observe carefully, clicking on the area link is itself another POST request, which we also have to reproduce with Requests.



E.g. clicking on 'Air Force - Eastern Ill.' gives us a table containing links to the schools in that area. We then have to follow each school link and extract the capacity.



Clicking each of the school links is yet another POST request we have to emulate; it returns the school's page, from which we scrape the school name and stadium capacity.
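Those school links are JavaScript calls rather than plain URLs, so the POST parameters have to be pulled out of the href string. A minimal sketch of that extraction (the exact href value below is an illustrative assumption, not copied from the page):

```python
# Example href in the shape the code below expects (sample value, not real data)
href = "javascript:doSearch(30123,2006,'MFB',1,721);"

# Take the text between '(' and ');', then split on commas
params = href.split('(')[1][:-2].split(',')

player_coach_id = int(params[0])                    # 30123
academic_year = int(params[1])                      # 2006
sports_code = params[2].replace("'", "").strip()    # 'MFB'
division = int(params[3])                           # 1
org_id = int(params[4])                             # 721
```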



You can read Advanced Usage of Requests to learn about Session objects, and Making a request to learn how to make requests with Requests.



          import requests
          from bs4 import BeautifulSoup
          import pandas as pd

          end_list = []
          s = requests.Session()
          URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"
          data = {'doWhat': 'teamSearch', 'searchOrg': 'X', 'academicYear': 2006,
                  'searchSport': 'MFB', 'searchDiv': 1}
          r = s.post(URL, data=data)
          soup = BeautifulSoup(r.text, 'html.parser')
          area_list = soup.find_all('table')[8].find_all('tr')
          # number of areas + 1 extra tr for 'Total Results of Search: 239'
          area_count = len(area_list)
          for idx in range(0, area_count):
              data = {
                  'sortOn': 0,
                  'doWhat': 'showIdx',
                  'playerId': '', 'coachId': '',
                  'orgId': '',
                  'academicYear': '',
                  'division': '',
                  'sportCode': '',
                  'idx': idx
              }
              r = s.post(URL, data=data)
              soup = BeautifulSoup(r.text, 'html.parser')
              last_table = soup.find_all('table')[-1]  # last table holds the school links
              for tr in last_table.find_all('tr'):
                  link_td = tr.find('td', class_="text")
                  try:
                      link_a = link_td.find('a')['href']
                      # pull the POST parameters out of the JavaScript href
                      data_params = link_a.split('(')[1][:-2].split(',')
                      try:
                          sports_code = data_params[2].replace("'", "").strip()
                          division = int(data_params[3])
                          player_coach_id = int(data_params[0])
                          academic_year = int(data_params[1])
                          org_id = int(data_params[4])
                          data = {
                              'sortOn': 0,
                              'doWhat': 'display',
                              'playerId': player_coach_id,
                              'coachId': player_coach_id,
                              'orgId': org_id,
                              'academicYear': academic_year,
                              'division': division,
                              'sportCode': sports_code,
                              'idx': ''
                          }
                          url = 'http://web1.ncaa.org/stats/StatsSrv/careerteam'
                          r = s.post(url, data=data)
                          soup2 = BeautifulSoup(r.text, 'html.parser')
                          institution_name = soup2.find_all('table')[1].find_all('tr')[2].find_all('td')[1].text.strip()
                          capacity = soup2.find_all('table')[4].find_all('tr')[2].find_all('td')[1].text.strip()
                          end_list.append([institution_name, capacity])
                      except IndexError:
                          pass
                  except AttributeError:
                      pass

          headers = ['School', 'Capacity']
          df = pd.DataFrame(end_list, columns=headers)
          print(df)
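Since the question asks for a pandas DataFrame, note that Capacity is scraped as a string with thousands separators. If you want it numeric (for sorting or arithmetic), one way, using the same column names as above with a small sample of the scraped rows, is:

```python
import pandas as pd

# Sample rows in the same shape as end_list above
df = pd.DataFrame([['Alabama', '101,821'], ['Akron', '30,000']],
                  columns=['School', 'Capacity'])

# Strip the thousands separator and cast to int
df['Capacity'] = df['Capacity'].str.replace(',', '', regex=False).astype(int)

print(df.sort_values('Capacity', ascending=False))
```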


          Output



                          School Capacity
          0 Air Force 46,692
          1 Akron 30,000
          2 Alabama 101,821
          3 Alabama A&M 21,000
          4 Alabama St. 26,500
          5 Albany (NY) 8,500
          6 Alcorn 22,500
          7 Appalachian St. 30,000
          8 Arizona 55,675
          9 Arizona St. 64,248
          10 Ark.-Pine Bluff 14,500
          11 Arkansas 72,000
          12 Arkansas St. 30,708
          13 Army West Point 38,000
          14 Auburn 87,451
          15 Austin Peay 10,000
          16 BYU 63,470
          17 Ball St. 22,500
          18 Baylor 45,140
          19 Bethune-Cookman 9,601
          20 Boise St. 36,387
          21 Boston College 44,500
          22 Bowling Green 24,000
          23 Brown 20,000
          24 Bucknell 13,100
          25 Buffalo 29,013
          26 Butler 5,647
          27 Cal Poly 11,075
          28 California 62,467
          29 Central Conn. St. 5,500
          .. ... ...
          209 UCLA 91,136
          210 UConn 40,000
          211 UNI 16,324
          212 UNLV 36,800
          213 UT Martin 7,500
          214 UTEP 52,000
          215 Utah 45,807
          216 Utah St. 25,100
          217 VMI 10,000
          218 Valparaiso 5,000
          219 Vanderbilt 40,350
          220 Villanova 12,000
          221 Virginia 61,500
          222 Virginia Tech 65,632
          223 Wagner 3,300
          224 Wake Forest 31,500
          225 Washington 70,138
          226 Washington St. 32,740
          227 Weber St. 17,500
          228 West Virginia 60,000
          229 Western Caro. 13,742
          230 Western Ill. 16,368
          231 Western Ky. 22,113
          232 Western Mich. 30,200
          233 William & Mary 12,400
          234 Wisconsin 80,321
          235 Wofford 13,000
          236 Wyoming 29,181
          237 Yale 64,269
          238 Youngstown St. 20,630

          [239 rows x 2 columns]


          Note:
          This will take a long time. We are scraping 239+ pages, so be patient; it might take 15 minutes or longer.
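One way to be gentler on the server during that long run is to pause briefly before each request. A sketch using a hypothetical `polite_post` wrapper (not part of the original code; the delay value is arbitrary):

```python
import time

def polite_post(post_func, url, data, delay=0.5):
    # Sleep before each request so we don't hammer the server
    time.sleep(delay)
    return post_func(url, data=data)

# In the loop above you would call:
#   r = polite_post(s.post, URL, data)
# Demo with a stand-in for requests.Session().post:
result = polite_post(lambda url, data: ('ok', url), 'http://example.com', {}, delay=0.1)
print(result)
```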







          edited Jan 19 at 6:30

























          answered Jan 19 at 6:12









          Bitto Bennichan

          2,1181120

















          • Would it be possible for you to resolve this query: stackoverflow.com/questions/54279547/… I CANNOT find Params under Network data for this query

            – Data_is_Power
            Jan 20 at 23:27











          • @Data_is_Power I will try. I am not at home right now. I will let you know ASAP.

            – Bitto Bennichan
            Jan 20 at 23:29











          • Great. Thank you so much again! :)

            – Data_is_Power
            Jan 20 at 23:31



































