How to parse nested table from HTML link using BeautifulSoup in Python?



























All,



I am trying to parse a table from this link: http://web1.ncaa.org/stats/StatsSrv/careersearch.
Please note: for searching under "School/Sport Search", select School: All, Year: 2005-2006, Sport: Football, Division: I. The column I am trying to parse is the school names; if you click on a school name, more information is shown. From that linked table I would like to parse the "Stadium Capacity" for each and every school. My question is: is something like this possible? If yes, how? I am new to Python and BeautifulSoup, so if you can provide an explanation that would be great!



Note: there are 239 results.



To summarize: basically I would like to parse the school names along with their stadium capacities and convert the result into a pandas DataFrame.



import requests
from bs4 import BeautifulSoup

URL = "http://web1.ncaa.org/stats/StatsSrv/careerteam"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())




































      python-3.x pandas beautifulsoup html-parsing html-parser






      asked Jan 19 at 3:09









      Data_is_Power

          1 Answer






          My question is: is something like this possible?




          Yes.




          If yes, how?




          There is a lot going on in the code below, but the main point is to figure out the POST requests the browser makes and then emulate them using Requests. We can see each request being made through the "Network" tab of the browser's inspect tool.



          First we make the 'search' POST request. This returns a left and a right table. Clicking an entry in the left table gives us the schools in that area. If we observe carefully, clicking on an area link is itself a POST request, which we have to reproduce with Requests.



          E.g., clicking on 'Air Force - Eastern Ill.' gives us a table containing links to the schools in that area. We then have to follow each school link and extract the capacity.
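Those school links are javascript: hrefs whose arguments carry the IDs needed for the next POST. A minimal sketch of pulling the arguments out of such a href; the function name and values here are made up for illustration, only the split logic mirrors what the code below does:

```python
# Hypothetical href, modeled on the pages' javascript: links: everything
# between '(' and the trailing ');' is a comma-separated argument list.
href = "javascript:showTeamResults(30123, 2006, 'MFB', 1, 721);"

# Take the text after the first '(', drop the trailing ');',
# then split on commas.
data_params = href.split('(')[1][:-2].split(',')

player_coach_id = int(data_params[0])
academic_year = int(data_params[1])
sports_code = data_params[2].replace("'", "").strip()
division = int(data_params[3])
org_id = int(data_params[4])

print(player_coach_id, academic_year, sports_code, division, org_id)
```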



          Clicking on each school link is also a POST request we have to emulate; it returns the school page, from which we scrape the school name and stadium capacity.



          You can read Advanced Usage of Requests to learn about Session objects, and Making a request to learn about making requests with Requests.
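A Session keeps any cookies the server sets across requests, so the follow-up POSTs behave like clicks in the same browser tab. A minimal sketch of setting one up with the search payload used below; no request is actually sent here, and the helper name is just for illustration:

```python
import requests

SEARCH_URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"

def make_search_session():
    """Build a Session plus the 'search' form payload.

    The field names mirror the form fields used in the scraper below;
    session.post(SEARCH_URL, data=payload) would submit the search.
    """
    session = requests.Session()
    payload = {
        'doWhat': 'teamSearch',
        'searchOrg': 'X',       # presumably "all schools"
        'academicYear': 2006,   # the 2005-2006 season
        'searchSport': 'MFB',   # men's football
        'searchDiv': 1,         # Division I
    }
    return session, payload

session, payload = make_search_session()
print(sorted(payload))
```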



          import requests
          from bs4 import BeautifulSoup
          import pandas as pd

          end_list = []
          s = requests.Session()
          URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"
          data = {'doWhat': 'teamSearch', 'searchOrg': 'X', 'academicYear': 2006,
                  'searchSport': 'MFB', 'searchDiv': 1}
          r = s.post(URL, data=data)
          soup = BeautifulSoup(r.text, 'html.parser')
          area_list = soup.find_all('table')[8].find_all('tr')
          # number of areas plus one tr for 'Total Results of Search: 239'
          area_count = len(area_list)
          for idx in range(area_count):
              data = {
                  'sortOn': 0,
                  'doWhat': 'showIdx',
                  'playerId': '', 'coachId': '',
                  'orgId': '',
                  'academicYear': '',
                  'division': '',
                  'sportCode': '',
                  'idx': idx
              }
              r = s.post(URL, data=data)
              soup = BeautifulSoup(r.text, 'html.parser')
              last_table = soup.find_all('table')[-1]  # the school links are in the last table
              for tr in last_table.find_all('tr'):
                  link_td = tr.find('td', class_="text")
                  try:
                      link_a = link_td.find('a')['href']
                      # arguments of the javascript: href, between '(' and ');'
                      data_params = link_a.split('(')[1][:-2].split(',')
                      try:
                          sports_code = data_params[2].replace("'", "").strip()
                          division = int(data_params[3])
                          player_coach_id = int(data_params[0])
                          academic_year = int(data_params[1])
                          org_id = int(data_params[4])
                          data = {
                              'sortOn': 0,
                              'doWhat': 'display',
                              'playerId': player_coach_id,
                              'coachId': player_coach_id,
                              'orgId': org_id,
                              'academicYear': academic_year,
                              'division': division,
                              'sportCode': sports_code,
                              'idx': ''
                          }
                          url = 'http://web1.ncaa.org/stats/StatsSrv/careerteam'
                          r = s.post(url, data=data)
                          soup2 = BeautifulSoup(r.text, 'html.parser')
                          institution_name = soup2.find_all('table')[1].find_all('tr')[2].find_all('td')[1].text.strip()
                          capacity = soup2.find_all('table')[4].find_all('tr')[2].find_all('td')[1].text.strip()
                          end_list.append([institution_name, capacity])
                      except IndexError:
                          pass
                  except AttributeError:
                      pass

          headers = ['School', 'Capacity']
          df = pd.DataFrame(end_list, columns=headers)
          print(df)


          Output



                          School Capacity
          0 Air Force 46,692
          1 Akron 30,000
          2 Alabama 101,821
          3 Alabama A&M 21,000
          4 Alabama St. 26,500
          5 Albany (NY) 8,500
          6 Alcorn 22,500
          7 Appalachian St. 30,000
          8 Arizona 55,675
          9 Arizona St. 64,248
          10 Ark.-Pine Bluff 14,500
          11 Arkansas 72,000
          12 Arkansas St. 30,708
          13 Army West Point 38,000
          14 Auburn 87,451
          15 Austin Peay 10,000
          16 BYU 63,470
          17 Ball St. 22,500
          18 Baylor 45,140
          19 Bethune-Cookman 9,601
          20 Boise St. 36,387
          21 Boston College 44,500
          22 Bowling Green 24,000
          23 Brown 20,000
          24 Bucknell 13,100
          25 Buffalo 29,013
          26 Butler 5,647
          27 Cal Poly 11,075
          28 California 62,467
          29 Central Conn. St. 5,500
          .. ... ...
          209 UCLA 91,136
          210 UConn 40,000
          211 UNI 16,324
          212 UNLV 36,800
          213 UT Martin 7,500
          214 UTEP 52,000
          215 Utah 45,807
          216 Utah St. 25,100
          217 VMI 10,000
          218 Valparaiso 5,000
          219 Vanderbilt 40,350
          220 Villanova 12,000
          221 Virginia 61,500
          222 Virginia Tech 65,632
          223 Wagner 3,300
          224 Wake Forest 31,500
          225 Washington 70,138
          226 Washington St. 32,740
          227 Weber St. 17,500
          228 West Virginia 60,000
          229 Western Caro. 13,742
          230 Western Ill. 16,368
          231 Western Ky. 22,113
          232 Western Mich. 30,200
          233 William & Mary 12,400
          234 Wisconsin 80,321
          235 Wofford 13,000
          236 Wyoming 29,181
          237 Yale 64,269
          238 Youngstown St. 20,630

          [239 rows x 2 columns]
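The Capacity values come back as comma-formatted strings. If you want to sort or sum them, one way (column names as in the DataFrame above) is:

```python
import pandas as pd

# Two rows in the same shape as the scraped result.
df = pd.DataFrame([['Air Force', '46,692'], ['Akron', '30,000']],
                  columns=['School', 'Capacity'])

# Drop the thousands separators and convert to integers so the column
# can be sorted and aggregated numerically.
df['Capacity'] = df['Capacity'].str.replace(',', '', regex=False).astype(int)

print(df['Capacity'].sum())  # → 76692
```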


          Note:
          This will take a long time. We are scraping more than 239 pages, so be patient; it might take 15 minutes or longer.
































          • Would it be possible for you to resolve this query: stackoverflow.com/questions/54279547/… I cannot find Params under the Network data for this query

            – Data_is_Power
            Jan 20 at 23:27











          • @Data_is_Power I will try. I am not at home right now. I will let you know ASAP.

            – Bitto Bennichan
            Jan 20 at 23:29











          • Great. Thank you so much again! :)

            – Data_is_Power
            Jan 20 at 23:31











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54263740%2fhow-to-parse-nested-table-from-html-link-using-beautifulsoup-in-python%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1















          My question is Is something like this possible ?




          Yes.




          If yes,how ?




          There is a lot going in the code below. But the main point is to figure out the post requests being made by the browser and then emulate that using Requests. We can find out the request being made through the "network" tab in the inspect tool.



          First we make the 'search' post request. This gives a left and right table. Clicking on the left table gives us the schools in that area. But if we observe carefully clicking on the area link also is a post request (which we have to do using requests)



          Eg. Clicking on 'Air Force - Eastern Ill.' gives us a table containing the links of schools in that area. Then we have to go to that school link and figure out the capacity.



          Since clicking on each of the school link is also a post request we have to emulate and this returns the school page. From here we scrap the school name and capacity.



          You can read Advanced Usage of requests to know about Session objects, Making a request to read about making request with Requests.



          import requests
          from bs4 import BeautifulSoup
          import pandas as pd
          end_list=
          s = requests.Session()
          URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"
          data={'doWhat': 'teamSearch','searchOrg': 'X', 'academicYear': 2006, 'searchSport':'MFB','searchDiv': 1}
          r = s.post(URL,data=data)
          soup=BeautifulSoup(r.text,'html.parser')
          area_list=soup.find_all('table')[8].find_all('tr')
          area_count=len(area_list)#has no of areas + 1 tr 'Total Results of Search: 239'
          for idx in range(0,area_count):
          data={
          'sortOn': 0,
          'doWhat': 'showIdx',
          'playerId':'' ,'coachId': '',
          'orgId':'' ,
          'academicYear':'' ,
          'division':'' ,
          'sportCode':'' ,
          'idx': idx
          }
          r = s.post(URL,data=data)
          soup=BeautifulSoup(r.text,'html.parser')
          last_table=soup.find_all('table')[-1]#last table
          for tr in last_table.find_all('tr'):
          link_td=tr.find('td',class_="text")
          try:
          link_a=link_td.find('a')['href']
          data_params=link_a.split('(')[1][:-2].split(',')
          try:
          #print(data_params)
          sports_code=data_params[2].replace("'","").strip()
          division=int(data_params[3])
          player_coach_id=int(data_params[0])
          academic_year=int(data_params[1])
          org_id=int(data_params[4])
          #print(sports_code,division,player_coach_id,academic_year,org_id)
          data={
          'sortOn': 0,
          'doWhat': 'display',
          'playerId': player_coach_id,
          'coachId': player_coach_id,
          'orgId': org_id,
          'academicYear': academic_year,
          'division':division,
          'sportCode':sports_code,
          'idx':''
          }
          url='http://web1.ncaa.org/stats/StatsSrv/careerteam'
          r = s.post(url,data=data)
          soup2=BeautifulSoup(r.text,'html.parser')
          institution_name=soup2.find_all('table')[1].find_all('tr')[2].find_all('td')[1].text.strip()
          capacity=soup2.find_all('table')[4].find_all('tr')[2].find_all('td')[1].text.strip()
          #print([institution_name, capacity])
          end_list.append([institution_name, capacity])

          except IndexError:
          pass

          except AttributeError:
          pass
          #print(end_list)
          headers=['School','Capacity']
          df=pd.DataFrame(end_list, columns=headers)
          print(df)


          Output



                          School Capacity
          0 Air Force 46,692
          1 Akron 30,000
          2 Alabama 101,821
          3 Alabama A&M; 21,000
          4 Alabama St. 26,500
          5 Albany (NY) 8,500
          6 Alcorn 22,500
          7 Appalachian St. 30,000
          8 Arizona 55,675
          9 Arizona St. 64,248
          10 Ark.-Pine Bluff 14,500
          11 Arkansas 72,000
          12 Arkansas St. 30,708
          13 Army West Point 38,000
          14 Auburn 87,451
          15 Austin Peay 10,000
          16 BYU 63,470
          17 Ball St. 22,500
          18 Baylor 45,140
          19 Bethune-Cookman 9,601
          20 Boise St. 36,387
          21 Boston College 44,500
          22 Bowling Green 24,000
          23 Brown 20,000
          24 Bucknell 13,100
          25 Buffalo 29,013
          26 Butler 5,647
          27 Cal Poly 11,075
          28 California 62,467
          29 Central Conn. St. 5,500
          .. ... ...
          209 UCLA 91,136
          210 UConn 40,000
          211 UNI 16,324
          212 UNLV 36,800
          213 UT Martin 7,500
          214 UTEP 52,000
          215 Utah 45,807
          216 Utah St. 25,100
          217 VMI 10,000
          218 Valparaiso 5,000
          219 Vanderbilt 40,350
          220 Villanova 12,000
          221 Virginia 61,500
          222 Virginia Tech 65,632
          223 Wagner 3,300
          224 Wake Forest 31,500
          225 Washington 70,138
          226 Washington St. 32,740
          227 Weber St. 17,500
          228 West Virginia 60,000
          229 Western Caro. 13,742
          230 Western Ill. 16,368
          231 Western Ky. 22,113
          232 Western Mich. 30,200
          233 William & Mary 12,400
          234 Wisconsin 80,321
          235 Wofford 13,000
          236 Wyoming 29,181
          237 Yale 64,269
          238 Youngstown St. 20,630

          [239 rows x 2 columns]


          Note:
          This will take a long time. We are scrapping >239 pages. So be patient. Might take 15 mins or longer.






          share|improve this answer


























          • Would it be possible for you to resolve this query :stackoverflow.com/questions/54279547/… I CANNOT find Params under Network data for this query

            – Data_is_Power
            Jan 20 at 23:27











          • @Data_is_Power I will try. I am not at home right now. I will let you know ASAP.

            – Bitto Bennichan
            Jan 20 at 23:29











          • Great. Thank you so much again! :)

            – Data_is_Power
            Jan 20 at 23:31
















          1















          My question is Is something like this possible ?




          Yes.




          If yes,how ?




          There is a lot going in the code below. But the main point is to figure out the post requests being made by the browser and then emulate that using Requests. We can find out the request being made through the "network" tab in the inspect tool.



          First we make the 'search' post request. This gives a left and right table. Clicking on the left table gives us the schools in that area. But if we observe carefully clicking on the area link also is a post request (which we have to do using requests)



          Eg. Clicking on 'Air Force - Eastern Ill.' gives us a table containing the links of schools in that area. Then we have to go to that school link and figure out the capacity.



          Since clicking on each of the school link is also a post request we have to emulate and this returns the school page. From here we scrap the school name and capacity.



          You can read Advanced Usage of requests to know about Session objects, Making a request to read about making request with Requests.



          import requests
          from bs4 import BeautifulSoup
          import pandas as pd
          end_list=
          s = requests.Session()
          URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"
          data={'doWhat': 'teamSearch','searchOrg': 'X', 'academicYear': 2006, 'searchSport':'MFB','searchDiv': 1}
          r = s.post(URL,data=data)
          soup=BeautifulSoup(r.text,'html.parser')
          area_list=soup.find_all('table')[8].find_all('tr')
          area_count=len(area_list)#has no of areas + 1 tr 'Total Results of Search: 239'
          for idx in range(0,area_count):
          data={
          'sortOn': 0,
          'doWhat': 'showIdx',
          'playerId':'' ,'coachId': '',
          'orgId':'' ,
          'academicYear':'' ,
          'division':'' ,
          'sportCode':'' ,
          'idx': idx
          }
          r = s.post(URL,data=data)
          soup=BeautifulSoup(r.text,'html.parser')
          last_table=soup.find_all('table')[-1]#last table
          for tr in last_table.find_all('tr'):
          link_td=tr.find('td',class_="text")
          try:
          link_a=link_td.find('a')['href']
          data_params=link_a.split('(')[1][:-2].split(',')
          try:
          #print(data_params)
          sports_code=data_params[2].replace("'","").strip()
          division=int(data_params[3])
          player_coach_id=int(data_params[0])
          academic_year=int(data_params[1])
          org_id=int(data_params[4])
          #print(sports_code,division,player_coach_id,academic_year,org_id)
          data={
          'sortOn': 0,
          'doWhat': 'display',
          'playerId': player_coach_id,
          'coachId': player_coach_id,
          'orgId': org_id,
          'academicYear': academic_year,
          'division':division,
          'sportCode':sports_code,
          'idx':''
          }
          url='http://web1.ncaa.org/stats/StatsSrv/careerteam'
          r = s.post(url,data=data)
          soup2=BeautifulSoup(r.text,'html.parser')
          institution_name=soup2.find_all('table')[1].find_all('tr')[2].find_all('td')[1].text.strip()
          capacity=soup2.find_all('table')[4].find_all('tr')[2].find_all('td')[1].text.strip()
          #print([institution_name, capacity])
          end_list.append([institution_name, capacity])

          except IndexError:
          pass

          except AttributeError:
          pass
          #print(end_list)
          headers=['School','Capacity']
          df=pd.DataFrame(end_list, columns=headers)
          print(df)


          Output



                          School Capacity
          0 Air Force 46,692
          1 Akron 30,000
          2 Alabama 101,821
          3 Alabama A&M; 21,000
          4 Alabama St. 26,500
          5 Albany (NY) 8,500
          6 Alcorn 22,500
          7 Appalachian St. 30,000
          8 Arizona 55,675
          9 Arizona St. 64,248
          10 Ark.-Pine Bluff 14,500
          11 Arkansas 72,000
          12 Arkansas St. 30,708
          13 Army West Point 38,000
          14 Auburn 87,451
          15 Austin Peay 10,000
          16 BYU 63,470
          17 Ball St. 22,500
          18 Baylor 45,140
          19 Bethune-Cookman 9,601
          20 Boise St. 36,387
          21 Boston College 44,500
          22 Bowling Green 24,000
          23 Brown 20,000
          24 Bucknell 13,100
          25 Buffalo 29,013
          26 Butler 5,647
          27 Cal Poly 11,075
          28 California 62,467
          29 Central Conn. St. 5,500
          .. ... ...
          209 UCLA 91,136
          210 UConn 40,000
          211 UNI 16,324
          212 UNLV 36,800
          213 UT Martin 7,500
          214 UTEP 52,000
          215 Utah 45,807
          216 Utah St. 25,100
          217 VMI 10,000
          218 Valparaiso 5,000
          219 Vanderbilt 40,350
          220 Villanova 12,000
          221 Virginia 61,500
          222 Virginia Tech 65,632
          223 Wagner 3,300
          224 Wake Forest 31,500
          225 Washington 70,138
          226 Washington St. 32,740
          227 Weber St. 17,500
          228 West Virginia 60,000
          229 Western Caro. 13,742
          230 Western Ill. 16,368
          231 Western Ky. 22,113
          232 Western Mich. 30,200
          233 William & Mary 12,400
          234 Wisconsin 80,321
          235 Wofford 13,000
          236 Wyoming 29,181
          237 Yale 64,269
          238 Youngstown St. 20,630

          [239 rows x 2 columns]


          Note:
          This will take a long time. We are scrapping >239 pages. So be patient. Might take 15 mins or longer.






          share|improve this answer


























          • Would it be possible for you to resolve this query :stackoverflow.com/questions/54279547/… I CANNOT find Params under Network data for this query

            – Data_is_Power
            Jan 20 at 23:27











          • @Data_is_Power I will try. I am not at home right now. I will let you know ASAP.

            – Bitto Bennichan
            Jan 20 at 23:29











          • Great. Thank you so much again! :)

            – Data_is_Power
            Jan 20 at 23:31














          1












          1








          1








          My question is Is something like this possible ?




          Yes.




          If yes,how ?




          There is a lot going in the code below. But the main point is to figure out the post requests being made by the browser and then emulate that using Requests. We can find out the request being made through the "network" tab in the inspect tool.



          First we make the 'search' post request. This gives a left and right table. Clicking on the left table gives us the schools in that area. But if we observe carefully clicking on the area link also is a post request (which we have to do using requests)



          Eg. Clicking on 'Air Force - Eastern Ill.' gives us a table containing the links of schools in that area. Then we have to go to that school link and figure out the capacity.



          Since clicking on each of the school link is also a post request we have to emulate and this returns the school page. From here we scrap the school name and capacity.



          You can read Advanced Usage of requests to know about Session objects, Making a request to read about making request with Requests.



          import requests
          from bs4 import BeautifulSoup
          import pandas as pd
          end_list=
          s = requests.Session()
          URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"
          data={'doWhat': 'teamSearch','searchOrg': 'X', 'academicYear': 2006, 'searchSport':'MFB','searchDiv': 1}
          r = s.post(URL,data=data)
          soup=BeautifulSoup(r.text,'html.parser')
          area_list=soup.find_all('table')[8].find_all('tr')
          area_count=len(area_list)#has no of areas + 1 tr 'Total Results of Search: 239'
          for idx in range(0,area_count):
          data={
          'sortOn': 0,
          'doWhat': 'showIdx',
          'playerId':'' ,'coachId': '',
          'orgId':'' ,
          'academicYear':'' ,
          'division':'' ,
          'sportCode':'' ,
          'idx': idx
          }
          r = s.post(URL,data=data)
          soup=BeautifulSoup(r.text,'html.parser')
          last_table=soup.find_all('table')[-1]#last table
          for tr in last_table.find_all('tr'):
          link_td=tr.find('td',class_="text")
          try:
          link_a=link_td.find('a')['href']
          data_params=link_a.split('(')[1][:-2].split(',')
          try:
          #print(data_params)
          sports_code=data_params[2].replace("'","").strip()
          division=int(data_params[3])
          player_coach_id=int(data_params[0])
          academic_year=int(data_params[1])
          org_id=int(data_params[4])
          #print(sports_code,division,player_coach_id,academic_year,org_id)
          data={
          'sortOn': 0,
          'doWhat': 'display',
          'playerId': player_coach_id,
          'coachId': player_coach_id,
          'orgId': org_id,
          'academicYear': academic_year,
          'division':division,
          'sportCode':sports_code,
          'idx':''
          }
          url='http://web1.ncaa.org/stats/StatsSrv/careerteam'
          r = s.post(url,data=data)
          soup2=BeautifulSoup(r.text,'html.parser')
          institution_name=soup2.find_all('table')[1].find_all('tr')[2].find_all('td')[1].text.strip()
My question is: is something like this possible?




          Yes.




If yes, how?




There is a lot going on in the code below, but the main point is to figure out the POST requests the browser makes and then emulate them using Requests. We can see each request being made in the "Network" tab of the browser's developer tools.



First we make the 'search' POST request. This returns a left and a right table. Clicking an entry in the left table lists the schools in that area. If we observe carefully, clicking on the area link is itself another POST request, which we also have to reproduce with Requests.



E.g. clicking on 'Air Force - Eastern Ill.' gives us a table containing links to the schools in that area. We then have to follow each school link and extract the capacity.



Clicking each of the school links is yet another POST request we have to emulate; it returns the school's page, from which we scrape the school name and stadium capacity.
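Those school links are JavaScript calls rather than plain URLs, so the POST parameters have to be pulled out of the href string. A minimal sketch of that extraction (the exact href value below is an illustrative assumption, not copied from the page):

```python
# Example href in the shape the code below expects (sample value, not real data)
href = "javascript:doSearch(30123,2006,'MFB',1,721);"

# Take the text between '(' and ');', then split on commas
params = href.split('(')[1][:-2].split(',')

player_coach_id = int(params[0])                    # 30123
academic_year = int(params[1])                      # 2006
sports_code = params[2].replace("'", "").strip()    # 'MFB'
division = int(params[3])                           # 1
org_id = int(params[4])                             # 721
```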



You can read Advanced Usage of Requests to learn about Session objects, and Making a request to learn how to make requests with Requests.



          import requests
          from bs4 import BeautifulSoup
          import pandas as pd

          end_list = []
          s = requests.Session()
          URL = "http://web1.ncaa.org/stats/StatsSrv/careersearch"
          data = {'doWhat': 'teamSearch', 'searchOrg': 'X', 'academicYear': 2006,
                  'searchSport': 'MFB', 'searchDiv': 1}
          r = s.post(URL, data=data)
          soup = BeautifulSoup(r.text, 'html.parser')
          area_list = soup.find_all('table')[8].find_all('tr')
          # number of areas + 1 extra tr for 'Total Results of Search: 239'
          area_count = len(area_list)
          for idx in range(0, area_count):
              data = {
                  'sortOn': 0,
                  'doWhat': 'showIdx',
                  'playerId': '', 'coachId': '',
                  'orgId': '',
                  'academicYear': '',
                  'division': '',
                  'sportCode': '',
                  'idx': idx
              }
              r = s.post(URL, data=data)
              soup = BeautifulSoup(r.text, 'html.parser')
              last_table = soup.find_all('table')[-1]  # last table holds the school links
              for tr in last_table.find_all('tr'):
                  link_td = tr.find('td', class_="text")
                  try:
                      link_a = link_td.find('a')['href']
                      # pull the POST parameters out of the JavaScript href
                      data_params = link_a.split('(')[1][:-2].split(',')
                      try:
                          sports_code = data_params[2].replace("'", "").strip()
                          division = int(data_params[3])
                          player_coach_id = int(data_params[0])
                          academic_year = int(data_params[1])
                          org_id = int(data_params[4])
                          data = {
                              'sortOn': 0,
                              'doWhat': 'display',
                              'playerId': player_coach_id,
                              'coachId': player_coach_id,
                              'orgId': org_id,
                              'academicYear': academic_year,
                              'division': division,
                              'sportCode': sports_code,
                              'idx': ''
                          }
                          url = 'http://web1.ncaa.org/stats/StatsSrv/careerteam'
                          r = s.post(url, data=data)
                          soup2 = BeautifulSoup(r.text, 'html.parser')
                          institution_name = soup2.find_all('table')[1].find_all('tr')[2].find_all('td')[1].text.strip()
                          capacity = soup2.find_all('table')[4].find_all('tr')[2].find_all('td')[1].text.strip()
                          end_list.append([institution_name, capacity])
                      except IndexError:
                          pass
                  except AttributeError:
                      pass

          headers = ['School', 'Capacity']
          df = pd.DataFrame(end_list, columns=headers)
          print(df)
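Since the question asks for a pandas DataFrame, note that Capacity is scraped as a string with thousands separators. If you want it numeric (for sorting or arithmetic), one way, using the same column names as above with a small sample of the scraped rows, is:

```python
import pandas as pd

# Sample rows in the same shape as end_list above
df = pd.DataFrame([['Alabama', '101,821'], ['Akron', '30,000']],
                  columns=['School', 'Capacity'])

# Strip the thousands separator and cast to int
df['Capacity'] = df['Capacity'].str.replace(',', '', regex=False).astype(int)

print(df.sort_values('Capacity', ascending=False))
```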


          Output



                          School Capacity
          0 Air Force 46,692
          1 Akron 30,000
          2 Alabama 101,821
          3 Alabama A&M 21,000
          4 Alabama St. 26,500
          5 Albany (NY) 8,500
          6 Alcorn 22,500
          7 Appalachian St. 30,000
          8 Arizona 55,675
          9 Arizona St. 64,248
          10 Ark.-Pine Bluff 14,500
          11 Arkansas 72,000
          12 Arkansas St. 30,708
          13 Army West Point 38,000
          14 Auburn 87,451
          15 Austin Peay 10,000
          16 BYU 63,470
          17 Ball St. 22,500
          18 Baylor 45,140
          19 Bethune-Cookman 9,601
          20 Boise St. 36,387
          21 Boston College 44,500
          22 Bowling Green 24,000
          23 Brown 20,000
          24 Bucknell 13,100
          25 Buffalo 29,013
          26 Butler 5,647
          27 Cal Poly 11,075
          28 California 62,467
          29 Central Conn. St. 5,500
          .. ... ...
          209 UCLA 91,136
          210 UConn 40,000
          211 UNI 16,324
          212 UNLV 36,800
          213 UT Martin 7,500
          214 UTEP 52,000
          215 Utah 45,807
          216 Utah St. 25,100
          217 VMI 10,000
          218 Valparaiso 5,000
          219 Vanderbilt 40,350
          220 Villanova 12,000
          221 Virginia 61,500
          222 Virginia Tech 65,632
          223 Wagner 3,300
          224 Wake Forest 31,500
          225 Washington 70,138
          226 Washington St. 32,740
          227 Weber St. 17,500
          228 West Virginia 60,000
          229 Western Caro. 13,742
          230 Western Ill. 16,368
          231 Western Ky. 22,113
          232 Western Mich. 30,200
          233 William & Mary 12,400
          234 Wisconsin 80,321
          235 Wofford 13,000
          236 Wyoming 29,181
          237 Yale 64,269
          238 Youngstown St. 20,630

          [239 rows x 2 columns]


          Note:
          This will take a long time. We are scraping 239+ pages, so be patient; it might take 15 minutes or longer.
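One way to be gentler on the server during that long run is to pause briefly before each request. A sketch using a hypothetical `polite_post` wrapper (not part of the original code; the delay value is arbitrary):

```python
import time

def polite_post(post_func, url, data, delay=0.5):
    # Sleep before each request so we don't hammer the server
    time.sleep(delay)
    return post_func(url, data=data)

# In the loop above you would call:
#   r = polite_post(s.post, URL, data)
# Demo with a stand-in for requests.Session().post:
result = polite_post(lambda url, data: ('ok', url), 'http://example.com', {}, delay=0.1)
print(result)
```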







          edited Jan 19 at 6:30

























          answered Jan 19 at 6:12









          Bitto Bennichan

          2,1181120

















          • Would it be possible for you to resolve this query: stackoverflow.com/questions/54279547/… I CANNOT find Params under Network data for this query

            – Data_is_Power
            Jan 20 at 23:27











          • @Data_is_Power I will try. I am not at home right now. I will let you know ASAP.

            – Bitto Bennichan
            Jan 20 at 23:29











          • Great. Thank you so much again! :)

            – Data_is_Power
            Jan 20 at 23:31



































