Iterating over multiple URLs with BS4 - Parser

tarifa · 27 April 2022

Hello and good evening,

this is about a parser that is supposed to pull information out of an HTML page. In total I have quite a few of these pages,

see here: https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool

Here is an excerpt of the URLs:

Python:

'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view',
'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/17865/view',
'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1416/view'

But first this is about parsing a single page: I want to extract the description of the respective hub.

My approach:

Python:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html5lib')

# The name of the hub is in the <h2> tag;
# the section titles (Contact Data, Description, ...) are in <h4> tags.

hubname = soup.find('h2').text

# The description info is inside a <div class='hubCardContent'>.

description = soup.find("div", class_="hubCardContent")

cardinfo = {}

# Collect the <p> tags inside that div; the infoLabel class marks a section header.
for data in description.find_all('p'):
    if 'infoLabel' in data.attrs.get('class', []):
        title = data.text
        cardinfo[title] = []
    else:
        cardinfo[title].append( data.text )

# The contact info is in a <div class='infoMultiValue'> inside that div.
# for data in description.find_all('div', class_='infoMultiValue'):
#     cardinfo['Contact information'].append(data.text)

print("---")
print(hubname)
print("---")
pprint(cardinfo)

I can't get at the description info:

I want to parse the description info on the page and, so to speak, pull it out. That, however, does not work. Whatever I do, I end up with the contact info.

The target page is this one: https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view
And it is about this Description, i.e. the respective project description.

By the way: iterating over the various URLs, I would do it roughly like this:

Python:

import requests
import bs4
from time import sleep
URLs = ['https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view',
        'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/17865/view',
        'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1416/view'
        ]

def getPage(url):
    print('Indexing {0}......'.format(url))
    result = requests.get(url)
    print('Url Indexed...Now pausing 50secs before next ')
    sleep(50)
    return result

results = map(getPage, URLs)
for result in results:
    # soup = bs4.BeautifulSoup(result.text,"lxml")
    soup = bs4.BeautifulSoup(result.text,"html.parser")
    print(soup.find_all('p'))

The iteration itself should basically work.
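
A minimal sketch of how fetching and parsing could be separated, so that the loop can be tried out without hitting the server. The names fetch_html, scrape_all, parse and delay are made up for this illustration and are not part of requests or BeautifulSoup; the 5-second default pause is likewise only an assumption.

```python
import time


def fetch_html(url):
    """Download one page and return its HTML text."""
    # Imported here so the rest of the sketch also runs without requests installed.
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.text


def scrape_all(urls, parse, fetch=fetch_html, delay=5.0):
    """Fetch each URL in turn, apply parse() to the HTML, pause between requests."""
    results = {}
    for position, url in enumerate(urls):
        if position:  # no pause before the very first request
            time.sleep(delay)
        results[url] = parse(fetch(url))
    return results
```

With the URL list above this could be called as scrape_all(URLs, parse=len) to get the page sizes, or with any function that turns the HTML text into a dict.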

playerthreeone · 27 April 2022

"hubCardContent" simply occurs more than once.

Pako1997 · 27 April 2022

Can't you simply access the nth 'hubCardContent'? Something like:

Python:

cardinfo = soup.select('"div", class_="hubCardContent"')[2].text

Though that should be the 2nd hubCardContent.

And next time post Python as code, then you get the highlighting.
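
For illustration, a small sketch of the difference between find() and find_all(), run against a made-up HTML snippet. SAMPLE_HTML is an assumption and heavily simplified; the real page will look different. find() only ever returns the first match, while find_all() returns all matches and can be indexed, counting from zero.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a hub page, not the real markup.
SAMPLE_HTML = """
<h2>Example Hub</h2>
<div class="hubCard">
  <div class="hubCardTitle"><h4>Contact Data</h4></div>
  <div class="hubCardContent">
    <p class="infoLabel">Coordinator</p>
    <p>Example University</p>
  </div>
</div>
<div class="hubCard">
  <div class="hubCardTitle"><h4>Description</h4></div>
  <div class="hubCardContent">
    <p class="infoLabel">Description</p>
    <p>An example description.</p>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first = soup.find("div", class_="hubCardContent")      # first match only
cards = soup.find_all("div", class_="hubCardContent")  # every match, in page order

print(len(cards))                          # number of matching divs
print(first.get_text(" ", strip=True))     # text of the first card
print(cards[1].get_text(" ", strip=True))  # text of the second card (index 1)
```

Because the index is zero-based, [2] would address the third card, not the second.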

tarifa · 27 April 2022

Hello @Pako1997 and @playerthreeone,
many thanks for your hints. I'll take a closer look at this later.

It also became clear to me why the hub card always "spat out" the contact info first: it sits right at the top and is the first info after the hub title.

Python:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html5lib')

# The name of the hub is in the <h2> tag.

hubname = soup.find('h2').text

# All description info is within a <div class='hubCardContent'>.

description = soup.find("div", class_="hubCardContent")

cardinfo = {}

cardinfo = soup.select('"div", class_="hubCardContent"')[2].text

#Grab all the <p> tags inside that div. the infoLabel class marks
#the section header.

for data in description.find_all('p'):
    if 'infoLabel' in data.attrs.get('class', []):
        title = data.text
        cardinfo[title] = []
    else:
        cardinfo[title].append( data.text )

# The description info is in a <div> inside that div.

#for data in description.find_all('div', class_='infoMultiValue'):
#  cardinfo['Contact information'].append( data.text)

print("---")
print(hubname)
print("---")
pprint(cardinfo)

This still throws an error:

Python:

 1552             else:
   1553                 raise ValueError(
-> 1554                     'Unsupported or invalid CSS selector: "%s"' % token)
   1555             if recursive_candidate_generator:
   1556                 # This happens when the selector looks like  "> foo".
ValueError: Unsupported or invalid CSS selector: "class_=hubCardContent"

I'll take a closer look at this again tomorrow...
Above all, I'll take another close look at the page source, because I somehow think it must still be possible to parse the whole page intelligently.

Pako1997 · 27 April 2022

It may well be that select("class_=hubCardContent") is not supported.
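
A short sketch of what should work instead: select() expects a CSS selector string, whereas class_= is a keyword argument of find() and find_all(). In CSS the class is written as a dot after the tag name. The HTML below is a made-up, simplified stand-in (SAMPLE_HTML is an assumption), not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a hub page, not the real markup.
SAMPLE_HTML = """
<div class="hubCard">
  <div class="hubCardTitle"><h4>Contact Data</h4></div>
  <div class="hubCardContent">
    <p class="infoLabel">Coordinator</p>
    <p>Example University</p>
  </div>
</div>
<div class="hubCard">
  <div class="hubCardTitle"><h4>Description</h4></div>
  <div class="hubCardContent">
    <p class="infoLabel">Description</p>
    <p>An example description.</p>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS syntax: tag name, a dot, then the class name.
cards = soup.select("div.hubCardContent")

# Selectors can be chained: all infoLabel paragraphs inside any card content.
labels = [p.get_text(strip=True) for p in soup.select("div.hubCardContent p.infoLabel")]

print(len(cards))
print(labels)
```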

playerthreeone · 28 April 2022

A side note:
"id=yui_patched_v3_11_0_1_[*]_465"
The elements are numbered consecutively in the last position.
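
If such ids are actually present in the fetched HTML, they could be matched with a regular expression, roughly as sketched here. The sample markup is invented, and the sketch assumes that the [*] placeholder stands for a run of digits; both are assumptions, not facts about the real page.

```python
import re

from bs4 import BeautifulSoup

# Invented example markup; the middle number is an assumed value for the [*] part.
SAMPLE_HTML = """
<div id="yui_patched_v3_11_0_1_1651000000000_465">first</div>
<div id="yui_patched_v3_11_0_1_1651000000000_466">second</div>
<div id="other">ignored</div>
"""

# Capture the trailing counter; the variable middle part is matched but not kept.
ID_PATTERN = re.compile(r"^yui_patched_v3_11_0_1_\d+_(\d+)$")

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

numbered = {}
for tag in soup.find_all(id=ID_PATTERN):  # find_all accepts a compiled regex as attribute filter
    counter = int(ID_PATTERN.match(tag["id"]).group(1))
    numbered[counter] = tag.get_text(strip=True)

print(numbered)
```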

tarifa · 28 April 2022

Hello @Pako1997, hello @playerthreeone,
many thanks for your replies.

Good thoughts, both of them!

That would be really great: if this could somehow be optimised so that the page could simply be parsed via the consecutive ids.

playerthreeone, I see it the same way you do regarding the numbering. However, I have only found it again like this in the contact info:

"id=yui_patched_v3_11_0_1_[*]_465"
The elements are numbered consecutively in the last position.

Apart from that: for the rest, I think one can also work with the tags. But it may be even better to look for the structural elements:

Python:

# Python program to print all heading tags
import requests
from bs4 import BeautifulSoup
 
# fetch the page content
url_link = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
request = requests.get(url_link)
 
Soup = BeautifulSoup(request.text, 'lxml')
 
# creating a list of all common heading tags
heading_tags = ["h1", "h2", "h3", "h4"]
for tags in Soup.find_all(heading_tags):
    print(tags.name + ' -> ' + tags.text.strip())

Python:

h1 -> Smart Specialisation Platform
h1 -> Digital Innovation Hubs

Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies

Python:

class: hubCard
class: hubCardTitle
class: hubCardContent
class: infoLabel >Description>
<p> Text - data - content <p>

I'll look into it again, because your approach with the structured numbering is really very helpful and promising.
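
Independently of the ids, the class structure sketched above could be turned into a dict keyed by the h4 titles, roughly like this. It is only a sketch: parse_cards and SAMPLE_HTML are made-up names, and it assumes that every hubCard div contains exactly one h4 title and one hubCardContent div, which would have to be checked against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in that mirrors the hubCard / hubCardTitle / hubCardContent sketch.
SAMPLE_HTML = """
<div class="hubCard">
  <div class="hubCardTitle"><h4>Contact Data</h4></div>
  <div class="hubCardContent">
    <p class="infoLabel">Coordinator</p>
    <p>Example University</p>
  </div>
</div>
<div class="hubCard">
  <div class="hubCardTitle"><h4>Description</h4></div>
  <div class="hubCardContent">
    <p class="infoLabel">Description</p>
    <p>An example description.</p>
  </div>
</div>
"""


def parse_cards(html):
    """Map each card title (h4 text) to the list of its non-label paragraph texts."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for card in soup.find_all("div", class_="hubCard"):
        title = card.find("h4")
        content = card.find("div", class_="hubCardContent")
        if title is None or content is None:
            continue  # skip cards that do not follow the assumed layout
        sections[title.get_text(strip=True)] = [
            p.get_text(" ", strip=True)
            for p in content.find_all("p")
            if "infoLabel" not in p.get("class", [])
        ]
    return sections


print(parse_cards(SAMPLE_HTML)["Description"])
```

Looking things up by title instead of by position means the result does not depend on how many cards precede the description on a given page.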

Conclusion: if this could somehow be optimised so that the page could simply be parsed via the consecutive ids, that would be great!

Many thanks again to both of you!

Update: it looks as if the numbering differs between the various hub cards.

When I look at the different cards (pages), there are some inconsistencies in the numbering. I think that will matter when building a parser.

https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1063/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/13281/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1417/view

Compare here as well, looking at the numbering mentioned above:

playerthreeone · 28 April 2022

This is usually done precisely to make automated parsing harder.

tarifa · 28 April 2022

Yes, sure.

Makes sense.

So the alternatives may be easier, namely these:

Python:

Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies

I'll take another look at it... Many thanks to you!
