[Bug]: The class attribute extracted by crawl4ai is wrong #1841
Closed
Description
crawl4ai version
0.8.0
Expected Behavior
We should be able to select the corresponding element using the css selector "a span.fn".
Current Behavior
The following minimal PoC reproduces the problem:
import asyncio

from crawl4ai import AsyncWebCrawler, JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

schema = {
    "name": "minimal reproducer",
    "baseSelector": "td.change-author",
    "type": "nested_list",
    "fields": [
        {"name": "field1", "selector": "a span", "type": "text"},
        {"name": "field2", "selector": "a span", "type": "attribute", "attribute": "class"},
        {"name": "field3", "selector": "a span.fn", "type": "text"},
    ]
}


async def main():
    browser_config = BrowserConfig()
    crawler_config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler(config=browser_config) as web_crawler:
        result = await web_crawler.arun(
            url="https://bugzilla.mozilla.org/show_bug.cgi?id=1770266",
            config=crawler_config
        )
        if result.success:
            print(result.extracted_content)


if __name__ == "__main__":
    asyncio.run(main())

Running the above poc gives the result:
$ python3 minimal_poc.py
[INIT].... → Crawl4AI 0.8.0
[FETCH]... ↓ https://bugzilla.mozilla.org/show_bug.cgi?id=1770266 | ✓ | ⏱: 2.76s
[SCRAPE].. ◆ https://bugzilla.mozilla.org/show_bug.cgi?id=1770266 | ✓ | ⏱: 0.03s
[EXTRACT]. ■ https://bugzilla.mozilla.org/show_bug.cgi?id=1770266 | ✓ | ⏱: 0.03s
[COMPLETE] ● https://bugzilla.mozilla.org/show_bug.cgi?id=1770266 | ✓ | ⏱: 2.83s
[
    {
        "field1": "Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)",
        "field2": [
            "fna"
        ]
    },
    {
        "field1": "Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)",
        "field2": [
            "fna"
        ]
    },
...

The CSS selector "a span.fn" matches nothing: field3 is missing from every result. Inspecting the class attribute of "a span" shows why: crawl4ai reports the unexpected value "fna" instead of "fn".
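
One way to narrow this down is to check which class values are present in the markup itself, independently of the extraction strategy. The following is a minimal sketch using only the Python standard library. The helper name span_classes_in_anchors and the sample markup are hypothetical (the snippet is not copied from the Bugzilla page), and for brevity it does not restrict the search to td.change-author:

```python
from html.parser import HTMLParser


class SpanClassCollector(HTMLParser):
    """Collect the class attribute of every <span> nested inside an <a>."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # number of <a> elements currently open
        self.span_classes = []  # class values in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag == "span" and self.anchor_depth > 0:
            # attrs is a list of (name, value) pairs; a missing class yields None
            self.span_classes.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1


def span_classes_in_anchors(html):
    """Hypothetical helper: return the class of each <span> inside an <a>."""
    collector = SpanClassCollector()
    collector.feed(html)
    collector.close()
    return collector.span_classes


# Hypothetical sample markup, for illustration only.
sample = '<td class="change-author"><a href="#"><span class="fn">Jane</span></a></td>'
print(span_classes_in_anchors(sample))
```

In the reproducer, the same helper could be called on result.html, assuming that attribute holds the markup crawl4ai fetched. If "fna" already appears there, the value comes from the page as rendered rather than from the CSS extraction step.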
Is this reproducible?
Yes
OS
macOS
Python version
3.14.3
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
No response