Extracting data with CSS selectors is one of the most powerful ways to scrape web pages with jsoup in Java. The entry point is the Element.select(String cssQuery) method, which works much like JavaScript's querySelectorAll(): it returns every element that matches the selector.

1. The Core Method: select()
The select() method can be called on a Document, an individual Element, or a collection of Elements. It always returns an Elements object, a list-like collection of the matching elements that is empty (never null) when nothing matches.
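To make the three call sites concrete, here is a minimal sketch. The HTML string and the SelectScopes class are hypothetical, invented only for this illustration; the jsoup calls themselves (Jsoup.parse, getElementById, select, size, isEmpty) are standard API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectScopes {
    // Hypothetical markup, made up for this example.
    static final String HTML =
        "<div id='news'><ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li></ul></div>"
      + "<div id='ads'><a href='/x'>X</a></div>";

    // Document scope: searches the whole page.
    static int linksInDocument() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("a").size();
    }

    // Element scope: searches only beneath the #news element.
    static int linksInNews() {
        Element news = Jsoup.parse(HTML).getElementById("news");
        return news.select("a").size();
    }

    // Elements scope: the query runs against every <li> in the collection.
    static int linksInListItems() {
        Elements items = Jsoup.parse(HTML).select("li");
        return items.select("a").size();
    }

    // A selector with no match yields an empty Elements, not null.
    static boolean emptyWhenNoMatch() {
        return Jsoup.parse(HTML).select("table").isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("document: " + linksInDocument());
        System.out.println("#news:    " + linksInNews());
        System.out.println("li items: " + linksInListItems());
        System.out.println("no match is empty: " + emptyWhenNoMatch());
    }
}
```

Narrowing the scope first (an Element or Elements) and selecting inside it is usually easier to read than one long selector. In its simplest form, the call on a parsed Document looks like this: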
    // Parse an HTML document
    Document doc = Jsoup.parse(htmlString);

    // Select all elements matching your CSS selector
    Elements items = doc.select("div.article-card");

2. Common CSS Selector Patterns
The jsoup Selector API supports standard CSS3 pattern matching:

| Selector | Description | Example | Matches |
|---|---|---|---|
| tag | Selects by HTML tag name | doc.select("p") | All paragraph elements |
| #id | Selects by unique ID attribute | doc.select("#main-nav") | The element with id="main-nav" |
| .class | Selects by class name | doc.select(".author-name") | All elements containing that class |
| [attr] | Elements with a specific attribute | doc.select("a[href]") | All anchor tags that have links |
| [attr=val] | Matches an attribute value | doc.select("img[alt=Logo]") | Image tags whose alt text is "Logo" |
| [attr$=val] | Attribute ends with a value | doc.select("img[src$=.png]") | Image tags pointing to PNG files |

3. Combining and Nesting Selectors
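Before combining them, each single pattern from the table in section 2 can be checked on its own. The sketch below is only an illustration under stated assumptions: the markup and the PatternDemo class are made up for this example, and count() is a hypothetical helper that reports how many elements a selector matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PatternDemo {
    // Hypothetical markup covering each pattern from the table.
    static final String HTML =
        "<nav id='main-nav'><a href='/home'>Home</a><a>No link</a></nav>"
      + "<p>One</p><p class='author-name'>Ann</p>"
      + "<img src='logo.png' alt='Logo'><img src='photo.jpg' alt='Photo'>";

    // Hypothetical helper: number of elements matched by the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "p", "#main-nav", ".author-name", "a[href]", "img[alt=Logo]", "img[src$=.png]"
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector) + " match(es)");
        }
    }
}
```

Note that a[href] skips the second anchor because it has no href attribute at all, which is a convenient way to ignore placeholder links.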
You can combine patterns to drill down into complex HTML trees:
Descendant Selector (Space): Finds matching descendants at any depth (children, grandchildren, and so on). doc.select("div.sidebar a") finds all <a> tags anywhere inside a <div class="sidebar">.
Direct Child Selector (>): Finds immediate children only. doc.select("ul.menu > li") targets only <li> elements that are direct children of a <ul class="menu">.
Multiple Classes: Chain classes together without spaces. doc.select("button.btn.btn-success") matches <button> elements that have both classes.

4. Extracting the Data
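Before extracting anything, it is worth confirming which elements each combinator from section 3 actually matches. The following is a minimal sketch under assumptions: the markup, the CombinatorDemo class, and its count() helper are hypothetical and exist only to contrast the descendant, direct-child, and chained-class forms.

```java
import org.jsoup.Jsoup;

class CombinatorDemo {
    // Hypothetical markup: a sidebar with nested links, a menu with a
    // nested sub-list, and two buttons with different class sets.
    static final String HTML =
        "<div class='sidebar'><p><a href='/1'>one</a></p><a href='/2'>two</a></div>"
      + "<ul class='menu'><li>Top 1<ul><li>Nested</li></ul></li><li>Top 2</li></ul>"
      + "<button class='btn btn-success'>Save</button><button class='btn'>Cancel</button>";

    // Hypothetical helper: number of elements matched by the selector.
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        // Descendant: both links, regardless of nesting depth.
        System.out.println("div.sidebar a          -> " + count("div.sidebar a"));
        // Descendant vs. direct child on the same list.
        System.out.println("ul.menu li             -> " + count("ul.menu li"));
        System.out.println("ul.menu > li           -> " + count("ul.menu > li"));
        // Chained classes: only the button carrying both classes.
        System.out.println("button.btn             -> " + count("button.btn"));
        System.out.println("button.btn.btn-success -> " + count("button.btn.btn-success"));
    }
}
```

The nested <li> belongs to an inner <ul> without the menu class, so the descendant form counts it while the direct-child form does not.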
Finding the elements is only the first step. You then use jsoup data extraction methods to pull the actual text or attributes out of the matched targets:
Extract Text: Use .text() to get the combined text inside an element and its children.
Extract Attributes: Use .attr("attribute_name") to get values like URL links (href) or source files (src).
Extract Inner HTML: Use .html() if you need to retain the inner raw HTML markup.

Complete Implementation Example
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class Scraper {
        public static void main(String[] args) {
            // Sample markup, assumed for illustration; it only needs to
            // match the selectors used below.
            String html =
                "<h1 id='title'>Product Catalog</h1>"
              + "<div class='container'>"
              + "<div class='product'><a href='/products/1'><span class='name'>Widget</span></a></div>"
              + "<div class='product'><a href='/products/2'><span class='name'>Gadget</span></a></div>"
              + "</div>";

            Document doc = Jsoup.parse(html);

            // 1. Get text from a single ID element
            String pageTitle = doc.select("#title").text();
            System.out.println("Page Title: " + pageTitle);

            // 2. Iterate through elements and extract links/text
            Elements products = doc.select(".container .product");
            for (Element product : products) {
                // Contextual selection inside the current product element
                String productName = product.select(".name").text();
                String productLink = product.select("a").attr("href");
                System.out.println("Product: " + productName + " -> URL: " + productLink);
            }
        }
    }
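The three extraction methods from section 4 return different views of the same element, which is easiest to see side by side. The sketch below is illustrative only: the markup, the ExtractDemo class, and the https://example.com/ base URI are assumptions made for this example. It also uses selectFirst() and absUrl(), two jsoup methods not covered above, to fetch a single element and to resolve a relative link.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ExtractDemo {
    // Hypothetical markup with a relative link and nested inline markup.
    static final String HTML =
        "<div class='post'><a href='/read/42' title='Full story'>Read <b>more</b></a></div>";

    // Returns the first matching link. The base URI (assumed here) lets
    // jsoup resolve relative URLs later.
    static Element link() {
        Document doc = Jsoup.parse(HTML, "https://example.com/");
        return doc.selectFirst("div.post a");
    }

    public static void main(String[] args) {
        Element link = link();
        System.out.println("text():    " + link.text());           // tags stripped
        System.out.println("html():    " + link.html());           // inner markup kept
        System.out.println("href:      " + link.attr("href"));     // value as written
        System.out.println("absolute:  " + link.absUrl("href"));   // resolved against base URI
        // A missing attribute yields an empty string rather than null.
        System.out.println("missing:   '" + link.attr("data-id") + "'");
    }
}
```

Because selectFirst() returns null when nothing matches, real scraping code would normally check the result before calling methods on it.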