I was just made aware of the “Send HTTP request” piece, which can read a URL and fetch its content with a GET request. Very nice.
But I'm struggling.
It returns all of the page's HTML, but I only want the content without the CSS, JavaScript, head, etc.
How can I tell this piece to clean up the content so I only get the text?
Or should I use another piece to do just that?
I want to pass this to OpenAI for further work, but for now I can’t because the full HTML from the URL is far too big. I would like to generate a summary, extract the headlines, etc.
You can parse the content with a Code step. @abuaboud and @thisthatjosh have worked on something like this before, so maybe they can share the code or a sample flow here.
I also think we need a piece that does HTML parsing, and I’ll start advocating for it within the team soon. Many use cases need this (especially monitors).
You have two options here; both involve the Code piece.
The first is to install an npm package such as cheerio (`npm install cheerio`) and use it to extract a specific section from the page data returned by the GET request.
For example, if the web page’s main content sits in a div with the id `main-content`, you could do something like the following:
```javascript
export const code = async (params) => {
  // Import the 'cheerio' library.
  const cheerio = require('cheerio');

  // Load the fetched HTML (assumed to be stored in params.htmlString in your Key:Value)
  // into Cheerio. '$' is the Cheerio instance, ready to query DOM elements just like jQuery.
  const $ = cheerio.load(params.htmlString);

  // Select the element with the ID 'main-content' and call '.html()'
  // to get its inner HTML as a string.
  const mainContent = $('#main-content').html();

  // Return the inner HTML of the 'main-content' section.
  // If the '#main-content' element is not found, 'null' is returned.
  return mainContent;
};
```
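Or, if you prefer plain JavaScript, you MIGHT be able to use the DOM API to extract the content. Note that `DOMParser` is a browser API and is not a built-in global in Node.js, so this will only work if the environment the Code step runs in provides it. Something like the following: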
```javascript
export const code = async (params) => {
  // Create a new DOMParser object.
  const parser = new DOMParser();

  // Parse the HTML data (stored in params.htmlString, your Key:Value pair) into a
  // Document object that lets us find and work with different parts of the page.
  const doc = parser.parseFromString(params.htmlString, 'text/html');

  // Find the section that has an ID of 'main-content'.
  const mainContent = doc.querySelector('#main-content');

  // If the section was found, take its inner HTML; otherwise use 'null'.
  const mainContentHTML = mainContent ? mainContent.innerHTML : null;

  // Return the content we found (or 'null' if nothing matched) so it can be used later.
  return mainContentHTML;
};
```
I haven’t tested these, so you might need to fiddle around with them, and obviously you will need to change the selector to the tag, class, or id that you want to target. I did something similar to extract the meta description from the page:
```javascript
export const code = async (params) => {
  // Create a new DOMParser object, a tool that reads the structure of a web page.
  const parser = new DOMParser();

  // Parse the HTML data (stored in params.htmlString) into a Document object.
  const doc = parser.parseFromString(params.htmlString, 'text/html');

  // Find the meta element whose name attribute is "description".
  const metaDescriptionElement = doc.querySelector('meta[name="description"]');

  // Read its content attribute to get the description text.
  // If the element doesn't exist, use null to indicate that no description was found.
  const metaDescription = metaDescriptionElement ? metaDescriptionElement.getAttribute('content') : null;

  // Return the meta description text (or null) so it can be used later.
  return metaDescription;
};
```
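If neither cheerio nor `DOMParser` is available in your Code step, a dependency-free fallback is possible. The following is only a rough sketch: `htmlToText` is a hypothetical helper name, and regex-based stripping is crude compared with a real HTML parser, so it can be fooled by unusual markup and only decodes a handful of common entities.

```javascript
// Hypothetical helper (not part of any library): crude HTML-to-text conversion.
function htmlToText(html) {
  return html
    // Drop whole blocks whose contents are never readable text.
    .replace(/<(head|script|style|noscript)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, ' ')
    // Replace every remaining tag with a space so adjacent words stay separate.
    .replace(/<[^>]+>/g, ' ')
    // Decode a few common entities ('&amp;' last, to avoid double decoding).
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&')
    // Collapse runs of whitespace into single spaces.
    .replace(/\s+/g, ' ')
    .trim();
}
```

Inside the Code piece you would call it from your `code` function, for example `return htmlToText(params.htmlString);`, assuming the HTML is passed in under the `htmlString` key as in the examples above.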
```typescript
import * as cheerio from 'cheerio';

export const code = async (inputs: any) => {
  try {
    // Use fetch to make a request to the website
    const response = await fetch(inputs.url);

    // Check if the request was successful
    if (!response.ok) {
      throw new Error('Failed to fetch the website');
    }

    // Convert the response to text
    const html = await response.text();

    // Load the HTML into Cheerio
    const $ = cheerio.load(html);

    // Remove images, script tags, and style tags from the HTML
    $('img').remove();
    $('script').remove();
    $('style').remove();

    // Remove inline style attributes from elements
    $('[style]').removeAttr('style');

    // Get the text content without images, HTML tags, JavaScript, and CSS
    const textContent = $('body').text();

    return textContent;
  } catch (error) {
    // Handle any errors that might occur during the fetch or processing
    console.error('Error:', error instanceof Error ? error.message : error);
    return null; // Or return an appropriate error response
  }
};
```
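Text extracted this way usually still contains long runs of whitespace, and it may still be too long to send to OpenAI in one go. As a minimal sketch, the text could be normalised and cut to a size budget before the next step. `clampText` and `maxChars` are hypothetical names, and a character limit is only a rough stand-in for the model's actual token limit.

```javascript
// Hypothetical helper: collapse whitespace and cap the text at maxChars characters,
// preferring to cut at a word boundary.
function clampText(text, maxChars) {
  // Collapse newlines, tabs, and repeated spaces into single spaces.
  const normalized = text.replace(/\s+/g, ' ').trim();
  if (normalized.length <= maxChars) return normalized;

  const cut = normalized.slice(0, maxChars);
  // If the cut landed exactly at the end of a word, keep it as is.
  if (normalized[maxChars] === ' ') return cut.trimEnd();

  // Otherwise back up to the last space so no word is cut in half.
  const lastSpace = cut.lastIndexOf(' ');
  return lastSpace > 0 ? cut.slice(0, lastSpace) : cut;
}
```

In the flow above, this would mean returning something like `clampText(textContent, 8000)` instead of `textContent`; the value 8000 is an arbitrary placeholder to tune for your model.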