Ah, I wasn’t aware I could use it this way, @thisthatjosh. I was wrong! Cool!
How can we tell it to retrieve only the content from the site, and not all the excess HTML? No CSS, no extra markup, etc.
I made a test flow, and the content is too long. I get the following error:
> "message": "This model's maximum context length is 16385 tokens. However, your messages resulted in 22339 tokens. Please reduce the length of the messages."
You have two options here; both involve utilising the code piece!
Install an npm package such as cheerio (npm install cheerio). You can then use it to extract a specific section from the page data gathered in the GET request.
For example, if the web page’s main content lives in a div with the id main-content, you could do something like the following:
export const code = async (params) => {
  // Import the 'cheerio' library.
  const cheerio = require('cheerio');
  // Load the fetched HTML (assumed to be stored in params.htmlString via your Key:Value pair)
  // into cheerio. The '$' is a cheerio instance ready to query the DOM, much like jQuery.
  const $ = cheerio.load(params.htmlString);
  // Select the element with the ID 'main-content' and get its inner HTML as a string.
  // If no '#main-content' element exists, '.html()' returns null.
  const mainContent = $('#main-content').html();
  return mainContent;
};
Or, if you prefer plain JavaScript, you MIGHT be able to use the DOM API to extract the content, with something like the following:
export const code = async (params) => {
  // Create a DOMParser, which turns an HTML string into a Document object
  // that we can query for specific parts of the webpage.
  const parser = new DOMParser();
  // Parse the fetched HTML (assumed to be stored in params.htmlString via your Key:Value pair).
  const doc = parser.parseFromString(params.htmlString, 'text/html');
  // Ask the Document for the element with the ID 'main-content'.
  const mainContent = doc.querySelector('#main-content');
  // If it was found, take its inner HTML; otherwise return null.
  const mainContentHTML = mainContent ? mainContent.innerHTML : null;
  return mainContentHTML;
};
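One caveat: DOMParser is a browser API and isn’t available in Node.js out of the box, so the snippet above may fail depending on where the code piece runs. As a rough, dependency-free fallback you could strip the markup yourself. This is a quick sketch, not a robust HTML parser, and again assumes the page HTML is in params.htmlString:

```javascript
// Strip <script> and <style> blocks (the "excess" CSS/JS), then drop the
// remaining tags, leaving only the visible text. Regexes are a blunt tool
// for HTML, so treat this as a best-effort fallback.
const htmlToText = (htmlString) => htmlString
  .replace(/<script[\s\S]*?<\/script>/gi, '')
  .replace(/<style[\s\S]*?<\/style>/gi, '')
  .replace(/<[^>]+>/g, ' ')
  .replace(/\s+/g, ' ')
  .trim();

// In the code piece:
// export const code = async (params) => htmlToText(params.htmlString);
```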
I haven’t tested this, so you might need to fiddle around with it, and you will obviously need to change the selector to the class or ID you want to target. I did something similar to extract the meta description from the page:
export const code = async (params) => {
  // Create a DOMParser, which turns an HTML string into a Document object.
  const parser = new DOMParser();
  // Parse the fetched HTML (stored in params.htmlString).
  const doc = parser.parseFromString(params.htmlString, 'text/html');
  // Find the meta element whose name attribute is "description".
  const metaDescriptionElement = doc.querySelector('meta[name="description"]');
  // Read its content attribute to get the description text, or null if the element doesn't exist.
  const metaDescription = metaDescriptionElement ? metaDescriptionElement.getAttribute('content') : null;
  return metaDescription;
};
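Whichever approach you use, a very long page can still blow past the limit even after extraction. As a last resort you could truncate the text before handing it to the model. The 4-characters-per-token figure below is only a common rule of thumb, not an exact count:

```javascript
// Hypothetical guard: cap the text at an approximate token budget.
// Roughly 4 characters per token is an approximation for English text.
const truncateToTokenBudget = (text, maxTokens) => {
  const maxChars = maxTokens * 4;
  return text.length <= maxChars ? text : text.slice(0, maxChars);
};

// e.g. stay well under the 16385-token limit from the error message:
// const safeText = truncateToTokenBudget(mainContent, 12000);
```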