Get the contents from a URL - Which settings should I use?

Preben · September 15, 2023, 9:14am

So, I was just made aware of the “Send HTTP request” piece, and it can read a URL and get the content using the GET method. Very well.

But I struggle.

It gets the entire HTML of the page, but I only want the content without the CSS, JavaScript, head, etc.

How can I tell this piece to clean up the content so I only have the text content?

Or use another piece to do just that?

I want to use this in OpenAI for further work, but for now I can’t because the whole code from the URL is far too big. I would like to create a summary, get the headlines, etc.

ashrafsam · September 15, 2023, 3:10pm

You can parse the content with a Code step. @abuaboud and @thisthatjosh have worked on something like this before; maybe they can share the code or a sample flow here.

I also think we need a piece to do HTML parsing, and I think I’ll start advocating for it soon within the team. Many use cases need this (especially monitors).

ashrafsam · September 15, 2023, 3:11pm

Until someone else replies, can you post a sample URL that you are trying to read and parse? That will give better context on what needs to be done.

GunnerJnr · September 15, 2023, 3:42pm

Hi @Preben,

You have 2 options here; both involve utilising the Code piece!

The first is to install an npm package such as cheerio (npm install cheerio). You can then use it to extract a specific section from the page data returned by the GET request.

For Example:

If the web page’s main content sits in a div with the id main-content, for example, then you could do something like the following:

export const code = async (params) => {
    // Import the 'cheerio' library
    const cheerio = require('cheerio');

    // Loading the HTML data fetched (assuming it is stored in params.htmlString in your Key:Value) into Cheerio to facilitate the parsing and manipulation of the HTML structure.
    // The '$' is a reference to the cheerio instance, loaded with the HTML data, ready to query the DOM elements just like in jQuery.
    const $ = cheerio.load(params.htmlString);

    // Selecting the element with the ID 'main-content' from the loaded HTML data using the '$' function, which is similar to the jQuery selector function.
    // The '.html()' function is then called to get the inner HTML of the selected element as a string.
    const mainContent = $('#main-content').html();

    // Returning the inner HTML of the 'main-content' section as a string.
    // If the '#main-content' element is not found in the HTML data, 'null' will be returned.
    return mainContent;
};

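Or, if you prefer plain JavaScript, you MIGHT be able to use the DOM API to extract the content. DOMParser is a browser API and is not built into Node.js, so this depends on what the Code piece’s runtime provides. It would look something like the following: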

export const code = async (params) => {
    // We start by creating a new DOMParser object.
    const parser = new DOMParser();

    // Next, we use our DOMParser tool to read the HTML data (which is stored in params.htmlString (Your Key:Value pair) and turn it into a format (a Document object) that allows us to easily find and work with different parts of the webpage.
    const doc = parser.parseFromString(params.htmlString, 'text/html');

    // Now, we ask our Document object (which represents the webpage) to find the section that has an ID of 'main-content'.
    const mainContent = doc.querySelector('#main-content');

    // Here, we check if we successfully found the 'main-content' section. If we find it, we take all the content from that section. If not, we say 'null', which means we didn't find anything.
    const mainContentHTML = mainContent ? mainContent.innerHTML : null;

    // Finally, we give back the content we found (or 'null' if we didn't find anything) so it can be used later.
    return mainContentHTML;
};

I haven’t tested this, so you might need to fiddle around with it, and obviously you will need to change the selector to the tag, class or id that you want to target. I did something similar to extract the meta description from the page:

export const code = async (params) => {
    // Create a new DOMParser object, a tool that helps us to read and understand the structure of a webpage.
    const parser = new DOMParser();

    // Use the DOMParser to read the HTML data (stored in params.htmlString) and convert it.
    const doc = parser.parseFromString(params.htmlString, 'text/html');

    // Find the meta element with the name attribute set to "description" to get the meta description.
    const metaDescriptionElement = doc.querySelector('meta[name="description"]');

    // Get the content attribute of the meta description element to obtain the actual description text. If the element doesn't exist, we return null, indicating that no description was found.
    const metaDescription = metaDescriptionElement ? metaDescriptionElement.getAttribute('content') : null;

    // Return the meta description text (or null if no description was found) so it can be used later.
    return metaDescription;
};
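
If the DOM API is not available in the Code step, a dependency-free fallback is to strip the markup with regular expressions. This is only a rough sketch: htmlToText is a hypothetical helper name, the HTML is assumed to arrive as a string (params.htmlString in the snippets above), and regex-based stripping is much cruder than a real parser such as cheerio, so it can misfire on unusual markup.

```javascript
// Hypothetical helper (not part of Activepieces or cheerio): crude HTML-to-text
// conversion using only built-in string methods.
function htmlToText(html) {
  return html
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, ' ')
    // Drop whole <head>, <script>, <style> and <noscript> blocks, contents included.
    .replace(/<(head|script|style|noscript)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Drop every remaining tag but keep the text between the tags.
    .replace(/<[^>]+>/g, ' ')
    // Decode a few common entities; a real parser handles many more.
    .replace(/&nbsp;/gi, ' ')
    .replace(/&lt;/gi, '<')
    .replace(/&gt;/gi, '>')
    .replace(/&amp;/gi, '&')
    // Collapse runs of whitespace into single spaces.
    .replace(/\s+/g, ' ')
    .trim();
}

// Example input; in a flow this would be the body returned by the HTTP step.
const sample =
  '<html><head><title>T</title></head><body><h1>Hello</h1>' +
  '<script>var x = 1;</script><p>World &amp; more</p></body></html>';
console.log(htmlToText(sample)); // prints "Hello World & more"
```

Inside the Code step, the exported code function would simply return htmlToText(params.htmlString).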

Hope this helps.

Kind Regards

thisthatjosh · September 15, 2023, 6:24pm

This was the code @abuaboud


```
// cheerio must be added as a dependency of the Code step
import * as cheerio from 'cheerio';

export const code = async (inputs: any) => {
    try {
        // Use fetch to make a request to the website
        const response = await fetch(inputs.url);

        // Check if the request was successful
        if (!response.ok) {
            throw new Error('Failed to fetch the website');
        }

        // Convert the response to text
        const html = await response.text();

        // Load the HTML into Cheerio
        const $ = cheerio.load(html);

        // Remove images from the HTML
        $('img').remove();

        // Remove script tags from the HTML
        $('script').remove();

        // Remove style tags from the HTML
        $('style').remove();

        // Remove inline style attributes from elements
        $('[style]').removeAttr('style');

        // Get the text content without images, HTML tags, JavaScript, and CSS
        const textContent = $('body').text();

        // The output is the text content without images, HTML tags, JavaScript, CSS
        return textContent;
    } catch (error) {
        // Handle any errors that might occur during the fetch or processing
        console.error('Error:', error instanceof Error ? error.message : error);
        return null; // Or return an appropriate error response
    }
};
```
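
For the follow-up steps Preben mentions (getting the headlines and keeping the input small enough for OpenAI), something like the two helpers below might work alongside the code above. Both names, extractHeadings and truncateAtWord, are hypothetical. The heading regex is a rough stand-in for a proper cheerio query such as $('h1, h2, h3'), and it assumes headings are not nested inside each other.

```javascript
// Hypothetical helper: collect the text of <h1> to <h3> elements with a regex.
// Assumption: headings are not nested inside each other.
function extractHeadings(html) {
  const headings = [];
  const pattern = /<h([1-3])\b[^>]*>([\s\S]*?)<\/h\1\s*>/gi;
  for (const match of html.matchAll(pattern)) {
    // Strip inline tags such as <a> or <span> and tidy the whitespace.
    const text = match[2].replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
    if (text) headings.push(text);
  }
  return headings;
}

// Hypothetical helper: cut text to at most maxChars characters, backing up to
// the last space so that a word is not split in the middle.
function truncateAtWord(text, maxChars) {
  if (text.length <= maxChars) return text;
  const cut = text.slice(0, maxChars);
  // If the cut already falls on a word boundary, keep it as it is.
  if (text[maxChars] === ' ') return cut.trimEnd();
  const lastSpace = cut.lastIndexOf(' ');
  return lastSpace > 0 ? cut.slice(0, lastSpace).trimEnd() : cut;
}

console.log(extractHeadings('<h1>Main <a href="#">Title</a></h1><p>x</p><h2>Sub</h2>'));
// prints [ 'Main Title', 'Sub' ]
console.log(truncateAtWord('The quick brown fox', 12)); // prints "The quick"
```

The character limit to pass as maxChars depends on the model and prompt being used, so treat any specific number as a placeholder to tune.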

ashrafsam · September 15, 2023, 6:32pm

I made a change to your post @thisthatjosh to wrap the code with ``` to render it as a whole code block.

Thanks for sharing!

Preben · September 16, 2023, 6:45am

@thisthatjosh and @ashrafsam, sorry for being a noob at ActivePieces and for my confusion.

Exactly where do I put this code? Which piece should I use for this?

You mention using Cheerio, which I can’t find, and I don’t understand how to install it either.

@GunnerJnr , I am using the hosted version of AP, so I think I can’t install things here? Am I right?

Thank you all for helping out!

GunnerJnr · September 16, 2023, 7:02am

Hi @Preben,

Just expand the code window, and you’ll be greeted with more options as can be seen below.

[Screenshot: codepiece-min]

Kind Regards
