
For my website, I'd like to add a new feature.

I would like users to be able to upload their bookmarks backup file (from any browser, if possible) so I can import it into their profile and they don't have to enter every bookmark manually.

The only part I'm missing is extracting the title and URL from the uploaded file. Can anyone give me a clue where to start or what to read?

I used the search option, and "How to extract data from a raw HTML file?" is the question most closely related to mine, but it doesn't cover this.

I really don't mind whether it uses jQuery or PHP.

Thank you very much.



Thank you everyone, I GOT IT!

The final code:

$html = file_get_contents('bookmarks.html');
//Create a new DOM document
$dom = new DOMDocument;

//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);

//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

//Iterate over the extracted links and display their titles and URLs
foreach ($links as $link){
    //Show the anchor text, then the "href" attribute.
    echo $link->nodeValue, ' ';
    echo $link->getAttribute('href'), '<br>';
}

This prints the anchor text and the href of every link in an .html file.
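For storing the bookmarks in a profile rather than printing them, a minimal sketch along the same lines might collect title/URL pairs into an array. It assumes the export follows the Netscape bookmark layout that browser HTML exports generally use, as far as I know. parseBookmarks() and the inline $sample are hypothetical names made up for this illustration, and libxml_use_internal_errors() is swapped in for the @ operator.

```php
<?php
// parseBookmarks() is a hypothetical helper, not part of PHP or the code above.
function parseBookmarks(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects an empty string in PHP 8
    }

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bookmarks = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $url = trim($link->getAttribute('href'));
        if ($url === '') {
            continue; // skip anchors without a target
        }
        $bookmarks[] = [
            'title' => trim($link->textContent),
            'url'   => $url,
        ];
    }
    return $bookmarks;
}

// Tiny made-up sample in the Netscape bookmark layout, for illustration only.
$sample = <<<HTML
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><A HREF="https://www.php.net/" ADD_DATE="1667347200">PHP</A>
    <DT><A HREF="https://example.com/docs">Example docs</A>
</DL><p>
HTML;

$bookmarks = parseBookmarks($sample);
foreach ($bookmarks as $bookmark) {
    echo $bookmark['title'], ' => ', $bookmark['url'], PHP_EOL;
}
```

Each entry can then be validated and inserted into whatever storage the profile uses. Remember to escape the title and URL before echoing them back into a page.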

Again, thanks a lot.

Wednesday, November 2, 2022

I use phpQuery. Are you familiar with jQuery? They share the same syntax. You might be concerned about installing a new library, but trust me, this library is well worth the extra overhead.


You can then access it like this:

foreach($doc->find('p') as $element){
   $element = pq($element);
   echo str_word_count($element->text());
}

Monday, October 17, 2022

First of all, regex and HTML don't mix. Use:

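$dom = new DOMDocument; @$dom->loadHTML($source); // loadHTML() is an instance method; calling it statically is an error in PHP 8
foreach($dom->getElementsByTagName('a') as $a)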

Links that may go outside your site start with a protocol or //, i.e.

href="" is a link to a local file.
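As a rough sketch of that rule, the check below treats an href as possibly leaving the site when it carries a scheme or a host. mayLeaveSite() is a hypothetical helper name, and it leans on PHP's parse_url(), which is my choice here rather than anything prescribed above.

```php
<?php
// mayLeaveSite() is a hypothetical helper written for this illustration.
function mayLeaveSite(string $href): bool
{
    $parts = parse_url(trim($href));
    if ($parts === false) {
        return false; // seriously malformed URL: treated as local in this sketch
    }
    // A scheme ("https:") or a host ("//example.com/...") means the link
    // is not a plain path relative to the current site.
    return isset($parts['scheme']) || isset($parts['host']);
}

var_dump(mayLeaveSite('https://example.com/')); // has a protocol
var_dump(mayLeaveSite('//example.com/a'));      // protocol-relative
var_dump(mayLeaveSite('example.com'));          // bare name: a local path
var_dump(mayLeaveSite('/about.html'));          // local path
```

A link that passes this check may still point back at your own host, so compare the host with your site's if you need to tell the two apart.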

But if you want to create static copy of a site, why not just use wget?

Monday, December 5, 2022

Flat Loop Example:

  1. You initiate the loop with a stack that contains all URLs you'd like to process first.
  2. Inside the loop:
    1. You shift the first URL (you obtain it and it's removed) from the stack.
    2. If you find new URLs, you add them at the end of the stack (push).

This will run until all URLs from the stack are processed, so you add (as you have somehow already for the foreach) a counter to prevent this from running for too long:

$URLStack = (array) $parent_Url_Html->getHTML()->find('a');
$URLProcessedCount = 0;
while ($URLProcessedCount++ < 500) { # this could run endlessly, so the counter saves us from processing too many URLs
    $url = array_shift($URLStack);
    if (!$url) break; # exit if the stack is empty

    # process URL

    # for each new URL:
    $URLStack[] = $newURL;
}

You can make it even more intelligent by not adding URLs to the stack that already exist in it; however, you then need to insert only absolute URLs into the stack. I highly suggest that you do that, because there is no need to process a page you've already obtained again (e.g. each page probably contains a link to the homepage). If you want to do this, just increment $URLProcessedCount inside the loop so you keep the previous entries as well:

while ($URLProcessedCount < 500 && isset($URLStack[$URLProcessedCount])) { # the counter caps the run; isset() stops at the end of the stack
    $url = $URLStack[$URLProcessedCount++];
    # process the URL and push new URLs as before
}
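Putting the two ideas together (a flat loop plus skipping URLs that are already queued), a self-contained sketch could look like this. The $site array is a made-up stand-in for real page fetching, and crawlFlat() and $seen are hypothetical names. All URLs are assumed to be absolute already.

```php
<?php
// Hypothetical in-memory "site": each absolute URL maps to the URLs it links to.
$site = [
    'https://example.com/'  => ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a' => ['https://example.com/', 'https://example.com/c'],
    'https://example.com/b' => ['https://example.com/a'],
    'https://example.com/c' => [],
];

// crawlFlat() is a hypothetical helper; it returns the URLs in processing order.
function crawlFlat(array $site, string $start, int $limit = 500): array
{
    $queue = [$start];         // URLs in discovery order; entries are never removed
    $seen  = [$start => true]; // lookup table of URLs that were already queued
    $processed = 0;

    while ($processed < $limit && isset($queue[$processed])) {
        $url = $queue[$processed++];

        // "Process" the page: here that only means looking up its links.
        foreach ($site[$url] ?? [] as $newUrl) {
            if (!isset($seen[$newUrl])) {
                $seen[$newUrl] = true;
                $queue[] = $newUrl;
            }
        }
    }
    return array_slice($queue, 0, $processed);
}

$visited = crawlFlat($site, 'https://example.com/');
print_r($visited);
```

Keeping $seen as an array keyed by URL makes the duplicate check a cheap isset() instead of a scan over the whole queue with in_array().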

Additionally I suggest you use the PHP DOMDocument extension instead of simple dom as it's a much more versatile tool.

Wednesday, August 10, 2022

Try the Beautiful Soup library for Python. It has very simple methods to extract information from an html file.

Trying to generically extract data from webpages would require people to write their pages in a similar way... but there's an almost infinite number of ways to build a page that looks identical, let alone all the combinations you can use to convey the same information.

Was there a particular type of information you were trying to extract or some other end goal?

You could try extracting any content in 'div' and 'p' markers and compare the relative sizes of all the information in the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well formed html!).
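To make the size comparison concrete, here is a small sketch in PHP (the language used elsewhere in this thread) that counts the words under every 'div' and 'p' and sorts them, largest first. textSizes() and the $page sample are hypothetical, made up for this illustration. Note that a parent's count includes the text of its children.

```php
<?php
// textSizes() is a hypothetical helper: word count per <div>/<p>, largest first.
function textSizes(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sizes = [];
    foreach (['div', 'p'] as $tag) {
        foreach ($dom->getElementsByTagName($tag) as $node) {
            // getNodePath() gives a unique XPath-style key for each element.
            $sizes[$node->getNodePath()] = str_word_count($node->textContent);
        }
    }
    arsort($sizes); // sort by word count, descending, keeping the keys
    return $sizes;
}

// Made-up page with a small navigation block and a larger content block.
$page = <<<HTML
<html><body>
<div id="nav">
  <p>Home About</p>
</div>
<div id="content">
  <p>one two three four five</p>
  <p>six seven eight</p>
</div>
</body></html>
HTML;

$sizes = textSizes($page);
print_r($sizes);
```

On this sample the content 'div' comes out on top, which also shows the grouping problem: the container always outweighs the paragraphs inside it.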

Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text), you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information?

[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a points system similar to SpamAssassin's. Define some rules that attempt to classify the information. Some examples:

+1 points for every 100 words
+1 points for every child element that has > 100 words
-1 points if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'

If you have lots of low-scoring rules which add up when you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.
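A minimal sketch of such a points system, using the four example rules as the weights. scoreSection() is a hypothetical helper, written in PHP to match the rest of this thread, and "section name" is assumed here to mean something like the element's id or class.

```php
<?php
// scoreSection() is a hypothetical helper; the weights mirror the example rules.
// $name is assumed to be the section's id or class, $childTexts the text of
// each direct child element.
function scoreSection(string $name, string $text, array $childTexts = []): int
{
    // +1 point for every 100 words in the section
    $score = intdiv(str_word_count($text), 100);

    // +1 point for every child element that has more than 100 words
    foreach ($childTexts as $childText) {
        if (str_word_count($childText) > 100) {
            $score += 1;
        }
    }

    // -1 point if the section name contains 'nav'
    if (stripos($name, 'nav') !== false) {
        $score -= 1;
    }
    // -2 points if the section name contains 'advert'
    if (stripos($name, 'advert') !== false) {
        $score -= 2;
    }
    return $score;
}

$article = str_repeat('word ', 250);
$child   = str_repeat('word ', 150);

echo scoreSection('content', $article, [$child]), PHP_EOL;
echo scoreSection('main-nav', 'Home About Contact'), PHP_EOL;
echo scoreSection('advert-banner', 'Buy now'), PHP_EOL;
```

New rules are just more if statements, so the rule set can grow without changing how the scores are compared.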

[EDIT2] Looking at Readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to try and understand tables better?

Thursday, August 11, 2022