• Download from website

    #504352

    A client of mine uses a government website that enables searches by individual last name. The search can take just the first letter of the last name and will return data for every person with a last name beginning with that first letter. There is also an option on the site to download the results to an Excel file.

    Is there a way to automate this process so that Excel can go to the site, run the 26 searches one by one, and automatically select the download-to-Excel option, or is this a 26-search-and-download manual labor of love?

    TIA

    • #1550476

      Kevin,

      Is it an open website? If so, how about a link so we can test? :cheers:

      May the Forces of good computing be with you!

      RG

      PowerShell & VBA Rule!
      Computer Specs

    • #1550513

      Kevin,

      Sorry, I can’t see a way to automate this, since the site only shows the first 25 results on screen, so a standard Excel web-page query won’t work. As a further note, the search only returns the first 500 records.

      [attachment: 43511-Kevin1]

      So even if you could automate pressing the web export button somehow, how would you know whether you got all the results? BTW: I tested the export to Excel and only got 500 records, not to mention a warning message in Excel requiring user action.

      [attachment: 43512-Kevin2]

      With my skill set this looks like a strictly manual wash-rinse-repeat operation. Maybe some of the web gurus will have some insight.

      HTH :cheers:

      May the Forces of good computing be with you!

      RG

      PowerShell & VBA Rule!
      Computer Specs

    • #1550691

      KW,

      I took a look at the source code for the site. Initiating the search is fairly simple, as the form elements are easily identifiable. The resulting “results” page would require extracting the URL so the code could identify the new page and its content. That page’s source is primarily a table with a slew of hyperlinks, one of which downloads the Excel file. If a search produced only one page of results, this could be doable.
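      If a search did fit on one page, picking the Excel-export hyperlink out of the results table is the easy part. Here is a minimal Python sketch (the HTML snippet and link text are purely illustrative, not copied from the real site) of that step:

      ```python
      from html.parser import HTMLParser

      class ExportLinkFinder(HTMLParser):
          """Collects hrefs of anchors whose link text mentions an Excel export."""
          def __init__(self):
              super().__init__()
              self._current_href = None
              self.export_links = []

          def handle_starttag(self, tag, attrs):
              if tag == "a":
                  self._current_href = dict(attrs).get("href")

          def handle_data(self, data):
              # Only look at text that sits inside an open <a> tag.
              if self._current_href and "excel" in data.lower():
                  self.export_links.append(self._current_href)

          def handle_endtag(self, tag):
              if tag == "a":
                  self._current_href = None

      # Hypothetical fragment of a results page: a table with a slew of hyperlinks.
      results_html = """
      <table>
        <tr><td><a href="/case/123">Case 123</a></td></tr>
        <tr><td><a href="/export?fmt=xls">Download to Excel</a></td></tr>
      </table>
      """
      finder = ExportLinkFinder()
      finder.feed(results_html)
      print(finder.export_links)  # the export hyperlink(s) found in the table
      ```

      The same idea carries over to VBA with the MSHTML document object model; the hard part, as discussed below, is not finding the link but the result caps.
      
      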

      So I was trying to think of a way to achieve that. My initial thought was to partition the searches by systematically sending letter combinations to the last-name search box, e.g. Aa, Ab, Ac, etc. Every two-letter combination I tried returned more than 25 hits, i.e. more than one results page; in fact each was over 500. So I tried three letters: Aaa produced over 500 by itself, and so did Abb. Going deeper and cycling through all four-letter combinations of the alphabet would be overwhelming, and my guess is that even then many searches would still top out over 500. This is an enormous database, no doubt with Oracle running on the server. Capturing all the data would easily exceed the 1,048,576-row capacity of Excel.
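      The combinatorial explosion behind that drill-down is easy to see in a few lines of Python (just counting prefixes, nothing here touches the website):

      ```python
      from itertools import product
      from string import ascii_uppercase

      def prefixes(depth):
          """All last-name search prefixes of a given length: A..Z, Aa..Zz, etc."""
          return ["".join(p).capitalize() for p in product(ascii_uppercase, repeat=depth)]

      for depth in range(1, 5):
          print(depth, len(prefixes(depth)))
      # 1 -> 26, 2 -> 676, 3 -> 17576, 4 -> 456976 searches: the combinatorial
      # explosion that makes drilling down by prefix alone impractical.
      ```

      And since many individual prefixes still exceed the 500-record cap, no fixed prefix depth is guaranteed to work anyway.
      
      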

      I think this could have been a feasible venture if the database was not so extensive. I agree with RG, “this looks like a strictly user wash rinse repeat operation.”

      Maud

      • #1551742

        “Historical records, i.e. for past years and months, shouldn’t change. So it doesn’t need to be processed again.”

        Right, my first thought was a sufficiently restrictive date filter [even down to one day, whatever makes it a doable repetitive job], assuming that it’s only newly added info which needs to be downloaded.

        I wonder whether web-scraping software would help? I plan to try that whenever I get time, and IIRC the software below handles the multi-page results problem. Here are the results of some research I did 3 to 6 months ago:

        The features I’m looking for:
        Desktop software, one-off purchase [+ maybe later upgrades];
        Wizard interface for setting up the queries;
        Run multiple queries in one session;
        Results to Access or Excel;
        Auto scheduling.

        Software for Web Scraping has a nice list of scrapers. Here are the ones catching my eye:

        Easy Web Extract $60
        This has all the features above.

        OutWit Hub $81
        This has all the features above.

        WebHarvy $99

        Excel – Get external data from a Web page

        Good MS article about how to setup web queries and bring the results into Excel.

        WEBSERVICE() function

        Need to check this new function out. These functions, new in Excel 2013, let you retrieve data from a web service directly into a worksheet: ENCODEURL(), WEBSERVICE() and FILTERXML().
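        For anyone experimenting outside Excel, the same WEBSERVICE + FILTERXML pattern can be mimicked in Python: fetch an XML response, then pull values out with an XPath-style query. The URL and XML below are purely illustrative, and the fetch is stubbed with a canned string so the sketch runs offline:

        ```python
        import xml.etree.ElementTree as ET

        # In Excel 2013+ this would be something like:
        #   =FILTERXML(WEBSERVICE("https://example.com/api?q=smith"), "//result/name")
        # The urllib call would play the WEBSERVICE role; it is commented out so
        # the sketch runs offline against a canned response instead.
        # from urllib.request import urlopen
        # xml_text = urlopen("https://example.com/api?q=smith").read().decode()

        xml_text = "<results><result><name>Smith, Jane</name></result></results>"

        root = ET.fromstring(xml_text)                             # FILTERXML's parsing step
        names = [e.text for e in root.findall(".//result/name")]   # the XPath query
        print(names)
        ```

        Of course this only helps if the site actually exposes an XML/JSON endpoint; a plain HTML results page would still need scraping.
        
        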

        Lugh.
        ~
        Alienware Aurora R6; Win10 Home x64 1803; Office 365 x32
        i7-7700; GeForce GTX 1060; 16GB DDR4 2400; 1TB SSD, 256GB SSD, 4TB HD

    • #1550706

      RG & Maud, thanks for all your effort examining this. Your additional analysis and conclusion, Maud, are quite shocking regarding the size of this database. The client will have to staff up to wash, rinse, etc. Maybe add bleach.

      • #1551550

        Due to the large database size, this has to be an all-or-nothing proposition. The following suggestion is subject to two huge assumptions:

        1. The web site will allow thousands of searches from the same IP in a short period of time
        2. The client can find a developer to properly implement the scheme

        Maudibe had the right idea: partitioning the data. I chose to focus on the other filters, since the initial letter really can’t be extended. After refining each filter, I checked it against the letter A. Each still returned over 500 results until I got to the last filter: time range. Arbitrarily, I chose 01/01/2006 to 01/31/2006.

        Finally! 3 results!

        [attachment: 43573-lounge-md-search]

        Upon further reflection, this is a good range, as the number returned is less than a page. (If the developer can use sockets, this will be irrelevant.)

        At the worst, nearly 20,000 searches have to be done for each time range. Here is a summary:

        Partitioning Scheme:

        26 Initial: Must use only one letter, due to single letter entities like A & L
        24 County: Maryland has 23 counties and one independent city
        4 Case Type: Civil, Criminal, Traffic, Civil Citation
        2 Court Type: Circuit, District
        4 Party Type: Attorney, Defendant, Plaintiff, Other
        ?? Time Range: 1 month

        So for each time range, run 26 * 24 * 4 * 2 * 4 = 19,968 searches.

        Optimizations are possible. The OP didn’t mention whether the client is interested in the entire database. For example, if only one county is required, the searches per time range would drop to 832!
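        The arithmetic behind the partitioning scheme is easy to sanity-check (the dimension counts are the ones listed above; nothing here touches the website):

        ```python
        # Filter dimensions from the partitioning scheme above.
        initials, counties, case_types, court_types, party_types = 26, 24, 4, 2, 4

        searches_per_range = initials * counties * case_types * court_types * party_types
        print(searches_per_range)   # 19968 searches per one-month time range

        # Optimization: if the client only needs one county, the count drops 24-fold.
        one_county = searches_per_range // counties
        print(one_county)           # 832
        ```
        
        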

    • #1551558

      Hi

      This is a good approach. I too managed to filter the results down to a manageable number by using a monthly calendar range.
      On a positive note, historical records, i.e. for past years and months, shouldn’t change, so they don’t need to be processed again once you have fetched the prior-year data.
      Perhaps a VBA loop over the monthly date ranges, run overnight, might be a possibility.
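      Generating those monthly date windows is the easy half of that loop. A sketch in Python for brevity (the same logic is straightforward in VBA with DateSerial):

      ```python
      import calendar
      from datetime import date

      def month_ranges(start_year, end_year):
          """Yield (first_day, last_day) pairs for every month in the span."""
          for year in range(start_year, end_year + 1):
              for month in range(1, 13):
                  last = calendar.monthrange(year, month)[1]  # days in this month
                  yield date(year, month, 1), date(year, month, last)

      ranges = list(month_ranges(2006, 2006))
      print(ranges[0])    # (datetime.date(2006, 1, 1), datetime.date(2006, 1, 31))
      print(len(ranges))  # 12 monthly search windows for one year
      ```

      Each (first_day, last_day) pair would then be fed into the site’s date-range filter as one search window.
      
      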

      zeddy

    • #1551759

      Thinking a little laterally, have you contacted the Maryland Judiciary to see whether, for instance:
      1. They can let you have a copy of the database (or a subset), which is, after all, in the public domain?
      2. They can give you another form of access?
