Can I use multithreading and parallel programming for web scraping?

Time:10-06

I am having a hard time understanding multithreading and parallel programming. I have a small application (a scraper) written in C# .NET with Selenium. I have a file that contains business addresses. The scraper looks up each company's name and website, then does a second scrape of the company's site for a generic email address.

Here is the issue. Doing this manually would take me about 3 years for 50,000 records (I did the math, lol), which is why I wrote the scraper. As a normal console application it took 5 to 6 days to complete. So I thought multithreading and parallel programming might reduce the time.

So I ran a small test. One record took 10 seconds to finish, and 10 records took 100 seconds. Why did the multithreaded version take just as long as processing the records one after another?

I am not sure whether my expectations and understanding of multithreading are wrong. I thought that Parallel.ForEach would launch all ten records at once and finish in about 10 seconds, saving me 90 seconds. Is this a correct assumption? Can someone please clarify how multithreading and parallel programming actually work?

private static List<GoogleList> MultiTreadMain(List<FileStructure> values)
{
    List<GoogleList> ListGInfo = new List<GoogleList>();
    var threads = new List<Thread>();
    Parallel.ForEach(values, value =>
    {
        if (value.ID <= 10)
        {
            List<GoogleList> SingleListGInfo = new List<GoogleList>();
            var threadDesc = new Thread(() =>
            {
                lock (lockObjDec)
                {
                    SingleListGInfo = LoadBrowser("https://www.google.com", value.Address, value.City, value.State,
                                                  value.FirstName, value.LastName,
                                                  "USA", value.ZipCode, value.ID);
                    SingleListGInfo.ForEach(p => ListGInfo.Add(p));
                }
            });
            threadDesc.Name = value.ID.ToString();
            threadDesc.Start();
            threads.Add(threadDesc);
        }
    });

    while (threads.Count > 0)
    {
        for (var x = (threads.Count - 1); x > -1; x--)
        {
            if (((Thread)threads[x]).ThreadState == System.Threading.ThreadState.Stopped)
            {
                ((Thread)threads[x]).Abort();
                threads.RemoveAt(x);
            }
        }
        Thread.Sleep(1);
    }

    return ListGInfo;
}

CodePudding user response:

This is probably not the answer to the specific problem you are facing, but it might be a hint to the general question "why isn't multithreading faster?". Let's say that Selenium has a public class EdgeDriver that is implemented like this:

public class EdgeDriver
{
    private static object _locker = new();

    public void GoToUrl(string url)
    {
        lock (_locker)
        {
            GoToUrlInternal(url);
        }
    }

    internal void GoToUrlInternal(string url) //...
}

You, as a consumer of the class, cannot see the private _locker field or the internal methods. These are implementation details, hidden from you, and the only way to know what this class is doing is by reading the documentation. So if the implementation looks like the contrived example above, any attempt to speed up your program by creating multiple EdgeDriver instances and invoking their GoToUrl method in a Parallel.ForEach loop will be for naught. The lock on a static object ensures that only one thread at a time is allowed to invoke GoToUrlInternal, and all the other threads have to wait their turn. In that case the calls are said to be serialized. And that's just one of the many possible reasons why multithreading may not be faster than code running on a single thread.
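The effect of a shared lock can be reproduced without Selenium. The following is a minimal sketch, not anything taken from Selenium itself: SerializedCallsDemo, SlowCall and LockedCall are hypothetical names, and Thread.Sleep(200) merely stands in for a slow browser call. It runs the same slow call on several dedicated threads, once behind a static lock and once without it, and measures the elapsed time.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

public static class SerializedCallsDemo
{
    // Shared by every caller, like the static _locker in the contrived EdgeDriver.
    private static readonly object Locker = new object();

    // Hypothetical stand-in for a slow browser call.
    private static void SlowCall() => Thread.Sleep(200);

    // Same call, but only one thread at a time may execute it.
    private static void LockedCall()
    {
        lock (Locker)
        {
            SlowCall();
        }
    }

    // Runs `count` calls, each on its own thread, and returns elapsed milliseconds.
    public static long Measure(Action call, int count)
    {
        var stopwatch = Stopwatch.StartNew();
        var threads = Enumerable.Range(0, count)
            .Select(_ => new Thread(() => call()))
            .ToList();
        threads.ForEach(t => t.Start());
        threads.ForEach(t => t.Join()); // wait for every thread to finish
        return stopwatch.ElapsedMilliseconds;
    }

    public static long MeasureLocked(int count) => Measure(LockedCall, count);

    public static long MeasureUnlocked(int count) => Measure(SlowCall, count);
}
```

With five calls, MeasureLocked(5) should come out at roughly five times the duration of a single call, because the threads queue up on the lock, while MeasureUnlocked(5) should stay close to the duration of one call. The lock (lockObjDec) in the question's code wraps the whole LoadBrowser call in the same way, which is consistent with 10 records taking 100 seconds.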

CodePudding user response:

I hope the code snippet below gives you some direction. It divides the work among the records in the List of FileStructure, one task per record. Based on the problem statement, I don't think a lock is necessary here.

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

private static List<GoogleList> MultiTreadMain(List<FileStructure> values)
{
    // Same filter as in the question: only the first ten records.
    var toBeScraped = values.Where(p => p.ID <= 10);

    // Start one task per record. A plain sequential projection is enough here:
    // creating the tasks inside Parallel.ForEach adds nothing, and
    // List<T>.Add is not safe to call from several threads at once.
    List<Task<List<GoogleList>>> tasks = toBeScraped
        .Select(value => Task.Run(() => ProcessRequestAsync(value)))
        .ToList();

    // Block until every task has finished; the array holds one result list per task.
    List<GoogleList>[] results = Task.WhenAll(tasks).GetAwaiter().GetResult();

    List<GoogleList> ListGInfo = new List<GoogleList>();
    foreach (List<GoogleList> item in results)
    {
        ListGInfo.AddRange(item);
    }

    return ListGInfo;
}

// Despite the name, this method is synchronous; it runs on the task's thread.
public static List<GoogleList> ProcessRequestAsync(FileStructure value)
{
    // Each call returns its own list, so there is no shared state and no lock.
    return LoadBrowser("https://www.google.com", value.Address, value.City, value.State,
                       value.FirstName, value.LastName,
                       "USA", value.ZipCode, value.ID);
}
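Since every browser instance costs memory and CPU, it may be worth capping how many records are processed at once rather than starting everything together. The following is a rough sketch of that idea using Parallel.ForEach with ParallelOptions.MaxDegreeOfParallelism and a ConcurrentBag for the results. BoundedScraper, Scrape and ScrapeAll are hypothetical names, and Scrape only simulates the work that LoadBrowser would do; the right cap for a real scraper is something to find by measuring.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedScraper
{
    // Hypothetical stand-in for LoadBrowser: pretends to scrape one record.
    private static List<string> Scrape(int id)
    {
        Thread.Sleep(100); // simulated page load
        return new List<string> { "result-" + id };
    }

    // Processes all ids, running at most `maxParallel` of them at the same time.
    public static List<string> ScrapeAll(IEnumerable<int> ids, int maxParallel)
    {
        // ConcurrentBag allows Add from several threads without an explicit lock.
        var results = new ConcurrentBag<string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        // Parallel.ForEach returns only after every id has been processed,
        // so no manual thread bookkeeping is needed.
        Parallel.ForEach(ids, options, id =>
        {
            foreach (string item in Scrape(id))
            {
                results.Add(item);
            }
        });

        // A bag has no defined order, so sort before returning.
        return results.OrderBy(r => r).ToList();
    }
}
```

A call such as BoundedScraper.ScrapeAll(Enumerable.Range(1, 8), 4) would process eight simulated records, at most four at a time. This only helps if the work inside the loop body is not itself serialized by a lock, as discussed in the other answer.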