I have a column that contains IP addresses, and I need to resolve them to countries/cities:
select IPUtils('199.999.999.999')
and it returns ['Asia', 'Hongkong', 'xxx', 'Hongkong']
I wrote a Hive UDF to do this, but it runs extremely slowly, as shown below:
INFO : 2021-09-08 18:51:10,817 Stage-2 map = 100%, reduce = 30%, Cumulative CPU 9074.06 sec
The map phase is at 100%, while the reduce progress gains 1 percent every 15 minutes.
The UDF reads a file from the project's resource folder, so maybe it is reading the file again on every call? The UDF is shown below; any help is appreciated:
public class IPUtil extends UDF {
    public List<String> evaluate(String ip){
        try{
            ClassLoader classloader = Thread.currentThread().getContextClassLoader();
            // I put the mmdb file in resource folder of the java project
            InputStream is = classloader.getResourceAsStream("GeoLite2-City.mmdb");
            DatabaseReader reader = new DatabaseReader.Builder(is).build();
            InetAddress ipAddress = InetAddress.getByName(ip);
            CityResponse response = reader.city(ipAddress);
            Country country = response.getCountry();
            Subdivision subdivision = response.getMostSpecificSubdivision();
            City city = response.getCity();
            Continent continent = response.getContinent();
            List<String> list = new LinkedList<String>();
            list.add(continent.getNames().get("zh-CN"));
            list.add(country.getNames().get("zh-CN"));
            list.add(subdivision.getNames().get("zh-CN"));
            list.add(city.getNames().get("zh-CN"));
            return list;
        } catch (UnknownHostException e) {
            e.printStackTrace();
            return null;
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        } catch (GeoIp2Exception e) {
            e.printStackTrace();
            return null;
        }
    }

    @Test
    public void test()throws Exception{
        System.out.println(evaluate("175.45.20.138"));
    }
}
CodePudding user response:
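Move these two lines
InputStream is = classloader.getResourceAsStream("GeoLite2-City.mmdb");
DatabaseReader reader = new DatabaseReader.Builder(is).build();
out of evaluate() and into the class initialization, so the database is loaded once and reused instead of being rebuilt for every row.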
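A minimal sketch of the pattern, using only the standard library so it can be run on its own. `FakeReader` and `CachedIpLookup` are hypothetical names introduced for illustration: `FakeReader` stands in for `DatabaseReader`, and its constructor plays the role of the expensive `new DatabaseReader.Builder(is).build()` call. In the real UDF the same field would hold the actual `DatabaseReader`, and the class would still extend `UDF`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for DatabaseReader: constructing it is the expensive
// step, while a single lookup is cheap.
class FakeReader {
    static int builds = 0; // counts how many times the "database" was loaded

    FakeReader(InputStream is) throws IOException {
        while (is.read() != -1) {
            // simulate reading the whole .mmdb file
        }
        builds++;
    }

    List<String> city(String ip) {
        return Arrays.asList("continent", "country", "subdivision", "city for " + ip);
    }
}

// Hypothetical UDF-like class; the real one would extend Hive's UDF.
class CachedIpLookup {
    // Built on first use, then shared by every later evaluate() call.
    private static FakeReader reader;

    private static synchronized FakeReader getReader() throws IOException {
        if (reader == null) {
            // In the real UDF this stream would come from
            // classloader.getResourceAsStream("GeoLite2-City.mmdb").
            try (InputStream is = new ByteArrayInputStream(new byte[1024])) {
                reader = new FakeReader(is);
            }
        }
        return reader;
    }

    public List<String> evaluate(String ip) {
        if (ip == null) {
            return null;
        }
        try {
            return getReader().city(ip); // per-row work is only the lookup
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        CachedIpLookup udf = new CachedIpLookup();
        for (int i = 0; i < 1000; i++) {
            udf.evaluate("175.45.20.138");
        }
        // prints "database loads: 1"
        System.out.println("database loads: " + FakeReader.builds);
    }
}
```

With the original code the counter would reach 1000 here, one load per call. Lazy initialization is used rather than a static initializer block so that a failure to load the file surfaces inside evaluate(), where it can be handled, instead of breaking class loading.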