How to handle a dataset with many NA values?

Time:08-05

I'm working with a dataset that contains gas emissions from three different pools; each pool has three lines sown with rice. I also have environmental and multispectral information for each pool. The information was collected as follows:

  1. The gas data were collected twice per week after fertilization. Environmental data were recorded at the moment of each collection.

  2. We also fly the drone to collect the multispectral information, but this is done only once per week.

My problem is that I am unsure how to organize the information in order to analyze it. I am currently organizing it in the following way:

Example Dataset

|Line|DATE|Sample_num|PLOT|POINT_ID|CH4_d|CO2_d|N2O_d|Tmin|Tmax|Tmedia|NDVI_MEDIAN|NDVI_SUM|NDRE_MEDIAN|NDRE_SUM|
|----|----|----------|----|--------|-----|-----|-----|----|----|------|-----------|--------|-----------|--------|
|1|2022-05-13|M1|P1|C1|2.561|9217.329|0.280|18.7|32.7|22.2|0.852|52032.67|0.449|27418.52|
|2|2022-05-13|M1|P1|C2|2.248|3804.074|4.180|18.7|32.7|22.2|0.861|55855.312|0.457|29455.783|
|3|2022-05-13|M1|P1|C3|2.068|5488.684|0.715|18.7|32.7|22.2|0.836|50113.203|0.440|26383.299|
|4|2022-05-13|M1|P2|C4|1.050|7084.407|2.069|19|31.8|22.1|0.819|36988.91|0.406|18403.488|
|5|2022-05-13|M1|P2|C5|2.441|8352.177|-0.404|19|31.8|22.1|0.821|36792.707|0.423|19026.646|
|6|2022-05-13|M1|P2|C6|3.602|8767.157|0.254|19|31.8|22.1|0.801|33790.348|0.403|17069.98|
|7|2022-05-13|M1|P3|C7|1.65|7284.40|2.690|19.7|32.5|23|0.650|24589.014|0.345|9840.539|
|8|2022-05-13|M1|P3|C8|2.441|9456.177|-0.363|19.7|32.5|23|0.771|20992.707|0.332|9041.342|
|9|2022-05-13|M1|P3|C8|3.402|8521.321|0.254|19.7|32.5|23|0.764|21469.688|0.325|9215.912|
|1|2022-05-15|M2|P1|C1|1.300|11241.382|0.347|19.7|33|22.3|NA|NA|NA|NA|
|2|2022-05-15|M2|P1|C2|0.927|4511.147|3.067|19.7|33|22.3|NA|NA|NA|NA|
|3|2022-05-15|M2|P1|C3|1.977|7458.584|0.673|19.7|33|22.3|NA|NA|NA|NA|
|4|2022-05-15|M2|P2|C4|1.982|7300.527|0.930|19.8|28.6|21.95|NA|NA|NA|NA|
|5|2022-05-15|M2|P2|C5|0.794|7892.752|0.497|19.8|28.6|21.95|NA|NA|NA|NA|
|6|2022-05-15|M2|P2|C6|2.799|7815.358|0.351|19.8|28.6|21.95|NA|NA|NA|NA|
|7|2022-05-15|M2|P3|C7|1.982|7500.527|0.850|19|32|22|NA|NA|NA|NA|
|8|2022-05-15|M2|P3|C8|0.785|7524.452|0.455|19|32|22|NA|NA|NA|NA|
|9|2022-05-15|M2|P3|C9|2.556|8546.253|0.325|19|32|22|NA|NA|NA|NA|
|1|2022-05-20|M3|P1|C1|2.545|9586.231|0.280|18.7|32.7|22.2|0.852|52032.67|0.449|27418.52|
|2|2022-05-20|M3|P1|C2|3.456|9572.256|4.180|18.7|32.7|22.2|0.861|55855.312|0.457|29455.783|
|3|2022-05-20|M3|P1|C3|2.598|5664.321|0.715|18.7|32.7|22.2|0.836|50113.203|0.440|26383.299|
|4|2022-05-20|M3|P2|C4|4.265|8245.222|2.069|19|31.8|22.1|0.819|36988.91|0.406|18403.488|
|5|2022-05-20|M3|P2|C5|5.235|6587.321|-0.404|19|31.8|22.1|0.821|36792.707|0.423|19026.646|
|6|2022-05-20|M3|P2|C6|6.125|75214.321|0.254|19|31.8|22.1|0.801|33790.348|0.403|17069.98|
|7|2022-05-20|M3|P3|C7|1.654|6548.240|2.690|19.7|32.5|23|0.650|24589.014|0.345|9840.539|
|8|2022-05-20|M3|P3|C8|2.444|9587.486|-0.363|19.7|32.5|23|0.771|20992.707|0.332|9041.342|
|9|2022-05-20|M3|P3|C8|3.456|6312.321|0.254|19.7|32.5|23|0.764|21469.688|0.325|9215.912|
|1|2022-05-22|M4|P1|C1|1.300|11241.382|0.325|19.7|33|22.3|NA|NA|NA|NA|
|2|2022-05-22|M4|P1|C2|0.927|4511.147|3.245|19.7|33|22.3|NA|NA|NA|NA|
|3|2022-05-22|M4|P1|C3|1.977|7458.584|0.325|19.7|33|22.3|NA|NA|NA|NA|
|4|2022-05-22|M4|P2|C4|1.982|7300.527|0.965|19.8|28.6|21.95|NA|NA|NA|NA|
|5|2022-05-22|M4|P2|C5|0.794|7892.752|1.256|19.8|28.6|21.95|NA|NA|NA|NA|
|6|2022-05-22|M4|P2|C6|2.799|7815.358|2.325|19.8|28.6|21.95|NA|NA|NA|NA|
|7|2022-05-22|M4|P3|C7|1.982|7500.527|3.254|19|32|22|NA|NA|NA|NA|
|8|2022-05-22|M4|P3|C8|0.785|7524.452|1.255|19|32|22|NA|NA|NA|NA|
|9|2022-05-22|M4|P3|C9|2.556|8546.253|2.335|19|32|22|NA|NA|NA|NA|

Organized this way, the dataset appears to contain a lot of NA values, but the NA does not mean the information was lost. It arises because I take the multispectral images only once a week, while the gas samples are taken at least two or three times per week.

Now my question is: how should I compute the correlations in this dataset? I don't think removing the rows with NA is an option, because they are part of the experiment. The NA values also follow the same pattern throughout the dataset; the only exceptions are a few cases where there were problems with the drone, and those account for only about 3% to 5% of all the information.
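To make the question concrete, here is a minimal sketch in Python with pandas of two common ways to correlate a twice-weekly variable with a weekly one. Neither comes from the original post, and both are assumptions about what might suit this experiment. Option A correlates only the dates on which the drone flew. Option B carries the last drone value forward within each `POINT_ID`. The rows are copied from the example table (points C1 and C2 only); `NDVI_FILLED` is a hypothetical column name.

```python
import pandas as pd

# A few rows copied from the example table (plot P1, points C1 and C2).
df = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2022-05-13", "2022-05-13", "2022-05-15", "2022-05-15",
        "2022-05-20", "2022-05-20", "2022-05-22", "2022-05-22",
    ]),
    "POINT_ID": ["C1", "C2", "C1", "C2", "C1", "C2", "C1", "C2"],
    "CH4_d": [2.561, 2.248, 1.300, 0.927, 2.545, 3.456, 1.300, 0.927],
    "NDVI_MEDIAN": [0.852, 0.861, None, None, 0.852, 0.861, None, None],
})

# Option A: DataFrame.corr() drops NA pairwise, so only the rows where both
# CH4_d and NDVI_MEDIAN are present (the drone-flight dates) are used.
n_used = int(df[["CH4_d", "NDVI_MEDIAN"]].dropna().shape[0])
corr_flight_days = df[["CH4_d", "NDVI_MEDIAN"]].corr().loc["CH4_d", "NDVI_MEDIAN"]

# Option B (assumption: NDVI changes slowly between flights): carry the most
# recent drone value forward within each sampling point.
df = df.sort_values(["POINT_ID", "DATE"])
df["NDVI_FILLED"] = df.groupby("POINT_ID")["NDVI_MEDIAN"].ffill()
corr_filled = df["CH4_d"].corr(df["NDVI_FILLED"])

print(n_used, corr_flight_days, corr_filled)
```

Option B repeats each weekly NDVI value on the following gas-sampling date, so the filled rows are not independent measurements; whether that is acceptable depends on how quickly the canopy changes between flights.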

CodePudding user response:

I don't know if this will be helpful for you, but you could use the missingno module in Python to visualize the missing values in your table.
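As a rough sketch of that suggestion, the snippet below first summarizes the missingness with plain pandas and then wraps the `missingno.matrix` call in a small helper. The helper name `plot_missing` is hypothetical, the data is a reduced copy of the example table, and missingno has to be installed separately.

```python
import pandas as pd

# Reduced copy of the example table: one point (C1) across the four dates.
df = pd.DataFrame({
    "DATE": ["2022-05-13", "2022-05-15", "2022-05-20", "2022-05-22"],
    "CH4_d": [2.561, 1.300, 2.545, 1.300],
    "NDVI_MEDIAN": [0.852, None, 0.852, None],
    "NDRE_MEDIAN": [0.449, None, 0.449, None],
})

# Share of NA per column: gas data are complete, drone data are half missing.
na_share = df.isna().mean()

# Dates on which every NDVI value is missing, i.e. days without a drone flight.
no_flight = df.groupby("DATE")["NDVI_MEDIAN"].apply(lambda s: s.isna().all())

def plot_missing(frame):
    """Draw the missingno nullity matrix (hypothetical helper; needs missingno)."""
    import missingno as msno  # imported lazily so the summary above runs without it
    return msno.matrix(frame)

print(na_share)
print(no_flight)
```

Calling `plot_missing(df)` followed by `matplotlib.pyplot.show()` should draw one column per variable, with gaps where values are NA. A regular banded pattern in the drone columns would be consistent with missingness caused by the sampling schedule and not by lost data.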
