I am relatively new to Stata. I have a Reddit dataset in cross-sectional format in which each row represents one Reddit post by a username. Some usernames post several times per day, while others post only once or twice in the entire dataset.
* Example generated by -dataex-. For more info, type help dataex
clear
input float id str36 username int date
6 "(crash )" 19013
end
format %td date
I am interested in running a Heckman selection model, so I am trying to convert the data into panel format. I created an ID variable per username as shown below:
egen id = group(username)
I then ran this to declare the data as a panel, following the guideline here:
xtset id date
I receive the following error: "repeated time values within panel". I am not sure how to solve this, because I believe repeated values are not a problem in my case: it is typical for social media users to post several times within the same day, which is my time unit in this dataset.
If I run the same code without the date variable, it works without any errors, but my understanding is that I need to use both variables for a panel format.
CodePudding user response:
What Stata says is correct, and you confirm it. xtset with an identifier and a time variable will only work if each (identifier, time) pair occurs at most once. The only workarounds are to combine or omit observations so that they match that required pattern, or to xtset in terms of the identifier alone.
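As a rough sketch of the identifier-only route, assuming the variables id and date from the question (dup is a hypothetical name introduced here for illustration), you could first check how many (id, date) pairs repeat and then declare only the panel variable:

```stata
* Sketch only: id and date come from the question; dup is a hypothetical name
duplicates report id date              // summarize how often (id, date) pairs repeat
duplicates tag id date, generate(dup)  // dup = number of other observations sharing the same pair
xtset id                               // declare the panel identifier without a time variable
```

With no time variable declared, commands that rely on time ordering, such as the lag and lead operators, are not available.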
You are clearly right about the data -- repeated posts from individual users on the same day are a fact -- but Stata's rules for panel data aren't negotiable. More positively put, what you are missing out on is applying models that don't make sense for your data structure anyway.
There isn't a Stata issue here, other than misunderstanding what Stata requires or wishing that it did not require it.
CodePudding user response:
You could use a timestamp to handle this. There is usually one available in session data. Just make sure to store it as a double, because %tc datetimes are counted in milliseconds and a float cannot hold them exactly:
. clear
. input byte id int date double ts
id date ts
1. 1 0 0
2. 1 0 1000
3. 1 0 2000
4. end
. format %td date
. format %tc ts
. list, clean noobs
id date ts
1 01jan1960 01jan1960 00:00:00
1 01jan1960 01jan1960 00:00:01
1 01jan1960 01jan1960 00:00:02
. xtset id ts
Panel variable: id (strongly balanced)
Time variable: ts, 01jan1960 00:00:00 to 01jan1960 00:00:02, but with gaps
Delta: .001 seconds
. xtset id date
repeated time values within panel
r(451);
Alternatively, collapse to user x date level if your analysis permits it.
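A minimal sketch of that collapse, assuming the variables id, username, and date from the question; one and n_posts are hypothetical names introduced here, and any other variables you need would have to be aggregated in the same collapse call:

```stata
* Sketch only: one and n_posts are hypothetical names
generate byte one = 1                               // helper: each row is one post
collapse (sum) n_posts = one, by(id username date)  // one row per user per day
xtset id date                                       // each (id, date) pair is now unique
```

Note that collapse replaces the data in memory, so save the post-level dataset first if you still need it.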